Contents
- Cache
- Typical Memory Reference Patterns
- Cache Algorithm
- Cache Design
- Cache Organization
- Replacement Policy
- Cache Miss
- Performance
- Average Memory Access Time (AMAT)
- CPU-Cache Interaction
- Write Policies
- Multilevel Caches
- Inclusion Policy
- Prefetching
- Hardware Prefetching
- Software Prefetching
- Compiler Optimizations
Cache
Typical Memory Reference Patterns
- Temporal locality: recently referenced items are likely to be referenced again soon
- Spatial locality: items near a recently referenced item are likely to be referenced soon
Cache Algorithm
- Match tag
- If match
- Return data from cache
- Else
- Read data from memory
- Update cache
- Return data
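A minimal sketch of this algorithm in C, for a direct-mapped cache with hypothetical geometry; the backing-memory array and refill helper are assumptions for illustration, not a real API:

#include <stdint.h>
#include <string.h>

#define NUM_LINES 256   /* hypothetical: 256 entries */
#define LINE_SIZE 64    /* hypothetical: 64-byte lines */

typedef struct {
    int      valid;
    int      dirty;              /* used by the write-back sketch under Write Policies */
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
} CacheLine;

static CacheLine cache[NUM_LINES];
static uint8_t memory[1u << 20];           /* toy backing memory (addresses assumed < 1 MB) */

static void read_line_from_memory(uint32_t addr, uint8_t *buf) {
    memcpy(buf, &memory[addr - addr % LINE_SIZE], LINE_SIZE);
}

uint8_t cache_read_byte(uint32_t addr) {
    uint32_t index  = (addr / LINE_SIZE) % NUM_LINES;  /* which entry */
    uint32_t tag    = addr / (LINE_SIZE * NUM_LINES);  /* rest of the address */
    uint32_t offset = addr % LINE_SIZE;                /* byte within the line */
    CacheLine *line = &cache[index];
    if (line->valid && line->tag == tag)               /* match tag */
        return line->data[offset];                     /* hit: return data from cache */
    read_line_from_memory(addr, line->data);           /* miss: read data from memory */
    line->valid = 1;                                   /* update cache */
    line->tag   = tag;
    return line->data[offset];                         /* return data */
}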
Cache Design
- Capacity
- Search efficiency (smaller better)
- Power consumption (smaller better)
- Line size
- Tag overhead (bigger better)
- Burst transfer (bigger better)
- Conflicts (smaller better)
- Bandwidth wastage (smaller better)
- Cache organization
- Replacement policy
- Read/write policies
Cache Organization
- Fully associative
- Cache line can be stored in any location
|tag|word|byte| - word field width set by line size
- Direct map
- Cache line can be stored in only 1 location
- Simple & fast (used in L1 cache)
|tag|index|word|byte| - index field width set by number of entries
- Set associative
- Cache line can be stored in any 1 of N locations
|tag|index|word|byte|
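A short sketch of how these fields can be pulled out of an address in C; the bit widths are hypothetical (64-byte lines, 128 sets):

#include <stdint.h>

#define OFFSET_BITS 6   /* hypothetical: 64-byte lines -> word + byte offset */
#define INDEX_BITS  7   /* hypothetical: 128 sets */

uint32_t addr_offset(uint32_t addr) {   /* word + byte within the line */
    return addr & ((1u << OFFSET_BITS) - 1);
}
uint32_t addr_index(uint32_t addr) {    /* which set (field absent if fully associative) */
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}
uint32_t addr_tag(uint32_t addr) {      /* remaining high bits, stored & compared */
    return addr >> (OFFSET_BITS + INDEX_BITS);
}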
Replacement Policy
- Random
- Least recently used (LRU): true implementation for small sets; pseudo-LRU for larger sets
- FIFO (round-robin): used in highly associative caches
- Not-most-recently-used (NMRU): FIFO with exceptions for most recently used cache lines
Pseudo-LRU
- Challenge for true LRU
- Storage
- Comparison
- Relaxation
- Find one of the older lines to evict (not necessarily the oldest)
- Implementation
- Set 1 bit for each line when accessed
- Evict any one of the lines with flag 0
- Periodically reset all to 0
- Advanced
- Prefer evicting clean (not dirty) lines when there's a tie
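A minimal sketch of this 1-bit scheme in C, for one set of an 8-way cache; resetting when all bits saturate is one common stand-in for the note's periodic reset:

#include <stdint.h>

#define WAYS 8

typedef struct {
    uint8_t ref[WAYS];    /* 1 bit per line, set on access */
    uint8_t dirty[WAYS];  /* for the clean-line tie-break */
} PlruSet;

void on_access(PlruSet *s, int way) {
    s->ref[way] = 1;                        /* mark line as recently used */
    for (int w = 0; w < WAYS; w++)
        if (!s->ref[w]) return;             /* some bit still 0: nothing more to do */
    for (int w = 0; w < WAYS; w++)          /* all bits saturated: reset, */
        s->ref[w] = (w == way);             /* keeping only the current line marked */
}

int pick_victim(const PlruSet *s) {
    for (int w = 0; w < WAYS; w++)          /* prefer a clean line with ref == 0 */
        if (!s->ref[w] && !s->dirty[w]) return w;
    for (int w = 0; w < WAYS; w++)          /* otherwise any line with ref == 0 */
        if (!s->ref[w]) return w;
    return 0;                               /* unreachable: the reset always leaves a 0 bit */
}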
Cache Miss
- Compulsory: first reference to line
- Capacity: occurs even under perfect replacement policy
- Conflict: due to line-placement strategy
Performance
- Larger cache size
- O: reduce capacity & conflict misses
- X: increase hit time
- Higher associativity
- O: reduce conflict misses
- X: increase hit time
- Larger line size
- O: reduce compulsory misses (exploits spatial locality)
- X: increase miss penalty & conflict misses (fewer lines fit, so capacity misses can rise too)
- Higher associativity affects small caches more
- As block size increases:
- Number of blocks held decreases -> more conflicts
- Miss penalty increases
- Tag overhead decreases
Average Memory Access Time (AMAT)
AMAT = Hit Time + Miss Rate * Miss Penalty
- Hit time (critical path time) needs to be kept within 1 cycle
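Worked example with hypothetical numbers: hit time = 1 cycle, miss rate = 5%, miss penalty = 20 cycles, giving AMAT = 1 + 0.05 * 20 = 2 cycles.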
CPU-Cache Interaction
Write Policies
- Cache hit
- Write through: write both cache & memory
- Higher traffic
- Simple design
- Can write to a write buffer and drain to DRAM in the background
- Write back: write cache; write memory when evicted
- Dirty bit
- Cache miss
- Write allocate: fetch line into cache & write cache
- Write back must be used with write allocate; write through may or may not
- No write allocate: write only to memory, bypassing the cache
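A minimal write-back + write-allocate store path in C, reusing the toy cache (types, arrays, refill helper) from the sketch under Cache Algorithm; the writeback helper is likewise a stub assumption:

static void writeback_line_to_memory(uint32_t tag, uint32_t index, const uint8_t *buf) {
    uint32_t base = (tag * NUM_LINES + index) * LINE_SIZE;  /* reconstruct line address */
    memcpy(&memory[base], buf, LINE_SIZE);
}

void cache_write_byte(uint32_t addr, uint8_t value) {
    uint32_t index  = (addr / LINE_SIZE) % NUM_LINES;
    uint32_t tag    = addr / (LINE_SIZE * NUM_LINES);
    uint32_t offset = addr % LINE_SIZE;
    CacheLine *line = &cache[index];
    if (!(line->valid && line->tag == tag)) {          /* write miss */
        if (line->valid && line->dirty)                /* flush dirty victim first */
            writeback_line_to_memory(line->tag, index, line->data);
        read_line_from_memory(addr, line->data);       /* write allocate: fetch the line */
        line->valid = 1;
        line->dirty = 0;
        line->tag   = tag;
    }
    line->data[offset] = value;  /* write cache only; memory is updated at eviction */
    line->dirty = 1;
}

With write-back in play, the read-miss path in the earlier sketch would likewise need to flush a dirty victim before refilling.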
Read Miss with Write Buffer
- Solutions
- Flush buffer before read
- Check write buffer; on address match, forward the latest write data
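A sketch of the second solution in C; the buffer layout (a small array kept in oldest-to-newest order) is an assumption:

#include <stdint.h>

#define WB_ENTRIES 8

typedef struct {
    int      valid;
    uint32_t addr;
    uint32_t data;
} WBEntry;

static WBEntry write_buffer[WB_ENTRIES];  /* index 0 oldest, higher = newer */

/* On a read miss, return 1 and forward the newest buffered store to addr. */
int write_buffer_lookup(uint32_t addr, uint32_t *out) {
    for (int i = WB_ENTRIES - 1; i >= 0; i--) {   /* scan newest first */
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *out = write_buffer[i].data;          /* latest write wins */
            return 1;
        }
    }
    return 0;                                     /* no match: read from memory */
}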
Multilevel Caches
Local miss rate = misses in cache / accesses to cache
Global miss rate = misses in cache / CPU memory accesses
Misses per instruction = misses in cache / number of instructions
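Worked example with hypothetical numbers: out of 1000 CPU memory accesses, L1 misses 40 times and L2 misses 20 times. L1 local (= global) miss rate = 40/1000 = 4%; L2 local miss rate = 20/40 = 50%; L2 global miss rate = 20/1000 = 2% (= 4% x 50%).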
- L2
- Use smaller L1 if L2 present
- Reduce hit time (but increase miss rate)
- Reduce miss penalty (backup in L2)
- Reduce average access energy
- Use write-through for L1
- At most 1 miss request per L1 access -> simpler pipeline control
- Write-back L2 absorbs traffic, doesn't go off-chip
- Simplifies coherence issues
- Simplifies error recovery (parity suffices in L1 since clean data also lives in L2)
- Usually higher capacity & higher associativity
- Not on the critical path, so slower access is acceptable
- Unified cache
- Works below L1: less chance that instruction & data accesses arrive at the same time
- Save H/W & energy
- Load balancing
- Cache pollution (instructions & data can evict each other)
Inclusion Policy
- Inclusive multilevel cache
- Inner cache lines also in outer cache
- External coherence snoop access need only check outer cache
- Total unique capacity determined by the outer cache (inner lines are duplicates)
- Exclusive multilevel cache
- Inner cache lines not necessarily in outer cache
- Swap lines between 2 caches on miss -> uses higher bandwidth
- Easier flush & cleanup
Prefetching
Reduces compulsory misses; can increase conflict misses.
- Issues
- Usefulness
- Timeliness
- Cache & bandwidth pollution
Hardware Prefetching
- Hardware instruction prefetching
- Fetch 2 consecutive lines on miss
- Place requested line in cache, another in stream buffer
- If miss in cache but hit in stream buffer, move stream buffer line into cache & prefetch next line
- Hardware data prefetching
- Prefetch-on-miss: prefetch line b+1 upon a miss on line b
- One-block-lookahead (OBL) scheme: prefetch line b+1 upon an access to line b
- Strided prefetch: prefetch line b+mN if accesses to lines b, b+N, b+2N, ... are observed
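A minimal stride-detection sketch in C; real prefetchers keep a table of these trackers indexed by load PC, so the single-slot version here is a simplifying assumption:

#include <stdint.h>

typedef struct {
    uint32_t last_addr;   /* last line address b that was seen */
    int32_t  stride;      /* candidate stride N */
} StrideTracker;

/* Called on each line access; returns the line to prefetch (b + N),
   or 0 when no stable stride has been confirmed yet. */
uint32_t on_line_access(StrideTracker *t, uint32_t line_addr) {
    int32_t new_stride = (int32_t)(line_addr - t->last_addr);
    int confirmed = (new_stride != 0 && new_stride == t->stride);
    t->stride    = new_stride;
    t->last_addr = line_addr;
    return confirmed ? line_addr + (uint32_t)new_stride : 0;
}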
Software Prefetching
/* P = prefetch distance in loop iterations, tuned to memory latency */
for (i = 0; i < N; i++) {
    prefetch(&a[i + P]);    /* e.g., __builtin_prefetch in GCC/Clang; a non-binding hint */
    prefetch(&b[i + P]);    /* prefetches may run past the arrays; hints must not fault */
    SUM = SUM + a[i] * b[i];
}
- Timing is the biggest issue
- Too close to data used -> too late
- Too early -> pollution
- Hard to estimate how long fetching the data takes
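- Rule of thumb: choose P ≈ memory latency / time per loop iteration (hypothetical numbers: 200-cycle latency, 10 cycles per iteration -> P ≈ 20)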
Compiler Optimizations
- Restructuring code affects data access sequence
- Group data accesses -> spatial locality
- Reorder data accesses -> temporal locality
- Prevent data from entering the cache
- Kill data that will never be used again
- Example
- Loop interchange -> spatial locality
- Loop fusion -> temporal locality
- Cache tiling (blocking) -> temporal locality
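Minimal sketches of two of these transformations in C; the array names and sizes are hypothetical:

#define N 1024
#define B 32                 /* tile edge, sized so a tile fits in cache */
static double x[N][N], y[N][N], z[N][N];

/* Loop interchange: j innermost walks row-major x along consecutive
   addresses (spatial locality); a j-outer/i-inner order would stride. */
void interchange(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2.0 * x[i][j];
}

/* Cache tiling (blocking): z += x * y computed on B x B tiles so each
   tile of y is reused while still cache-resident (temporal locality). */
void tiled_matmul(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += x[i][k] * y[k][j];
                    z[i][j] += r;
                }
}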