Contents
- Cache
- Typical Memory Reference Patterns
- Cache Algorithm
- Cache Design
- Cache Organization
- Replacement Policy
- Cache Miss
- Performance
- Average Memory Access Time (AMAT)
- CPU-Cache Interaction
- Write Policies
- Multilevel Caches
- Inclusion Policy
- Prefetching
- Hardware Prefetching
- Software Prefetching
- Compiler Optimizations
Cache
Typical Memory Reference Patterns
- Temporal locality: recently referenced items are likely to be referenced again soon
- Spatial locality: items near a recently referenced item are likely to be referenced soon
Cache Algorithm
- Match tag
- If match
- Return data from cache
- Else
- Read data from memory
- Update cache
- Return data
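A minimal sketch of this algorithm in C, for a direct-mapped cache with hypothetical geometry; the backing-memory array and refill helper are assumptions for illustration, not a real API:

#include <stdint.h>
#include <string.h>

#define NUM_LINES 256   /* hypothetical: 256 entries */
#define LINE_SIZE 64    /* hypothetical: 64-byte lines */

typedef struct {
    int      valid;
    int      dirty;              /* used by the write-back sketch under Write Policies */
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
} CacheLine;

static CacheLine cache[NUM_LINES];
static uint8_t memory[1u << 20];           /* toy backing memory (addresses assumed < 1 MB) */

static void read_line_from_memory(uint32_t addr, uint8_t *buf) {
    memcpy(buf, &memory[addr - addr % LINE_SIZE], LINE_SIZE);
}

uint8_t cache_read_byte(uint32_t addr) {
    uint32_t index  = (addr / LINE_SIZE) % NUM_LINES;  /* which entry */
    uint32_t tag    = addr / (LINE_SIZE * NUM_LINES);  /* rest of the address */
    uint32_t offset = addr % LINE_SIZE;                /* byte within the line */
    CacheLine *line = &cache[index];
    if (line->valid && line->tag == tag)               /* match tag */
        return line->data[offset];                     /* hit: return data from cache */
    read_line_from_memory(addr, line->data);           /* miss: read data from memory */
    line->valid = 1;                                   /* update cache */
    line->tag   = tag;
    return line->data[offset];                         /* return data */
}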
Cache Design
- Capacity
- Search efficiency (smaller better)
- Power consumption (smaller better)
- Line size
- Tag overhead (bigger better)
- Burst transfer (bigger better)
- Conflicts (smaller better)
- Bandwidth wastage (smaller better)
- Cache organization
- Replacement policy
- Read/write policies
Cache Organization
- Fully associative
- Cache line can be stored in any location
|tag|word|byte| - word field width set by line size
- Direct map
- Cache line can be stored in only 1 location
- Simple & fast (used in L1 cache)
|tag|index|word|byte| - index field width set by number of entries
- Set associative
- Cache line can be stored in any 1 of N locations
|tag|index|word|byte|
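A short sketch of how these fields can be pulled out of an address in C; the bit widths are hypothetical (64-byte lines, 128 sets):

#include <stdint.h>

#define OFFSET_BITS 6   /* hypothetical: 64-byte lines -> word + byte offset */
#define INDEX_BITS  7   /* hypothetical: 128 sets */

uint32_t addr_offset(uint32_t addr) {   /* word + byte within the line */
    return addr & ((1u << OFFSET_BITS) - 1);
}
uint32_t addr_index(uint32_t addr) {    /* which set (field absent if fully associative) */
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}
uint32_t addr_tag(uint32_t addr) {      /* remaining high bits, stored & compared */
    return addr >> (OFFSET_BITS + INDEX_BITS);
}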
Replacement Policy
- Random
- Least recently used (LRU): true implementation for small sets; pseudo-LRU for larger sets
- FIFO (round-robin): used in highly associative caches
- Not-most-recently-used (NMRU): FIFO with exceptions for most recently used cache lines
Pseudo-LRU
- Challenge for true LRU
- Storage
- Comparison
- Relaxation
- Find one of the older lines to evict (not necessarily the oldest)
- Implementation
- Set 1 bit for each line when accessed
- Evict any one of the lines with flag 0
- Periodically reset all to 0
- Advanced
- Prefer evicting clean (not dirty) lines when there's a tie
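A minimal sketch of this 1-bit scheme in C, for one set of an 8-way cache; resetting when all bits saturate is one common stand-in for the note's periodic reset:

#include <stdint.h>

#define WAYS 8

typedef struct {
    uint8_t ref[WAYS];    /* 1 bit per line, set on access */
    uint8_t dirty[WAYS];  /* for the clean-line tie-break */
} PlruSet;

void on_access(PlruSet *s, int way) {
    s->ref[way] = 1;                        /* mark line as recently used */
    for (int w = 0; w < WAYS; w++)
        if (!s->ref[w]) return;             /* some bit still 0: nothing more to do */
    for (int w = 0; w < WAYS; w++)          /* all bits saturated: reset, */
        s->ref[w] = (w == way);             /* keeping only the current line marked */
}

int pick_victim(const PlruSet *s) {
    for (int w = 0; w < WAYS; w++)          /* prefer a clean line with ref == 0 */
        if (!s->ref[w] && !s->dirty[w]) return w;
    for (int w = 0; w < WAYS; w++)          /* otherwise any line with ref == 0 */
        if (!s->ref[w]) return w;
    return 0;                               /* unreachable: the reset always leaves a 0 bit */
}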
Cache Miss
- Compulsory: first reference to line
- Capacity: occurs even under perfect replacement policy
- Conflict: due to line-placement strategy
Performance
- Larger cache size
- O: reduce capacity & conflict misses
- X: increase hit time
- Higher associativity
- O: reduce conflict misses
- X: increase hit time
- Larger line size
- O: reduce compulsory misses (exploits spatial locality)
- X: increase miss penalty & conflict misses (fewer lines fit, so capacity misses can rise too)
- Higher associativity affects small caches more
- As block size increases:
- Number of blocks held decreases -> more conflicts
- Miss penalty increases
- Tag overhead decreases
Average Memory Access Time (AMAT)
AMAT = Hit Time + Miss Rate * Miss Penalty
- Hit time (critical path time) needs to be kept within 1 cycle
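Worked example with hypothetical numbers: hit time = 1 cycle, miss rate = 5%, miss penalty = 20 cycles, giving AMAT = 1 + 0.05 * 20 = 2 cycles.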
CPU-Cache Interaction
Write Policies
- Cache hit
- Write through: write both cache & memory
- Higher traffic
- Simple design
- Can write to a write buffer and drain to DRAM in the background
- Write back: write cache; write memory when evicted
- Dirty bit
- Cache miss
- Write allocate: fetch line into cache & write cache
- Write back must be used with write allocate; write through may or may not
- No write allocate: write only to memory, bypassing the cache
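A minimal write-back + write-allocate store path in C, reusing the toy cache (types, arrays, refill helper) from the sketch under Cache Algorithm; the writeback helper is likewise a stub assumption:

static void writeback_line_to_memory(uint32_t tag, uint32_t index, const uint8_t *buf) {
    uint32_t base = (tag * NUM_LINES + index) * LINE_SIZE;  /* reconstruct line address */
    memcpy(&memory[base], buf, LINE_SIZE);
}

void cache_write_byte(uint32_t addr, uint8_t value) {
    uint32_t index  = (addr / LINE_SIZE) % NUM_LINES;
    uint32_t tag    = addr / (LINE_SIZE * NUM_LINES);
    uint32_t offset = addr % LINE_SIZE;
    CacheLine *line = &cache[index];
    if (!(line->valid && line->tag == tag)) {          /* write miss */
        if (line->valid && line->dirty)                /* flush dirty victim first */
            writeback_line_to_memory(line->tag, index, line->data);
        read_line_from_memory(addr, line->data);       /* write allocate: fetch the line */
        line->valid = 1;
        line->dirty = 0;
        line->tag   = tag;
    }
    line->data[offset] = value;  /* write cache only; memory is updated at eviction */
    line->dirty = 1;
}

With write-back in play, the read-miss path in the earlier sketch would likewise need to flush a dirty victim before refilling.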
Read Miss with Write Buffer
- Solutions
- Flush buffer before read
- Check write buffer; on address match, forward the latest write data
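A sketch of the second solution in C; the buffer layout (a small array kept in oldest-to-newest order) is an assumption:

#include <stdint.h>

#define WB_ENTRIES 8

typedef struct {
    int      valid;
    uint32_t addr;
    uint32_t data;
} WBEntry;

static WBEntry write_buffer[WB_ENTRIES];  /* index 0 oldest, higher = newer */

/* On a read miss, return 1 and forward the newest buffered store to addr. */
int write_buffer_lookup(uint32_t addr, uint32_t *out) {
    for (int i = WB_ENTRIES - 1; i >= 0; i--) {   /* scan newest first */
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *out = write_buffer[i].data;          /* latest write wins */
            return 1;
        }
    }
    return 0;                                     /* no match: read from memory */
}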
Multilevel Caches
Local miss rate = misses in cache / accesses to cache
Global miss rate = misses in cache / CPU memory accesses
Misses per instruction = misses in cache / number of instructions
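Worked example with hypothetical numbers: out of 1000 CPU memory accesses, L1 misses 40 times and L2 misses 20 times. L1 local (= global) miss rate = 40/1000 = 4%; L2 local miss rate = 20/40 = 50%; L2 global miss rate = 20/1000 = 2% (= 4% x 50%).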
- L2
- Use smaller L1 if L2 present
- Reduce hit time (but increase miss rate)
- Reduce miss penalty (backup in L2)
- Reduce average access energy
- Use write-through for L1
- At most 1 miss request per L1 access -> simpler pipeline control
- Write-back L2 absorbs traffic, doesn't go off-chip
- Simplifies coherence issues
- Simplifies error recovery (parity suffices in L1 since clean data also lives in L2)
- Usually higher capacity & higher associativity
- Not on the critical path, so slower access is acceptable
- Unified cache
- Works below L1: less chance that instruction & data accesses arrive at the same time
- Save H/W & energy
- Load balancing
- Cache pollution (instructions & data can evict each other)
Inclusion Policy
- Inclusive multilevel cache
- Inner cache lines also in outer cache
- External coherence snoop access need only check outer cache
- Total unique capacity determined by the outer cache (inner lines are duplicates)
- Exclusive multilevel cache
- Inner cache lines not necessarily in outer cache
- Swap lines between 2 caches on miss -> uses higher bandwidth
- Easier flush & cleanup
Prefetching
Reduces compulsory misses; can increase conflict misses.
- Issues
- Usefulness
- Timeliness
- Cache & bandwidth pollution
Hardware Prefetching
- Hardware instruction prefetching
- Fetch 2 consecutive lines on miss
- Place requested line in cache, another in stream buffer
- If miss in cache but hit in stream buffer, move stream buffer line into cache & prefetch next line
- Hardware data prefetching
- Prefetch-on-miss: prefetch line b+1 upon a miss on line b
- One-block-lookahead (OBL) scheme: prefetch line b+1 upon an access to line b
- Strided prefetch: prefetch line b+mN if accesses to lines b, b+N, b+2N, ... are observed
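A minimal stride-detection sketch in C; real prefetchers keep a table of these trackers indexed by load PC, so the single-slot version here is a simplifying assumption:

#include <stdint.h>

typedef struct {
    uint32_t last_addr;   /* last line address b that was seen */
    int32_t  stride;      /* candidate stride N */
} StrideTracker;

/* Called on each line access; returns the line to prefetch (b + N),
   or 0 when no stable stride has been confirmed yet. */
uint32_t on_line_access(StrideTracker *t, uint32_t line_addr) {
    int32_t new_stride = (int32_t)(line_addr - t->last_addr);
    int confirmed = (new_stride != 0 && new_stride == t->stride);
    t->stride    = new_stride;
    t->last_addr = line_addr;
    return confirmed ? line_addr + (uint32_t)new_stride : 0;
}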
Software Prefetching
/* P = prefetch distance in loop iterations, tuned to memory latency */
for (i = 0; i < N; i++) {
    prefetch(&a[i + P]);    /* e.g., __builtin_prefetch in GCC/Clang; a non-binding hint */
    prefetch(&b[i + P]);    /* prefetches may run past the arrays; hints must not fault */
    SUM = SUM + a[i] * b[i];
}
- Timing is the biggest issue
- Too close to data used -> too late
- Too early -> pollution
- Hard to estimate how long fetching the data takes
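- Rule of thumb: choose P ≈ memory latency / time per loop iteration (hypothetical numbers: 200-cycle latency, 10 cycles per iteration -> P ≈ 20)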
Compiler Optimizations
- Restructuring code affects data access sequence
- Group data accesses -> spatial locality
- Reorder data accesses -> temporal locality
- Prevent data from entering the cache
- Kill data that will never be used again
- Example
- Loop interchange -> spatial locality
- Loop fusion -> temporal locality
- Cache tiling (blocking) -> temporal locality
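Minimal sketches of two of these transformations in C; the array names and sizes are hypothetical:

#define N 1024
#define B 32                 /* tile edge, sized so a tile fits in cache */
static double x[N][N], y[N][N], z[N][N];

/* Loop interchange: j innermost walks row-major x along consecutive
   addresses (spatial locality); a j-outer/i-inner order would stride. */
void interchange(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2.0 * x[i][j];
}

/* Cache tiling (blocking): z += x * y computed on B x B tiles so each
   tile of y is reused while still cache-resident (temporal locality). */
void tiled_matmul(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += x[i][k] * y[k][j];
                    z[i][j] += r;
                }
}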