Content

  1. Cache
    1. Typical Memory Reference Patterns
    2. Cache Algorithm
    3. Cache Design
    4. Cache Organization
    5. Replacement Policy
    6. Cache Miss
      1. Performance
      2. Average Memory Access Time (AMAT)
      3. CPU-Cache Interaction
      4. Write Policies
    7. Multilevel Caches
      1. Inclusion Policy
    8. Prefetching
      1. Hardware Prefetching
      2. Software Prefetching
      3. Compiler Optimizations

Cache

Typical Memory Reference Patterns

  • Temporal locality: recently referenced data is likely to be referenced again soon
  • Spatial locality: data near recently referenced addresses is likely to be referenced soon

Cache Algorithm

  • Match tag
    • If match
      • Return data from cache
    • Else
      • Read data from memory
      • Update cache
      • Return data
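
A minimal C sketch of this lookup flow for a direct-mapped cache; the CacheLine struct, the sizes, and the mem_read_line helper are illustrative assumptions, not from the notes.

#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES  256        /* illustrative direct-mapped cache with 256 lines */
#define LINE_BYTES 64

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
} CacheLine;

static CacheLine cache[NUM_LINES];

extern void mem_read_line(uint32_t addr, uint8_t *dst);   /* backing memory access */

uint8_t cache_read_byte(uint32_t addr)
{
    uint32_t offset = addr % LINE_BYTES;
    uint32_t index  = (addr / LINE_BYTES) % NUM_LINES;
    uint32_t tag    = addr / (LINE_BYTES * NUM_LINES);

    CacheLine *line = &cache[index];
    if (line->valid && line->tag == tag) {
        return line->data[offset];                  /* match: return data from cache */
    }
    mem_read_line(addr - offset, line->data);       /* miss: read line from memory */
    line->valid = true;                             /* update cache */
    line->tag   = tag;
    return line->data[offset];                      /* return data */
}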

Cache Design

  • Capacity
    • Search efficiency (smaller caches are faster to search)
    • Power consumption (smaller caches burn less power)
  • Line size
    • Tag overhead (larger lines -> fewer lines -> fewer tags)
    • Burst transfer (larger lines amortize burst transfers better)
    • Conflicts (smaller lines -> more lines -> fewer conflict misses)
    • Bandwidth wastage (smaller lines fetch less unused data)
  • Cache organization
  • Replacement policy
  • Read/write policies

Cache Organization

  • Fully associative
    • A cache line can be stored in any location
    • |tag|word|byte| - word field width set by the line size (selects a word within the line)
  • Direct-mapped
    • A cache line can be stored in only 1 location
    • Simple & fast (used in L1 caches)
    • |tag|index|word|byte| - index field width set by the number of entries (selects the entry)
  • Set associative
    • A cache line can be stored in any 1 of N locations (N-way)
    • |tag|index|word|byte|
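
Worked example of the field split, assuming a 32 KB, 4-way set-associative cache with 64-byte lines and 32-bit addresses: 512 lines / 4 ways = 128 sets, so 6 offset bits, 7 index bits, and 19 tag bits. The C snippet below just extracts the fields.

#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64         /* 6 offset bits (word + byte fields) */
#define NUM_SETS   128        /* 32 KB / 64 B / 4 ways -> 7 index bits */

int main(void)
{
    uint32_t addr   = 0x12345678;
    uint32_t offset = addr & (LINE_BYTES - 1);               /* byte within the line */
    uint32_t index  = (addr / LINE_BYTES) & (NUM_SETS - 1);  /* which set */
    uint32_t tag    = addr / (LINE_BYTES * NUM_SETS);        /* remaining high bits */

    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}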

Replacement Policy

  • Random
  • Least recently used (LRU): true implementation for small sets; pseudo-LRU for larger sets
  • FIFO (round-robin): used in highly associative caches
  • Not-most-recently-used (NMRU): FIFO with exceptions for most recently used cache lines

PseudoLRU

  • Challenges for true LRU
    • Storage (tracking the full ordering of lines in a set)
    • Comparison logic
  • Relaxation
    • Evict one of the older lines, not necessarily the oldest
  • Implementation
    • Set a 1-bit flag for each line when it is accessed
    • Evict any one of the lines with flag 0
    • Periodically reset all flags to 0
  • Advanced
    • Prefer evicting non-dirty lines when there's a tie
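
A small C sketch of this 1-bit scheme for a single 4-way set; the set size, array names, and the dirty-bit tie-break hook are illustrative assumptions.

#include <stdbool.h>

#define WAYS 4

static bool referenced[WAYS];   /* 1 bit per line, set on access */
static bool dirty[WAYS];        /* used for the "advanced" tie-break */

void on_access(int way)
{
    referenced[way] = true;     /* mark the line as recently used */
}

void reset_bits(void)           /* called periodically */
{
    for (int w = 0; w < WAYS; w++)
        referenced[w] = false;
}

int pick_victim(void)
{
    /* prefer a line that is neither referenced nor dirty */
    for (int w = 0; w < WAYS; w++)
        if (!referenced[w] && !dirty[w])
            return w;
    /* otherwise any line with flag 0 */
    for (int w = 0; w < WAYS; w++)
        if (!referenced[w])
            return w;
    return 0;                   /* all flags set: fall back to way 0 */
}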

Cache Miss

  • Compulsory: first reference to line
  • Capacity: occurs even under perfect replacement policy
  • Conflict: due to line-placement strategy

Performance
  • Larger cache size
    • O: reduce capacity & conflict misses
    • X: increase hit time
  • Higher associativity
    • O: reduce conflict misses
    • X: increase hit time
  • Larger line size
    • O: reduce compulsory & capacity misses
    • X: increase miss penalty & conflict misses

  • Higher associativity affects small caches more

  • As block size increases:
    • Fewer blocks fit in the cache -> more conflicts
    • Miss penalty increases
    • Tag overhead decreases

Average Memory Access Time (AMAT)

AMAT = Hit Time + Miss Rate * Miss Penalty

  • Hit time (critical path time) needs to be kept within 1 cycle
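
Example with assumed numbers: hit time 1 cycle, miss rate 5%, miss penalty 20 cycles:

AMAT = 1 + 0.05 * 20 = 2 cycles
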
CPU-Cache Interaction

Write Policies
  • Cache hit
    • Write through: write both cache & memory
      • Higher memory traffic
      • Simple design
      • Can write to a write buffer and drain to DRAM in the background
    • Write back: write cache only; write memory when the line is evicted
      • Requires a dirty bit per line
  • Cache miss
    • Write allocate: fetch the line into the cache & write the cache
      • Write back is usually paired with write allocate; write through may or may not be
    • No write allocate: write memory only, without allocating a cache line
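
A minimal C sketch contrasting the two hit policies, assuming an illustrative line structure and a mem_write helper for the backing memory.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     dirty;
    uint32_t data;
} Line;

extern void mem_write(uint32_t addr, uint32_t value);   /* backing memory */

/* write through: update cache and memory on every store */
void store_write_through(Line *line, uint32_t addr, uint32_t value)
{
    line->data = value;
    mem_write(addr, value);           /* could go through a write buffer instead */
}

/* write back: update cache only, remember the line is dirty */
void store_write_back(Line *line, uint32_t addr, uint32_t value)
{
    line->data  = value;
    line->dirty = true;               /* memory updated later, on eviction */
    (void)addr;
}

/* on eviction of a write-back line */
void evict(Line *line, uint32_t addr)
{
    if (line->dirty)
        mem_write(addr, line->data);  /* flush dirty data to memory */
    line->dirty = false;
}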

Read Miss with Write Buffer

  • Problem: a read miss may need data that is still waiting in the write buffer
  • Solutions
    • Flush the write buffer before the read
    • Check the write buffer; if the address matches, forward the latest write data
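
A sketch of the second solution, assuming a small FIFO write buffer whose structure and size are made up for illustration.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 8

typedef struct {
    bool     valid;
    uint32_t addr;
    uint32_t data;
} WBEntry;

static WBEntry write_buffer[WB_ENTRIES];   /* entry 0 oldest, higher index newer */

/* On a read miss, scan the write buffer from newest to oldest;
 * if the address matches, forward the latest buffered data. */
bool forward_from_write_buffer(uint32_t addr, uint32_t *data_out)
{
    for (int i = WB_ENTRIES - 1; i >= 0; i--) {
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *data_out = write_buffer[i].data;
            return true;
        }
    }
    return false;   /* no match: the read must go to memory */
}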

Multilevel Caches

Local miss rate = misses in cache / accesses to cache
Global miss rate = misses in cache / CPU memory accesses
Misses per instruction = misses in cache / number of instructions
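
Worked example with assumed numbers: if L1's local miss rate is 10% and L2's local miss rate is 25%, only 10% of accesses reach L2, so L2's global miss rate is 0.10 * 0.25 = 2.5% of all CPU memory accesses. For two levels the AMAT formula above expands to:

AMAT = HitTime(L1) + MissRate(L1) * (HitTime(L2) + MissRate(L2) * MissPenalty(L2))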

  • L2
    • Use smaller L1 if L2 present
      • Reduces hit time (but increases miss rate)
      • Reduce miss penalty (backup in L2)
      • Reduce average access energy
    • Use write-through for L1
      • At most 1 miss request per L1 access -> simpler pipeline control
      • Write-back L2 absorbs write traffic, doesn't go off-chip
      • Simplifies coherence issues
      • Simplifies error recovery (parity in L1 suffices; a clean copy can be refetched from L2)
    • Usually higher capacity & higher associativity
      • Not on the critical path, so slower access is acceptable
  • Unified cache
    • Less chance for both I-cache & D-cache to access it at the same time
    • Saves H/W & energy
    • Load balancing between instruction & data working sets
    • Drawback: cache pollution (one stream can evict the other's lines)

Inclusion Policy
  • Inclusive multilevel cache
    • Inner cache lines also in outer cache
    • External coherence snoops need only check the outer cache
    • Effective capacity determined by the outer, lower-level cache
  • Exclusive multilevel cache
    • Inner cache lines not necessarily in outer cache
    • Swap lines between 2 caches on miss -> uses higher bandwidth
    • Easier flush & cleanup

Prefetching

Reduces compulsory misses; increases conflict misses.

  • Issues
    • Usefulness
    • Timeliness
    • Cache & bandwidth pollution

Hardware Prefetching
  • Hardware instruction prefetching
    • Fetch 2 consecutive lines on miss
    • Place requested line in cache, another in stream buffer
    • If miss in cache but hit in stream buffer, move stream buffer line into cache & prefetch next line
  • Hardware data prefetching
    • Prefetch-on-miss: prefetch line b+1 only upon a miss on line b
    • One-block-lookahead (OBL): prefetch line b+1 whenever line b is accessed (hit or miss)
    • Strided prefetch: prefetch b+mN if a pattern b, b+N, b+2N, ... is observed
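
A simplified C sketch of strided-prefetch detection for a single access stream; the confirmation threshold and the issue_prefetch hook are illustrative assumptions.

#include <stdint.h>

extern void issue_prefetch(uint64_t addr);   /* hand the address to the memory system */

static uint64_t last_addr;
static int64_t  last_stride;
static int      confirmations;               /* how many times the stride repeated */

void observe_access(uint64_t addr)
{
    int64_t stride = (int64_t)(addr - last_addr);

    if (stride != 0 && stride == last_stride)
        confirmations++;                     /* pattern b, b+N, b+2N, ... holds */
    else
        confirmations = 0;                   /* pattern broken, start over */

    if (confirmations >= 2)
        issue_prefetch(addr + stride);       /* prefetch the next expected address */

    last_addr   = addr;
    last_stride = stride;
}
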
Software Prefetching
/* P is the prefetch distance (iterations ahead); prefetch() stands for a
   hint such as GCC/Clang's __builtin_prefetch */
for (i = 0; i < N; i++) {
    prefetch( &a[i + P] );   /* request a[i+P] before it is needed */
    prefetch( &b[i + P] );   /* request b[i+P] before it is needed */
    SUM = SUM + a[i] * b[i];
}
  • Timing is the biggest issue
    • Prefetch too close to the use of the data -> too late
    • Too early -> pollution
    • Hard to estimate how long the fetch will take

Compiler Optimizations

  • Restructuring code affects data access sequence
    • Group data accesses -> spatial locality
    • Reorder data accesses -> temporal locality
  • Prevent data from entering the cache
  • Kill data that will never be used again
  • Example
    • Loop interchange -> spatial locality
    • Loop fusion -> temporal locality
    • Cache tiling (blocking) -> temporal locality
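
A minimal C illustration of loop interchange on a row-major array (the size N is an assumed example): the first version walks column by column with a large stride, the interchanged version walks row by row and gets spatial locality.

#define N 1024
double a[N][N];

/* Before: column-major traversal of a row-major array -> poor spatial locality */
void sum_columns_first(double *total)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            *total += a[i][j];        /* stride of N doubles between accesses */
}

/* After loop interchange: row-major traversal -> consecutive addresses */
void sum_rows_first(double *total)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            *total += a[i][j];        /* stride of 1 double between accesses */
}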
