Content
- Big Picture of Architecture & Memory Access
- Virtual Memory
- Memory Access
- Multithreading Basics
- Parallelism within Processor
- Connection between Memory Access, Memory Allocation and Cache Efficiency
- Cache Coherence
- Cache Coherence Overview
- CC Protocols
- Locality in Parallel Programming
- True & False Sharing
Big Picture of Architecture & Memory Access
Virtual Memory
- Protection: each process has its own private memory space
- Sharing: processes can share physical memory frames (see the sketch below)
- Hide fragmentation: a process can run even without enough total or contiguous physical memory
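A minimal sketch of the sharing point, assuming POSIX shared memory (the object name "/demo" is made up; link with -lrt on older glibc): two processes that map the same object see the same physical frames at possibly different virtual addresses.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* both processes would map "/demo" into their own virtual address space */
    int fd = shm_open("/demo", O_CREAT | O_RDWR, 0600);   /* made-up name */
    if (fd < 0) return 1;
    ftruncate(fd, 4096);                                  /* one page */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;
    strcpy(p, "same physical frame, different virtual addresses");
    printf("%s\n", p);
    munmap(p, 4096);
    close(fd);
    shm_unlink("/demo");
    return 0;
}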
Memory Access
- Processor sends VA to MMU
- MMU fetches PTE from PT in memory
- If the PTE is absent/invalid, the MMU raises a page fault exception
- Handler identifies a victim page (if dirty, pages it out to disk)
- Handler pages in the new page; updates the PTE
- Handler returns to the original process; the faulting instruction restarts
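A toy user-space simulation of these steps, purely for illustration (2 frames, 4 pages, FIFO victim selection, and the reference string are all made up):
#include <stdio.h>
#include <string.h>

#define NFRAMES 2
#define NPAGES  4

int main(void) {
    int frame_page[NFRAMES];   /* which page each frame holds (-1 = free) */
    int frame_dirty[NFRAMES];
    int pte[NPAGES];           /* page -> frame, -1 = not resident        */
    int hand = 0;              /* FIFO victim pointer                     */
    memset(frame_page, -1, sizeof frame_page);
    memset(frame_dirty, 0, sizeof frame_dirty);
    memset(pte, -1, sizeof pte);

    int refs[] = {0, 1, 0, 2, 3, 0};             /* made-up reference string */
    for (int i = 0; i < (int)(sizeof refs / sizeof refs[0]); i++) {
        int p = refs[i];
        if (pte[p] == -1) {                      /* PTE absent: page fault   */
            int v = hand;                        /* identify victim frame    */
            if (frame_page[v] != -1) {
                if (frame_dirty[v])
                    printf("  page out dirty page %d\n", frame_page[v]);
                pte[frame_page[v]] = -1;         /* unmap the victim         */
            }
            printf("  page in page %d -> frame %d\n", p, v);
            frame_page[v] = p;
            frame_dirty[v] = 0;
            pte[p] = v;                          /* update the PTE           */
            hand = (hand + 1) % NFRAMES;
        }
        printf("access page %d via frame %d\n", p, pte[p]); /* restarted access */
    }
    return 0;
}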
Multithreading Basics
Speedup vs. thread count: near-linear increase -> diminishing returns -> degradation once threads oversubscribe cores or saturate shared resources
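A sketch of how to observe this curve, assuming OpenMP (compile with -fopenmp; the array size and loop body are made up, and the exact curve depends on core count and memory bandwidth):
#include <omp.h>
#include <stdio.h>

int main(void) {
    enum { N = 1 << 22 };
    static double a[1 << 22];
    for (int t = 1; t <= 32; t *= 2) {
        double t0 = omp_get_wtime();
        #pragma omp parallel for num_threads(t)  /* fixed work, varying threads */
        for (int i = 0; i < N; i++)
            a[i] = a[i] * 1.000001 + 1.0;
        printf("%2d threads: %.4f s\n", t, omp_get_wtime() - t0);
    }
    return 0;
}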
Parallelism within Processor
- Example
x = x * (data[i] * data[i+1])
- Limitations
- Number of functional units
- Latency of operation causing dependency
- Number of registers to hold temporaries
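These limits show up directly when unrolling with multiple independent accumulators. A sketch (the function name is made up; assumes n is even):
/* Two independent dependency chains: the multiplies of acc0 and acc1
   can overlap in the pipeline instead of serializing on one chain */
double product2(const double *data, int n) {
    double acc0 = 1.0, acc1 = 1.0;   /* extra registers hold the temporaries */
    for (int i = 0; i < n; i += 2) {
        acc0 *= data[i];             /* chain 0 */
        acc1 *= data[i + 1];         /* chain 1: independent of chain 0 */
    }
    return acc0 * acc1;              /* combine once at the end */
}
More chains help until the number of functional units or registers runs out, which is exactly the limitation list above.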
Connection between Memory Access, Memory Allocation and Cache Efficiency
- Dynamic memory allocation
- Space utilization
- Time complexity on malloc/free
- Memory-wall: growing disparity of CPU & RAM speeds
- Caching effectiveness important
- Padding is not cache-friendly: internal fragmentation wastes bytes in every cache line, so avoid it in hot data structures (see the sketch below)
- False sharing: occurs when >= 2 processors access different data in the same cache line, and at least one of them writes
- Memory arenas: each thread allocates only from its own arena (a contiguous block of memory), so one thread's data stays together
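For the padding point, a small sketch: reordering members shrinks compiler-inserted padding, so more useful data fits per cache line (sizes assume a typical LP64 ABI):
#include <stdio.h>

/* 1 + 7(pad) + 8 + 1 + 7(pad) = 24 bytes on a typical LP64 ABI */
struct padded    { char a; double x; char b; };
/* 8 + 1 + 1 + 6(pad) = 16 bytes: same fields, reordered */
struct reordered { double x; char a; char b; };

int main(void) {
    printf("padded: %zu, reordered: %zu\n",
           sizeof(struct padded), sizeof(struct reordered));
    return 0;
}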
Cache Coherence
Cache Coherence Overview
- Shared memory machines
- Small # of processors: shared memory with coherent caches (SMP)
- Large # of processors: distributed shared memory with coherent caches (CC-NUMA)
- Caches in multiprocessors
- Coherence problem: the same line appears in >= 2 caches; one processor writes while others may read stale copies
CC Protocols
MSI Protocol
- Notation: observed event / action taken ("-" = no bus transaction)
- State transitions
  - I --PrRd / BusRd--> S      (read miss: fetch a shared copy)
  - I --PrWr / BusRdX--> M     (write miss: fetch an exclusive copy)
  - S --PrRd / - --> S         (read hit: no bus traffic)
  - S --PrWr / BusRdX--> M     (upgrade: invalidate other copies)
  - S --BusRd / - --> S        (another cache reads: no action)
  - S --BusRdX / - --> I       (another cache writes: invalidate)
  - M --BusRd / Flush--> S     (supply the line, downgrade)
  - M --BusRdX / Flush--> I    (supply the line, invalidate)
- Assumption: bus based architecture
- Bus is reliable, ordered broadcast bus (snooping bus)
- States of a cache line
- Invalid
- Shared: one of many cached copies
- Modified: sole valid copy
- Processor events
- PrRd
- PrWr
- Bus Transactions
- BusRd: simply asks for copy
- BusRdX: asks for copy to modify
- Flush: updates memory
- Actions
- Update state, perform bus transaction, flush value onto bus
- Problem
- Reading then modifying data costs two bus transactions (BusRd for I->S, then BusRdX)
- A write to a Shared line always generates an invalidation request (BusRdX), even if no other cache actually holds a copy
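Concretely, a sketch of the traffic for one read-modify-write (the variable and its initial state are assumed):
int x;       /* assume the line holding x starts in state I in this cache */
x = x + 1;   /* load : PrRd -> BusRd,  I -> S   (bus transaction 1)        */
             /* store: PrWr -> BusRdX, S -> M   (bus transaction 2)        */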
MESI Protocol
- One more state
- Exclusive/Exclusive-clean: only this cache has a copy, and it matches memory (not modified)
- A write to an Exclusive line does not generate an invalidation request (silent E -> M transition)
- Typically built on write-back caches
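Under MESI the same read-modify-write on unshared data is cheaper (same assumed setup as the MSI sketch):
int y;       /* assume no other cache holds the line */
y = y + 1;   /* load : PrRd -> BusRd, I -> E (no sharer asserts "shared")   */
             /* store: PrWr, silent E -> M: no invalidation on the bus      */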
Locality in Parallel Programming
- Cache-aware access
- Cache invalidation traffic important
- Awareness of data placement in memory
- Important for CC-NUMA because long memory latencies for non-local memory access
- Awareness of data assignment to threads when load balancing
for (i = 0; i < n; i++)
    a[i] = i;
#pragma omp parallel for
for (i = 1; i < n; i++)       /* start at 1: a[i-1] is out of bounds at i = 0 */
    b[i] = f(a[i-1], a[i]);
-> better locality below: a[] is initialized in parallel too, so with first-touch placement each page lands on the node of the thread that later uses it
#pragma omp parallel for
for (i = 0; i < n; i++)
    a[i] = i;
#pragma omp parallel for
for (i = 1; i < n; i++)
    b[i] = f(a[i-1], a[i]);
True & False Sharing
- True sharing: 2 threads accessing the same locations in one cache block
- False sharing: 2 threads accessing distinct locations in one cache block
  - Block will ping-pong between the caches
  - Fix: ensure the locations map to separate cache blocks, e.g. insert padding
Example
/* 2 processors; a cache line holds 8 array elements */
#pragma omp parallel for schedule(static, 1)   /* cyclic; schedule(cyclic) is not valid OpenMP */
for (i = 0; i < n; i++)
    a[i] = b[i];
- Blocks will ping-pong: adjacent iterations run on different threads, so both threads write to every cache line of a[]
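A hedged fix sketch: with a blocked (default static) schedule each thread writes one long contiguous range, so apart from the chunk-boundary lines the threads touch disjoint cache lines.
#pragma omp parallel for schedule(static)   /* contiguous chunk per thread */
for (i = 0; i < n; i++)
    a[i] = b[i];
/* only the cache lines at chunk boundaries can still false-share;
   aligning/padding a[] to the line size removes even that */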