Content
- Overview
- Multicore Processor
- Direct Connections
- On-Chip Networks
- Shared Memory Cores
- Symmetric Multiprocessors
- Synchronization
- Sequential Consistency
- Memory Fences
- Memory Coherence
- Cache Coherence vs. Memory Consistency
- Example: Parallel I/O
- Intervention
- False Sharing
- Out-of-Order Loads/Stores & CC
Overview
- Instruction level parallelism
- Dynamic: Super-scalar processor, OOO execution
- Static: VLIW
- Data level parallelism
- Vector machines, SIMD extensions
- Thread level parallelism
Multicore Processors
Uniprocessor speed limited by power wall -> multicore
Direct Connections
- Low latency, high throughput, point-to-point network between processors
- Bypass I/O subsystems
- Low latency between neighboring processors
- Sometimes dedicated machine instructions
- Multi-hop routing between further processors
- Often tied to distributed system design
- Often proprietary design
On-Chip Networks
- Building network for system-on-chip
- Complete computer system on a chip, including graphics, peripheral and memory controllers, and accelerators
- MPSoC (Multi-processor system on a chip)
- Multiple compute cores in the system
- Mostly proprietary
Shared Memory Cores
Symmetric Multiprocessors
Each processor equally far away from memory; any processor can do any I/O.
Synchronization
Producer-consumer
Mutual exclusion
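As a concrete instance of mutual exclusion, here is a minimal sketch of a test-and-set spinlock in C++ (the names `SpinLock` and `counter` are illustrative, not from the notes):

```cpp
#include <atomic>
#include <thread>
#include <vector>
#include <iostream>

// Minimal test-and-set spinlock: only one thread holds the lock at a time.
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock()   { while (locked.exchange(true, std::memory_order_acquire)) { /* spin */ } }
    void unlock() { locked.store(false, std::memory_order_release); }
};

SpinLock lock_;
long counter = 0;   // shared data protected by the lock

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([] {
            for (int i = 0; i < 100000; ++i) {
                lock_.lock();      // critical section: at most one thread at a time
                ++counter;
                lock_.unlock();
            }
        });
    for (auto& th : threads) th.join();
    std::cout << counter << "\n";  // 400000 if mutual exclusion holds
}
```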
Sequential Consistency
A system is sequentially consistent if the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each processor appear in this sequence in the order specified by its program.
Sequential consistency = arbitrary order-preserving interleaving of memory references of sequential programs
Imposes more memory ordering constraints than those imposed by uniprocessor program dependencies.
- Issues
- Out-of-order execution capability
- Caches: a store may not yet be visible to other processors
No common commercial architecture has a sequentially consistent memory model.
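A classic way to see what SC guarantees is the store-buffering litmus test. The sketch below (variable names are illustrative) uses C++ `seq_cst` atomics to request SC for these accesses; under SC, some global interleaving of the four operations must exist, so the outcome `r1 == 0 && r2 == 0` is impossible:

```cpp
#include <atomic>
#include <thread>
#include <cassert>

std::atomic<int> x{0}, y{0};   // shared locations
int r1 = 0, r2 = 0;            // per-thread observations

void t0() {
    x.store(1, std::memory_order_seq_cst);   // write x ...
    r1 = y.load(std::memory_order_seq_cst);  // ... then read y
}

void t1() {
    y.store(1, std::memory_order_seq_cst);   // write y ...
    r2 = x.load(std::memory_order_seq_cst);  // ... then read x
}

int main() {
    std::thread a(t0), b(t1);
    a.join(); b.join();
    // Under SC, whichever store is globally first is seen by the other
    // thread's later load, so at least one of r1, r2 must be 1.
    assert(r1 == 1 || r2 == 1);
}
```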
Memory Fences
Processors with relaxed/weak memory models (which permit loads/stores to different addresses to be reordered) need to provide memory fence instructions to force serialization of memory accesses.
Expensive, but cost only paid when needed.
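On a weakly ordered machine the store-buffering pattern above can yield `r1 == r2 == 0` if the accesses are left unordered. The sketch below shows where a fence would go, using C++ `std::atomic_thread_fence` as a portable stand-in for the hardware fence instruction (variable names are illustrative):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void t0() {
    x.store(1, std::memory_order_relaxed);                 // plain store
    std::atomic_thread_fence(std::memory_order_seq_cst);   // fence: store ordered before the load
    r1 = y.load(std::memory_order_relaxed);                // plain load
}

void t1() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);   // without both fences, r1 == r2 == 0 is allowed
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(t0), b(t1);
    a.join(); b.join();
    // With the fences, at least one of r1, r2 is 1; the cost of the
    // fence is paid only in code that actually needs the ordering.
}
```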
Memory Coherence
Both write-back and write-through caches can still leave memory inconsistent across processors.
- H/W support
- Only 1 processor has write permission to a memory location at a time
- No processor can load a stale copy
Cache Coherence vs. Memory Consistency
- Cache coherence protocol
- Ensures all writes by 1 processor are eventually visible to others, for 1 memory address
- Not enough to ensure sequential consistency
- Memory consistency model
- Gives rules for when a write becomes visible to another processor's read, across different addresses
- Cache coherence protocol + processor memory reorder buffer -> implement memory consistency model
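The message-passing pattern below (the names `data` and `ready` are illustrative) shows the gap: coherence alone makes each individual store to `data` and to `ready` eventually visible, but only the consistency model (expressed here with release/acquire ordering) guarantees that a reader who sees `ready == 1` also sees `data == 42`:

```cpp
#include <atomic>
#include <thread>
#include <cassert>

int data = 0;                    // ordinary shared variable
std::atomic<int> ready{0};       // flag published after data

void producer() {
    data = 42;                                  // write data first
    ready.store(1, std::memory_order_release);  // release: orders the data write before the flag
}

void consumer() {
    while (ready.load(std::memory_order_acquire) == 0) { /* spin */ }
    // The acquire load pairs with the release store above: ordering across
    // two different addresses comes from the consistency model, not coherence.
    assert(data == 42);
}

int main() {
    std::thread p(producer), c(consumer);
    p.join(); c.join();
}
```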
Example: Parallel I/O
DMA: direct memory access; I/O devices can read/write memory autonomously, without involving the CPU
- Problem
- Memory -> disk: DMA may read stale data from memory while a cache still holds the modified copy
- Disk -> memory: caches may keep stale copies of the memory that the DMA transfer just overwrote
- Solution: snoopy cache
- Caches watch (snoop) DMA transfers on the bus
- Tags are dual-ported
Observed Bus Cycle | Cache State | Cache Action |
---|---|---|
DMA read (memory -> disk) | Not cached | N/A |
 | Cached, not modified | N/A |
 | Cached, modified | Cache intervenes |
DMA write (disk -> memory) | Not cached | N/A |
 | Cached, not modified | Cache purges copy |
 | Cached, modified | ??? |
- Snoopy cache coherence protocol
- Write miss
- Invalidate address in all other caches
- Read miss / write miss
- Force write-back from dirty cache to memory
- Cache state transition diagram
- MSI
- MESI
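As a rough sketch of what an MSI state transition diagram encodes (the enum, event names, and transition function below are hypothetical simplifications, not from the notes), each cache line moves between Modified, Shared, and Invalid in response to processor and snooped bus events:

```cpp
#include <cstdio>

enum class State { Invalid, Shared, Modified };
enum class Event {
    ProcRead, ProcWrite,       // requests from the local processor
    BusReadMiss, BusWriteMiss  // requests snooped from other caches on the bus
};

// Next state of one cache line under a simplified MSI protocol.
State msi_next(State s, Event e) {
    switch (e) {
    case Event::ProcRead:     return s == State::Invalid ? State::Shared : s;
    case Event::ProcWrite:    return State::Modified;            // gain exclusive write permission
    case Event::BusReadMiss:  return s == State::Modified ? State::Shared : s;  // write back, drop to shared
    case Event::BusWriteMiss: return State::Invalid;             // another cache wants to write
    }
    return s;
}

int main() {
    State s = State::Invalid;
    s = msi_next(s, Event::ProcRead);     // Invalid  -> Shared
    s = msi_next(s, Event::ProcWrite);    // Shared   -> Modified
    s = msi_next(s, Event::BusReadMiss);  // Modified -> Shared (forces write-back)
    std::printf("%d\n", static_cast<int>(s)); // 1 == Shared
}
```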
Intervention
When a processor's read request goes on the bus and another cache holds a modified copy of the line, memory, which only has stale data, may also respond to the request.
The cache holding the modified copy must intervene through the memory controller to supply the correct data.
False Sharing
Two processors access (and at least one writes) different words that happen to lie in the same cache line; because coherence state is tracked per line, the line bounces between the caches even though no word is actually shared.
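A minimal sketch of the effect (struct and counter names are illustrative): two threads update different counters that share a cache line, so each update invalidates the other cache's copy; padding the counters onto separate lines (assuming 64-byte lines) avoids it.

```cpp
#include <atomic>
#include <thread>

// Two counters packed into one cache line: updates by different threads
// invalidate each other's copy of the line even though no word is shared.
struct Packed {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Same counters padded onto separate cache lines.
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename T>
void run(T& c) {
    std::thread t1([&] { for (int i = 0; i < 1'000'000; ++i) c.a.fetch_add(1); });
    std::thread t2([&] { for (int i = 0; i < 1'000'000; ++i) c.b.fetch_add(1); });
    t1.join(); t2.join();
}

int main() {
    Packed p;  run(p);   // typically slower: the line ping-pongs between caches
    Padded q;  run(q);   // typically faster: no false sharing
}
```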
Out-of-Order Loads/Stores & CC
- Blocking caches
- One request at a time + CC => SC
- Non-blocking caches
- Multiple outstanding requests (to different addresses) + CC => relaxed memory models