Content

  1. Overview
  2. Multicore Processors
    1. Direct Connections
    2. On-Chip Networks
    3. Shared Memory Cores
    4. Symmetric Multiprocessors
    5. Synchronization
      1. Sequential Consistency
      2. Memory Fences
    6. Memory Coherence
      1. Cache Coherence vs. Memory Consistency
      2. Example: Parallel I/O
    7. Intervention
    8. False Sharing
    9. Out-of-Order Loads/Stores & CC

Overview

  • Instruction level parallelism
    • Dynamic: Super-scalar processor, OOO execution
    • Static: VLIW
  • Data level parallelism
    • Vector machines, SIMD extensions
  • Thread level parallelism

Multicore Processors

Uniprocessor speed is limited by the power wall -> multicore

Direct Connections

  • Low latency, high throughput, point-to-point network between processors
    • Bypass I/O subsystems
  • Low latency between neighboring processors
    • Sometimes dedicated machine instructions
  • Multi-hop routing between further processors
  • Often tied to distributed system design
  • Often proprietary design

On-Chip Networks

  • Building network for system-on-chip
    • Complete computer system on a chip, including graphics, peripheral and memory controllers, and accelerators
  • MPSoC (Multi-processor system on a chip)
    • Multiple compute cores in the system
  • Mostly proprietary

Shared Memory Cores

Symmetric Multiprocessors

Each processor equally far away from memory; any processor can do any I/O.

Synchronization

  • Producer-consumer

  • Mutual exclusion

Sequential Consistency

A system is sequentially consistent if the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each processor appear in this sequence in the order specified by its program.

Sequential consistency = arbitrary order-preserving interleaving of memory references of sequential programs

Imposes more memory ordering constraints than those imposed by uniprocessor program dependencies.

  • Issues
    • Out-of-order execution capability
    • Caches: a store may not yet be seen by other processors

No common commercial architecture has a sequentially consistent memory model.

Memory Fences

Processors with relaxed/weak memory models (permit load/store to different addresses to be reordered) need to provide memory fence instructions to force serialization of memory accesses.
Expensive, but cost only paid when needed.

Memory Coherence

Both write-back and write-through may still cause memory inconsistency.

  • H/W support
    • Only 1 processor has write permission to a memory location at a time
    • No processor can load a stale copy

Cache Coherence vs. Memory Consistency

  • Cache coherence protocol
    • Ensures all writes by 1 processor are eventually visible to others, for 1 memory address
    • Not enough to ensure sequential consistency
  • Memory consistency model
    • Gives rules for when a write becomes visible to another processor's read, across different addresses
  • Cache coherence protocol + processor memory reorder buffer -> implement memory consistency model

Example: Parallel I/O

DMA: direct memory access; I/O devices can read/write memory autonomously, without involving the CPU

  • Problem
    • Memory -> disk: the disk may read stale data if the latest copy is in a dirty cache
    • Disk -> memory: cached copies become stale once DMA overwrites memory
  • Solution: snoopy cache
    • Cache watches DMA transfers on the bus
    • Tags are dual-ported

  Observed Bus Cycle           Cache State           Cache Action
  DMA read (memory -> disk)    Not cached            N/A
                               Cached, not modified  N/A
                               Cached, modified      Cache intervenes
  DMA write (disk -> memory)   Not cached            N/A
                               Cached, not modified  Cache purges copy
                               Cached, modified      ???

  • Snoopy cache coherence protocol
    • Write miss
      • Invalidate address in all other caches
    • Read miss
      • Force write-back from dirty cache to memory
  • Cache state transition diagram
    • MSI
    • MESI

Intervention

When a cache issues a bus read for a line that another cache holds modified, memory (whose copy is stale) may also respond to the request.
The owning cache must intervene, through the memory controller, to supply the correct data.

False Sharing

Two processors access different words that happen to lie in the same cache line; if at least one of them writes, the line ping-pongs between the caches even though no word is actually shared.

Out-of-Order Loads/Stores & CC

  • Blocking caches
    • One request at a time + cache coherence => sequential consistency
  • Non-blocking caches
    • Multiple requests (to different addresses) + cache coherence => relaxed memory models
