Content
- Big Picture of Architecture & Memory Access
- Virtual Memory
- Memory Access
- Multithreading Basics
- Parallelism within Processor
- Connection between Memory Access, Memory Allocation and Cache Efficiency
- Cache Coherence
- Cache Coherence Overview
- CC Protocols
- Locality in Parallel Programming
- True & False Sharing
Big Picture of Architecture & Memory Access
Virtual Memory
- Protection: each process has its own private memory space
- Sharing: processes can share physical memory frames (see the sketch below)
- Hide fragmentation: a process can run even without enough total or contiguous physical memory
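A minimal sketch of the sharing point, assuming POSIX shared memory (the object name "/demo" is made up; link with -lrt on older glibc): two processes that map the same object see the same physical frames at possibly different virtual addresses.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* both processes would map "/demo" into their own virtual address space */
    int fd = shm_open("/demo", O_CREAT | O_RDWR, 0600);   /* made-up name */
    if (fd < 0) return 1;
    ftruncate(fd, 4096);                                  /* one page */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;
    strcpy(p, "same physical frame, different virtual addresses");
    printf("%s\n", p);
    munmap(p, 4096);
    close(fd);
    shm_unlink("/demo");
    return 0;
}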
Memory Access
- Processor sends VA to MMU
- MMU fetches PTE from PT in memory
- If the PTE is absent/invalid, the MMU raises a page fault exception
- Handler identifies a victim page (if dirty, pages it out to disk)
- Handler pages in the new page; updates the PTE
- Handler returns to the original process; the faulting instruction restarts
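A toy user-space simulation of these steps, purely for illustration (2 frames, 4 pages, FIFO victim selection, and the reference string are all made up):
#include <stdio.h>
#include <string.h>

#define NFRAMES 2
#define NPAGES  4

int main(void) {
    int frame_page[NFRAMES];   /* which page each frame holds (-1 = free) */
    int frame_dirty[NFRAMES];
    int pte[NPAGES];           /* page -> frame, -1 = not resident        */
    int hand = 0;              /* FIFO victim pointer                     */
    memset(frame_page, -1, sizeof frame_page);
    memset(frame_dirty, 0, sizeof frame_dirty);
    memset(pte, -1, sizeof pte);

    int refs[] = {0, 1, 0, 2, 3, 0};             /* made-up reference string */
    for (int i = 0; i < (int)(sizeof refs / sizeof refs[0]); i++) {
        int p = refs[i];
        if (pte[p] == -1) {                      /* PTE absent: page fault   */
            int v = hand;                        /* identify victim frame    */
            if (frame_page[v] != -1) {
                if (frame_dirty[v])
                    printf("  page out dirty page %d\n", frame_page[v]);
                pte[frame_page[v]] = -1;         /* unmap the victim         */
            }
            printf("  page in page %d -> frame %d\n", p, v);
            frame_page[v] = p;
            frame_dirty[v] = 0;
            pte[p] = v;                          /* update the PTE           */
            hand = (hand + 1) % NFRAMES;
        }
        printf("access page %d via frame %d\n", p, pte[p]); /* restarted access */
    }
    return 0;
}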
Multithreading Basics
Speedup vs. thread count: near-linear increase -> diminishing returns -> degradation once threads oversubscribe cores or saturate shared resources
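A sketch of how to observe this curve, assuming OpenMP (compile with -fopenmp; the array size and loop body are made up, and the exact curve depends on core count and memory bandwidth):
#include <omp.h>
#include <stdio.h>

int main(void) {
    enum { N = 1 << 22 };
    static double a[1 << 22];
    for (int t = 1; t <= 32; t *= 2) {
        double t0 = omp_get_wtime();
        #pragma omp parallel for num_threads(t)  /* fixed work, varying threads */
        for (int i = 0; i < N; i++)
            a[i] = a[i] * 1.000001 + 1.0;
        printf("%2d threads: %.4f s\n", t, omp_get_wtime() - t0);
    }
    return 0;
}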
Parallelism within Processor
- Example
x = x * (data[i] * data[i+1])
- Limitations
- Number of functional units
- Latency of operation causing dependency
- Number of registers to hold temporaries
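These limits show up directly when unrolling with multiple independent accumulators. A sketch (the function name is made up; assumes n is even):
/* Two independent dependency chains: the multiplies of acc0 and acc1
   can overlap in the pipeline instead of serializing on one chain */
double product2(const double *data, int n) {
    double acc0 = 1.0, acc1 = 1.0;   /* extra registers hold the temporaries */
    for (int i = 0; i < n; i += 2) {
        acc0 *= data[i];             /* chain 0 */
        acc1 *= data[i + 1];         /* chain 1: independent of chain 0 */
    }
    return acc0 * acc1;              /* combine once at the end */
}
More chains help until the number of functional units or registers runs out, which is exactly the limitation list above.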
Connection between Memory Access, Memory Allocation and Cache Efficiency
- Dynamic memory allocation
- Space utilization
- Time complexity on malloc/free
- Memory-wall: growing disparity of CPU & RAM speeds
- Caching effectiveness important
- Padding is not cache-friendly: internal fragmentation wastes bytes in every cache line, so avoid it in hot data structures (see the sketch below)
- False sharing: occurs when >= 2 processors access different data in the same cache line, and at least one of them writes
- Memory arenas: each thread allocates only from its own arena (a contiguous block of memory), so one thread's data stays together
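For the padding point, a small sketch: reordering members shrinks compiler-inserted padding, so more useful data fits per cache line (sizes assume a typical LP64 ABI):
#include <stdio.h>

/* 1 + 7(pad) + 8 + 1 + 7(pad) = 24 bytes on a typical LP64 ABI */
struct padded    { char a; double x; char b; };
/* 8 + 1 + 1 + 6(pad) = 16 bytes: same fields, reordered */
struct reordered { double x; char a; char b; };

int main(void) {
    printf("padded: %zu, reordered: %zu\n",
           sizeof(struct padded), sizeof(struct reordered));
    return 0;
}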
Cache Coherence
Cache Coherence Overview
- Shared memory machines
- Small # of processors: shared memory with coherent caches (SMP)
- Large # of processors: distributed shared memory with coherent caches (CC-NUMA)
- Caches in multiprocessors
- Coherence problem: the same line appears in >= 2 caches; one processor writes while others may read stale copies
CC Protocols
MSI Protocol
- Notation: observed event / action taken ("-" = no bus transaction)
- State transitions
  - I --PrRd / BusRd--> S      (read miss: fetch a shared copy)
  - I --PrWr / BusRdX--> M     (write miss: fetch an exclusive copy)
  - S --PrRd / - --> S         (read hit: no bus traffic)
  - S --PrWr / BusRdX--> M     (upgrade: invalidate other copies)
  - S --BusRd / - --> S        (another cache reads: no action)
  - S --BusRdX / - --> I       (another cache writes: invalidate)
  - M --BusRd / Flush--> S     (supply the line, downgrade)
  - M --BusRdX / Flush--> I    (supply the line, invalidate)
- Assumption: bus based architecture
- Bus is reliable, ordered broadcast bus (snooping bus)
- States of a cache line
- Invalid
- Shared: one of many cached copies
- Modified: sole valid copy
- Processor events
- PrRd
- PrWr
- Bus Transactions
- BusRd: simply asks for copy
- BusRdX: asks for copy to modify
- Flush: updates memory
- Actions
- Update state, perform bus transaction, flush value onto bus
- Problem
- Reading then modifying data costs two bus transactions (BusRd for I->S, then BusRdX)
- A write to a Shared line always generates an invalidation request (BusRdX), even if no other cache actually holds a copy
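Concretely, a sketch of the traffic for one read-modify-write (the variable and its initial state are assumed):
int x;       /* assume the line holding x starts in state I in this cache */
x = x + 1;   /* load : PrRd -> BusRd,  I -> S   (bus transaction 1)        */
             /* store: PrWr -> BusRdX, S -> M   (bus transaction 2)        */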
MESI Protocol
- One more state
- Exclusive/Exclusive-clean: only this cache has a copy, and it matches memory (not modified)
- A write to an Exclusive line does not generate an invalidation request (silent E -> M transition)
- Typically built on write-back caches
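Under MESI the same read-modify-write on unshared data is cheaper (same assumed setup as the MSI sketch):
int y;       /* assume no other cache holds the line */
y = y + 1;   /* load : PrRd -> BusRd, I -> E (no sharer asserts "shared")   */
             /* store: PrWr, silent E -> M: no invalidation on the bus      */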
Locality in Parallel Programming
- Cache-aware access
- Cache invalidation traffic important
- Awareness of data placement in memory
- Important for CC-NUMA because long memory latencies for non-local memory access
- Awareness of data assignment to threads when load balancing
for (i = 0; i < n; i++)
    a[i] = i;
#pragma omp parallel for
for (i = 1; i < n; i++)       /* start at 1: a[i-1] is out of bounds at i = 0 */
    b[i] = f(a[i-1], a[i]);
-> better locality below: a[] is initialized in parallel too, so with first-touch placement each page lands on the node of the thread that later uses it
#pragma omp parallel for
for (i = 0; i < n; i++)
    a[i] = i;
#pragma omp parallel for
for (i = 1; i < n; i++)
    b[i] = f(a[i-1], a[i]);
True & False Sharing
- True sharing: 2 threads accessing the same locations in one cache block
- False sharing: 2 threads accessing distinct locations in one cache block
  - Block will ping-pong between the caches
  - Fix: ensure the locations map to separate cache blocks, e.g. insert padding
Example
/* 2 processors; a cache line holds 8 array elements */
#pragma omp parallel for schedule(static, 1)   /* cyclic; schedule(cyclic) is not valid OpenMP */
for (i = 0; i < n; i++)
    a[i] = b[i];
- Blocks will ping-pong: adjacent iterations run on different threads, so both threads write to every cache line of a[]
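A hedged fix sketch: with a blocked (default static) schedule each thread writes one long contiguous range, so apart from the chunk-boundary lines the threads touch disjoint cache lines.
#pragma omp parallel for schedule(static)   /* contiguous chunk per thread */
for (i = 0; i < n; i++)
    a[i] = b[i];
/* only the cache lines at chunk boundaries can still false-share;
   aligning/padding a[] to the line size removes even that */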