Content

  1. Big Picture of Architecture & Memory Access
    1. Virtual Memory
      • Memory Access
    2. Multithreading Basics
      • Parallelism within Processor
    3. Connection between Memory Access, Memory Allocation and Cache Efficiency
  2. Cache Coherence
    1. Cache Coherence Overview
    2. CC Protocols
    3. Locality in Parallel Programming
    4. True & False Sharing

Big Picture of Architecture & Memory Access

Virtual Memory

  • Protection: each process has its own private memory space
  • Sharing: processes can share physical memory frames
  • Hide fragmentation: a process can run even without enough total or contiguous physical memory

Memory Access

  1. Processor sends virtual address (VA) to the MMU
  2. MMU fetches the PTE from the page table (PT) in memory
  3. If the PTE is absent/invalid, the MMU triggers a page fault exception
  4. Handler identifies a victim page (if dirty, pages it out to disk)
  5. Handler pages in the new page and updates the PTE
  6. Handler returns to the original process; the faulting instruction restarts
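
A minimal sketch of steps 4-6 as C code; the PTE layout and the swap helpers are hypothetical stand-ins for illustration, not a real kernel API:

    #include <stdio.h>

    /* Toy PTE: just the fields the steps above need. */
    typedef struct { int present, dirty; unsigned frame; } pte_t;

    static pte_t page_table[4];                      /* toy page table */

    static void swap_out(unsigned f) { printf("page out frame %u\n", f); }
    static void swap_in(unsigned f)  { printf("page in  frame %u\n", f); }
    static pte_t *select_victim(void) { return &page_table[0]; } /* e.g. LRU */

    void handle_page_fault(pte_t *pte) {
        pte_t *victim = select_victim();             /* step 4: pick victim */
        if (victim->dirty)
            swap_out(victim->frame);                 /* dirty: write to disk */
        victim->present = 0;

        swap_in(victim->frame);                      /* step 5: page in */
        pte->frame   = victim->frame;
        pte->present = 1;                            /* update the PTE */
    }   /* step 6: return; hardware restarts the faulting instruction */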

Multithreading Basics

Throughput vs. number of threads: near-linear increase at first, then diminishing gains (increasing slowdown), and finally degradation.

Parallelism within Processor

  • Example: x = x * (data[i] * data[i+1]), where the inner product data[i] * data[i+1] does not depend on x and can execute in parallel with the running product (see the sketch after this list)
  • Limitations
    • Number of functional units
    • Latency of operation causing dependency
    • Number of registers to hold temporaries
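
A hedged illustration of the example above: unrolling with reassociation breaks the single dependency chain so independent multiplies can overlap (generic code, not taken from the source):

    #include <stddef.h>

    /* Baseline: every multiply depends on the previous x, so the loop
     * runs at the latency of one multiply per element. */
    double product_seq(const double *data, size_t n) {
        double x = 1.0;
        for (size_t i = 0; i < n; i++)
            x = x * data[i];
        return x;
    }

    /* Reassociated form x = x * (data[i] * data[i+1]): the inner
     * product does not depend on x, so it can issue in parallel with
     * the running product on another functional unit. Assumes n is
     * even; note FP reassociation can change rounding slightly. */
    double product_ilp(const double *data, size_t n) {
        double x = 1.0;
        for (size_t i = 0; i < n; i += 2)
            x = x * (data[i] * data[i + 1]);
        return x;
    }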

Connection between Memory Access, Memory Allocation and Cache Efficiency

  • Dynamic memory allocation
    • Space utilization
    • Time complexity on malloc/free
    • Memory-wall: growing disparity of CPU & RAM speeds
      • Caching effectiveness important
      • Padding is not cache-friendly: internal fragmentation wastes cache capacity, so avoid it
  • False sharing: occurs when >= 2 processors access different data in the same cache line, and at least one of them writes
    • Memory arenas: each thread allocates only from its own arena (contiguous blocks of memory), as in the sketch below
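
A minimal sketch of a per-thread bump-pointer arena (names, sizes, and the cache-line constant are illustrative assumptions, not from the source):

    #include <stdlib.h>

    #define ARENA_SIZE (1 << 20)   /* 1 MiB per thread; illustrative */
    #define CACHE_LINE 64          /* assumed cache line size */

    /* Each thread owns one arena, so blocks handed to different
     * threads never come from the same cache line. */
    typedef struct {
        char  *base;
        size_t used;
    } arena_t;

    void arena_init(arena_t *a) {
        a->base = aligned_alloc(CACHE_LINE, ARENA_SIZE);
        a->used = 0;
    }

    void *arena_alloc(arena_t *a, size_t size) {
        /* round up to a cache-line multiple so neighbouring blocks
         * within the arena do not share a line either */
        size = (size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
        if (a->used + size > ARENA_SIZE) return NULL;
        void *p = a->base + a->used;
        a->used += size;
        return p;
    }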

Cache Coherence

Cache Coherence Overview

  • Shared memory machines
    • Small # of processors: shared memory with coherent caches (SMP)
    • Large # of processors: distributed shared memory with coherent caches (CC-NUMA)
  • Caches in multiprocessors
    • Same line appears in >= 2 caches; if one processor writes while others read, the copies become inconsistent: this is the coherence problem

CC Protocols

MSI Protocol

MSI state diagram (edge labels: input / action):

    I --PrRd/BusRd-->    S
    I --PrWr/BusRdX-->   M
    S --PrRd/- -->       S    (read hit, no bus transaction)
    S --PrWr/BusRdX-->   M
    S --BusRd/- -->      S
    S --BusRdX/- -->     I
    M --BusRd/Flush-->   S
    M --BusRdX/Flush-->  I
  • Assumption: bus based architecture
    • Bus is reliable, ordered broadcast bus (snooping bus)
  • States of a cache line
    • Invalid
    • Shared: one of many cached copies
    • Modified: sole valid copy
  • Processor events
    • PrRd
    • PrWr
  • Bus Transactions
    • BusRd: simply asks for copy
    • BusRdX: asks for copy to modify
    • Flush: updates memory
  • Actions
    • Update state, perform bus transaction, flush value onto bus
  • Problem
    • Reading then modifying data takes 2 bus transactions (BusRd for I -> S, then BusRdX)
    • A write to a Shared line always generates an invalidation request (BusRdX), even when no other cache holds a copy
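
The transition diagram above, written out as a small C model (a didactic sketch, not a hardware implementation):

    typedef enum { I, S, M } msi_state_t;
    typedef enum { PrRd, PrWr, BusRd, BusRdX } msi_event_t;
    typedef enum { None, DoBusRd, DoBusRdX, DoFlush } msi_action_t;

    /* One cache line: given the current state and an observed event,
     * return the next state and the bus action performed. */
    msi_state_t msi_step(msi_state_t st, msi_event_t ev, msi_action_t *act) {
        *act = None;
        switch (st) {
        case I:
            if (ev == PrRd) { *act = DoBusRd;  return S; }
            if (ev == PrWr) { *act = DoBusRdX; return M; }
            return I;
        case S:
            if (ev == PrWr)   { *act = DoBusRdX; return M; }
            if (ev == BusRdX) return I;     /* another cache will write */
            return S;                       /* PrRd, BusRd: stay Shared */
        case M:
            if (ev == BusRd)  { *act = DoFlush; return S; } /* supply data */
            if (ev == BusRdX) { *act = DoFlush; return I; }
            return M;
        }
        return st;
    }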

MESI Protocol

  • One more state
    • Exclusive (exclusive-clean): only this cache has a copy, and the copy is unmodified
    • A write to an Exclusive line does not generate an invalidation request (no other cache can hold a copy)
  • Typically built on write-back caches
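
A small extension of the MSI sketch showing the point of the E state; the "another cache has a copy" bus signal is an assumed interface:

    typedef enum { INV, SHR, EXC, MOD } mesi_state_t;

    /* Read miss: the bus reports whether any other cache holds the
     * line; if not, load it Exclusive-clean instead of Shared. */
    mesi_state_t mesi_read_miss(int other_copy_exists) {
        return other_copy_exists ? SHR : EXC;
    }

    /* Payoff: a write to an Exclusive line needs no invalidation,
     * since no other cache can hold a copy. */
    mesi_state_t mesi_write(mesi_state_t st, int *needs_busrdx) {
        *needs_busrdx = (st == SHR || st == INV);
        return MOD;
    }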

Locality in Parallel Programming

  • Cache-aware access
    • Cache invalidation traffic important
  • Awareness of data placement in memory
    • Important for CC-NUMA because of the long latency of non-local memory accesses
  • Awareness of data assignment to threads when load balancing
// worse locality: a[] is written by one thread, then read by all threads
for (i = 0; i < n; i++)
    a[i] = i;
#pragma omp parallel for
for (i = 1; i < n; i++)     /* start at 1: a[i-1] is out of bounds at i = 0 */
    b[i] = f(a[i-1], a[i]);

-> better locality: with the same static schedule, each thread initializes
   and then reuses (mostly) the same elements of a
#pragma omp parallel for
for (i = 0; i < n; i++)
    a[i] = i;
#pragma omp parallel for
for (i = 1; i < n; i++)
    b[i] = f(a[i-1], a[i]);

True & False Sharing

  • True sharing: 2 threads accessing same locations on one block
  • False sharing: 2 threads accessing distinct locations on one block

    • Block will ping-pong
      • Ensure they map to separate cache blocks, e.g. by inserting padding (see the sketch after the example below)
    • Example

        // 2 processors; one cache line holds 8 array elements
        // OpenMP has no schedule(cyclic); schedule(static, 1) gives the
        // cyclic (round-robin) distribution intended here
        #pragma omp parallel for schedule(static, 1)
        for (i = 0; i < n; i++)
            a[i] = b[i];
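
With this cyclic distribution, adjacent elements belong to different threads, so the two processors keep invalidating each other's copy of the same line. A hedged sketch of two common fixes (constants are illustrative):

    #define CACHE_LINE 64   /* assumed 64-byte line (8 doubles) */

    /* Fix 1: contiguous (blocked) scheduling, so each thread owns
     * whole cache lines worth of consecutive elements. */
    void copy_blocked(double *a, const double *b, int n) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            a[i] = b[i];
    }

    /* Fix 2: padding, so per-thread data maps to separate lines
     * (useful for per-thread counters or accumulators). */
    typedef struct {
        double value;
        char pad[CACHE_LINE - sizeof(double)];
    } padded_double;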
      
