Content
- Locks & Cache Coherence
- Atomic Operations
- Ticket Lock
- Array-Based Queueing Locks
- List-Based Queueing Locks (MCS Locks)
- Summary
- Case Studies
Locks & Cache Coherence
- Criteria for evaluating spin locks
- Scalability & induced network load
- Single-processor latency
- Space requirements
- Fairness
- Implementability with available atomic operations
Atomic Operations
H/W support is required to implement synchronization primitives.
Test & set
- Steps
- Load old value to register
- Set value in memory to 1
- Return old value
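As a sketch, these steps correspond to the following C; the body must execute atomically in hardware (on GCC/Clang the `__sync_lock_test_and_set` builtin provides this):

```c
// Semantics of test-and-set; the three steps execute atomically in hardware.
// In real code use a builtin such as __sync_lock_test_and_set(addr, 1).
int test_and_set(int *addr) {
    int old = *addr;  // load old value to register
    *addr = 1;        // set value in memory to 1
    return old;       // return old value
}
```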
Lock implementation (TAS)
```c
struct lock { int held; };  // held is initially 0 (free)

void acquire(struct lock *lock) {
    while (test_and_set(&lock->held))
        ;  // spin until the old value was 0 (we won the lock)
}

void release(struct lock *lock) {
    lock->held = 0;
}
```
- Contention
  - 2 threads contending to acquire the lock keep writing 1 to it, bouncing the cache line between them
  - Caused by spinning with writes on a shared global variable
  - Poor scalability
Improved lock implementation (TATAS)
```c
void acquire(struct lock *lock) {
    do {
        while (lock->held == 1)
            ;  // spin with reads only (hits in the local cache)
    } while (test_and_set(&lock->held));  // attempt the atomic write once the lock looks free
}

void release(struct lock *lock) {
    lock->held = 0;
}
```
Further improved lock implementation (TAS with backoff)
```c
while (test_and_set(L)) {   // retry while acquisition fails
    pause(delay);           // wait before retrying
    delay = delay * 2;      // exponential backoff
}
```
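As a fuller sketch, the same idea as a complete acquire function; the initial delay, the cap, and the `pause(n)` busy-wait helper are assumptions to be tuned per machine:

```c
#define MAX_DELAY 1024  // assumed cap, tuned empirically

void acquire_backoff(struct lock *lock) {
    int delay = 1;
    while (test_and_set(&lock->held)) {
        pause(delay);           // hypothetical helper: busy-wait ~delay iterations
        if (delay < MAX_DELAY)
            delay = delay * 2;  // exponential backoff, capped
    }
}
```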
- Other atomic operations
- Swap
- Fetch & Op
- Compare & swap
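For instance, a sketch of compare & swap semantics (again atomic in hardware; GCC/Clang expose it as `__sync_val_compare_and_swap`):

```c
// Semantics of compare-and-swap; executes atomically in hardware.
int compare_and_swap(int *addr, int expected, int new_val) {
    int old = *addr;
    if (old == expected)
        *addr = new_val;  // install new value only on a match
    return old;           // caller tests old == expected for success
}
```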
Ticket Lock
```c
my_ticket = fetch_and_increment(&next_ticket);
while (my_ticket != now_serving)
    pause();  // back-off
```
- Improvements over TAS
  - Only 1 process tries to acquire the lock upon release
  - Only 1 process takes a read miss upon release (not achieved: every waiter re-reads `now_serving`)
  - Lock acquired in FIFO order (fairness)
  - Further improvement: proportional backoff, i.e. pause in proportion to `my_ticket - now_serving` (see the sketch after the next list)
- Components
  - 2 counters: `next_ticket`, `now_serving`
  - Lock acquire: `fetch-and-increment` on `next_ticket`
  - Lock release: increment on `now_serving` -> causes contention
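Putting the pieces together, a minimal ticket-lock sketch with proportional backoff; `fetch_and_increment` is assumed to return the counter's old value, and `pause(n)` to busy-wait in proportion to `n`:

```c
struct ticket_lock {
    int next_ticket;  // next ticket to hand out
    int now_serving;  // ticket of the current lock holder
};

void ticket_acquire(struct ticket_lock *l) {
    int my_ticket = fetch_and_increment(&l->next_ticket);
    while (my_ticket != l->now_serving)
        pause(my_ticket - l->now_serving);  // proportional backoff: the farther
                                            // back in line, the longer the wait
}

void ticket_release(struct ticket_lock *l) {
    l->now_serving++;  // plain store; every waiter re-reads it (the contention noted above)
}
```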
Array-Based Queueing Locks
- Components
  - Queue `slots[N]`, each element lies in a different cache line
    - Has-lock (HL)
    - Must-wait (MW)
  - Counter `next_slot`
- Steps (see the sketch below)
  - Incoming processes enqueued (`my_place = fetch-and-inc(next_slot)`)
  - Lock-acquire: `while (slots[my_place] == MW);`
  - Lock-release: `slots[(my_place+1) % N] = HL;`
- Pros
  - Atomic op `fetch-and-inc` to obtain a location to spin on
  - Each CPU spins on a different location in a distinct cache line
  - Each CPU clears the lock for its successor (must-wait -> has-lock)
- Cons
  - Space complexity O(N) per lock (N bounded by # of processors)
  - Queue is static
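A sketch under these assumptions: `N` is the maximum number of contending CPUs, cache lines are 64 bytes, and `fetch_and_increment` returns the old value:

```c
#define N 64  // assumption: max number of contending CPUs

enum { MUST_WAIT, HAS_LOCK };

struct array_lock {
    struct { int flag; char pad[60]; } slots[N];  // one cache line per slot
    int next_slot;
};
// Init: slots[0].flag = HAS_LOCK, all other slots MUST_WAIT.

int array_acquire(struct array_lock *l) {
    int my_place = fetch_and_increment(&l->next_slot) % N;
    while (l->slots[my_place].flag == MUST_WAIT)
        ;  // spin on a slot in a private cache line
    l->slots[my_place].flag = MUST_WAIT;  // reset the slot so it can be reused
    return my_place;                      // caller passes this to release
}

void array_release(struct array_lock *l, int my_place) {
    l->slots[(my_place + 1) % N].flag = HAS_LOCK;  // hand off to successor
}
```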
List-Based Queueing Locks (MCS Locks)
- Steps (see the sketch after this section)
  - Incoming processes enqueued (`predecessor = fetch-and-store(L, me)`)
  - Lock-acquire: wait for predecessor to signal
  - Lock-release: remove `me` from `L` & signal successor
    - If no successor, remove self atomically with `compare-and-swap(L, me, nil)`
    - If the compare-and-swap fails, a new process has already made itself the tail but has not yet set `me->next`; spin while `me->next == nil`, then signal that successor
- Properties
- Lock points at tail of queue
- Compare-and-swap detects whether it is the only processor in the queue and, if so, atomically removes itself
- Pros
- Spins on local flag variables only
- Requires a small constant amount of space per lock (each waiting process supplies its own queue node, so total space is bounded by # of requests)
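A sketch of the queue node and both paths; `fetch_and_store` is assumed to atomically swap a pointer and return the old one, and `compare_and_swap` to return whether it installed the new value:

```c
struct qnode {
    struct qnode *next;
    int locked;  // 1 = must wait
};
typedef struct qnode *mcs_lock;  // points at the tail of the queue (NULL if free)

void mcs_acquire(mcs_lock *L, struct qnode *me) {
    me->next = NULL;
    struct qnode *pred = fetch_and_store(L, me);  // enqueue self as the new tail
    if (pred != NULL) {          // queue was non-empty: wait for predecessor
        me->locked = 1;
        pred->next = me;         // link in so the predecessor can find us
        while (me->locked)
            ;                    // spin on our own flag only
    }
}

void mcs_release(mcs_lock *L, struct qnode *me) {
    if (me->next == NULL) {                      // no successor visible yet
        if (compare_and_swap(L, me, NULL))
            return;                              // queue really empty: lock is free
        while (me->next == NULL)
            ;  // a newcomer swapped itself in; wait for its link to appear
    }
    me->next->locked = 0;  // signal the successor
}
```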
Summary
- TAS with proper backoff
  - Scales well
  - More network load than the queue-based locks
- Ticket lock with proper backoff
  - Scales well
  - More network load than the queue-based locks
- Array-based locks
- Scales best
- Least interconnect contention
- High space requirement for large numbers of CPUs
- Fair
- MCS
- Scales best
- Least interconnect contention
- Low space requirement
- Fair
- Benefits from compare-and-swap
Case Studies
- 48-core AMD Opteron
- 6 cores per die, total 8 dies
- LLC not shared
- Directory-based cache coherence
- 80-core Intel Xeon
- 10 cores per die, total 8 dies
- LLC shared
- Snooping-based cache coherence
- Cross-socket communication can take 2 hops, making it expensive
- Latency of remote access
  - Read
    - Opteron
      - Uniform latency regardless of cache state (directory-based cache coherence)
    - Xeon
      - Shared state faster (served from LLC)
  - Write
    - Opteron
      - Store to shared state expensive (waits for all invalidations to complete)
    - Xeon
      - Uniform latency regardless of cache state (snooping-based cache coherence)
- Implications
- Cache coherence is expensive
- Avoid unnecessary sharing
- Crossing sockets is expensive
- Can be slower than running on a single core
- Use the `pthread` CPU affinity mask to pin cooperative threads on cores within the same die (see the sketch below)
- Loads & stores can be as expensive as atomic operations
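For example, on Linux a cooperative thread can be pinned with `pthread_setaffinity_np` (a GNU extension); cores 0-5 here stand in for one 6-core Opteron die:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to cores 0-5 (one die on the 48-core Opteron).
void pin_to_first_die(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 6; cpu++)
        CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}
```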