Contents
- Locking Parallelization Options
- Basic Parallel Improvements
- OpenMP & Code Restructuring Optimizations
- Characteristics of Parallel Code
Locking Parallelization Options
- Coarse-grained locking: easy to program; limited parallelism
- Fine-grained locking: hard to program; good parallelism (lock sketch after this list)
- Deadlock, fault tolerance, priority inversion, etc.
- Transactional memory: easy to program, promises good parallelism
- Checkpoints execution -> detects conflicts -> commits, or aborts and re-executes
- Machine limits: aborts if the transaction is too large and would evict lines from the L1 cache
- Overhead
- Bookkeeping
- Failure recovery
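A minimal sketch of the coarse- vs fine-grained trade-off (not from the notes; the counter table, NBUCKETS, and function names are made up), using OpenMP locks on a shared array of counters:

#include <omp.h>
#include <stdio.h>

#define NBUCKETS 1024

static int counts[NBUCKETS];
static omp_lock_t global_lock;            /* coarse-grained: one lock for the whole table */
static omp_lock_t bucket_locks[NBUCKETS]; /* fine-grained: one lock per bucket */

/* Coarse-grained update: trivial to reason about, but every thread
 * contends for the same lock, so all updates are serialized. */
static void add_coarse(int b) {
    omp_set_lock(&global_lock);
    counts[b]++;
    omp_unset_lock(&global_lock);
}

/* Fine-grained update: updates to different buckets proceed in parallel,
 * at the cost of more locks to initialize, order, and debug. */
static void add_fine(int b) {
    omp_set_lock(&bucket_locks[b]);
    counts[b]++;
    omp_unset_lock(&bucket_locks[b]);
}

int main(void) {
    omp_init_lock(&global_lock);
    for (int i = 0; i < NBUCKETS; i++)
        omp_init_lock(&bucket_locks[i]);

    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++)
        add_coarse(i % NBUCKETS);   /* one global lock: heavy contention */

    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++)
        add_fine(i % NBUCKETS);     /* per-bucket locks: disjoint buckets update concurrently */

    printf("counts[0] = %d\n", counts[0]);
    return 0;
}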
Basic Parallel Improvements
OpenMP & Code Restructuring Optimizations
Compiler & library implementation.
annotated source -> OpenMP compiler -> sequential program (via compiler switch)
                                    -> parallel program
OpenMP Directives
- Parallelization directives
#pragma omp parallel for
#pragma omp parallel for schedule(kind[, chunk])   // e.g. schedule(static) for block, schedule(static, 1) for cyclic
- Data environment directives: shared, private, threadprivate, reduction (usage sketch after this list)
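A short sketch of these clauses in use (illustrative only; the array and variable names are made up). Built with the OpenMP switch (e.g. gcc -fopenmp) the pragmas yield a parallel program; without it the same source builds sequentially, matching the pipeline above.

#include <stdio.h>

int main(void) {
    enum { N = 1000 };
    double a[N], sum = 0.0, tmp;
    int i;

    #pragma omp parallel for private(tmp) shared(a)
    for (i = 0; i < N; i++) {
        tmp = i * 0.5;          /* each thread gets its own copy of tmp */
        a[i] = tmp * tmp;
    }

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += a[i];            /* per-thread partial sums, combined at the end */

    printf("sum = %f\n", sum);
    return 0;
}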
Code Restructuring Optimizations
Eliminate/reduce dependences by restructuring the code.
Private variables
#pragma omp parallel for private(tmp)
for (i = 0; i < n; i++) {
    tmp = a[i];
    a[i] = b[i];
    b[i] = tmp;
}
Induction variable elimination
for (i = 0, index = 0; i < n; i++) {
    index += i;
    a[i] = b[index];
}
->
#pragma omp parallel for
for (i = 0; i < n; i++)
    a[i] = b[i*(i+1)/2];
Loop splitting
for (i = 0, index = 0; i < n; i++) {
    index += f(i);
    b[i] = g(a[index]);
}
->
for (i = 0, idx = 0; i < n; i++) {      /* sequential: precompute the indices */
    idx += f(i);
    index[i] = idx;
}
#pragma omp parallel for
for (i = 0; i < n; i++)
    b[i] = g(a[index[i]]);
while (*a) {
    process(a);
    a++;
}
->
for (count = 0, p = a; *p; count++, p++)
    ;                                    /* sequential: count the iterations */
#pragma omp parallel for
for (i = 0; i < count; i++)
    process(&a[i]);
Loop reordering
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] += b[i][k] + c[k][j];
->
#pragma omp parallel for
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            a[i][j] += b[i][k] + c[k][j];
Loop fusion: reduce loop startup & thread overhead
#pragma omp parallel for
for (i = 0; i < n; i++)
    a[i] = b[i];
#pragma omp parallel for
for (i = 0; i < n; i++)
    c[i] = b[i] * b[i];
->
#pragma omp parallel for
for (i = 0; i < n; i++) {
    a[i] = b[i];
    c[i] = b[i] * b[i];
}
Loop peeling
for (i = 0, wrap = n; i < n; i++) {
    b[i] = a[i] + a[wrap];
    wrap = i;
}
->
b[0] = a[0] + a[n];
#pragma omp parallel for
for (i = 1; i < n; i++)
    b[i] = a[i] + a[i-1];
for (i = 0; i < n; i++)
    a[i+m] = a[i] + b[i];
->
if (m > n) {
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i+m] = a[i] + b[i];
} else {
    ...   /* reads and writes of a[] may overlap: cannot be parallelized as-is */
}
Characteristics of Parallel Code
- Granularity
- Load balance
- Locality
- Communication & synchronization
Granularity
Size of the program unit executed by a single processor (sketch after this list).
- Fine granularity
- Ability to use many processors
- Finer-grain load balancing
- Increased overhead
- More critical section accesses & contention
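A hedged sketch of the granularity choice (sizes and function names are illustrative): parallelizing the outer loop gives each thread whole rows (coarse grain, overhead paid once), while parallelizing only the inner loop re-enters the parallel region on every outer iteration (fine grain, more overhead but finer-grain load balancing).

enum { N = 1024, M = 1024 };
static double a[N][M], b[N][M], c[N][M];

void sum_coarse(void) {
    /* Coarse grain: one parallel region; each thread handles whole rows. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            c[i][j] = a[i][j] + b[i][j];
}

void sum_fine(void) {
    /* Fine grain: the parallel region is re-created N times and each region
     * only covers one row's worth of work, so startup and scheduling
     * overhead is paid N times. */
    for (int i = 0; i < N; i++) {
        #pragma omp parallel for
        for (int j = 0; j < M; j++)
            c[i][j] = a[i][j] + b[i][j];
    }
}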
Load balance
Difference in execution time across processors between barriers (schedule sketch at the end of this section).
- Execution time predictable?
- Regular data parallel: yes
- Irregular data parallel or pipeline: perhaps
- Task queue: no
- Static load balancing: done once by programmer
- Block: good locality, poor load balance
- Cyclic: good load balance, poor locality
- Block-cyclic
- Fine for regular data parallel
- Dynamic load balancing: done at runtime; task queue
- Centralized: easy to program, good load balance
- Distributed: less communication/synchronization; more memory & larger working set
- Task stealing: extra overhead, difficult to program, better load balance
- Unpredictable execution times
- Regular data parallel in heterogeneous/multitasked environment
- Usually high overhead
- Semi-static load balancing: done once at runtime
- Partition computation using measurements of cost of program parts
- Done once, done every iteration, done every n iterations
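A sketch of how these policies map onto OpenMP schedule clauses (the chunk size and the work() function are illustrative assumptions, not from the notes):

/* Hypothetical per-iteration work whose cost grows with i. */
static double work(int i) {
    double s = 0.0;
    for (int k = 0; k < i; k++)
        s += k * 1e-9;
    return s;
}

void run_schedules(int n, double *out) {
    int i;

    /* Static block: contiguous chunks per thread; good locality, but poor
     * balance here because later iterations are more expensive. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++) out[i] = work(i);

    /* Static cyclic (chunk size 1): iterations dealt out round-robin;
     * better balance for costs that grow with i, worse locality. */
    #pragma omp parallel for schedule(static, 1)
    for (i = 0; i < n; i++) out[i] = work(i);

    /* Dynamic: a centralized queue of chunks; idle threads grab the next
     * chunk, trading runtime scheduling overhead for load balance. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (i = 0; i < n; i++) out[i] = work(i);
}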