Content

  1. Parallelism Overview
    1. Standard Models of Parallelism
    2. Speedup
    3. Dependence
  2. Threads & Processes
    1. Pthreads
    2. Synchronization
  3. Data Parallelism
    1. Matrix Multiplication (Regular)
    2. SOR (Regular)
    3. Molecular Dynamics (Irregular)
  4. Task Parallelism
    1. PIPE (Pipelines)
    2. TSP (Task Queue)

Parallelism Overview

Standard Models of Parallelism

  • Shared memory (pthreads)
  • Shared memory + data parallelism (OpenMP)
  • Message passing (MPI)
  • Transactional memory (TM)

Speedup

  • speedup on p processors = (execution time of the best known sequential algorithm) / (execution time on p processors)
    • Using the best sequential algorithm as the baseline avoids inflating speedup by picking an easily parallelizable algorithm with poor sequential execution time
  • Linear speedup (speedup = p) is normally the best that can be achieved
    • Sub-linear speedup: due to overhead of startup, synchronization, communication, ...
    • Super-linear speedup
      • Cache/memory effects
      • Nondeterminism in search problems
  • Amdahl's Law: if a fraction 1/s of the program is sequential, speedup can never exceed s
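    • e.g. if 10% of the run time is sequential (1/s = 0.1, s = 10), speedup = 1 / (0.1 + 0.9/p) < 10 no matter how many processors p are used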

Dependence

  • True dependence
    • S2 reads a value written by S1
  • Anti dependence
    • S2 writes a value read by S1
  • Output dependence
    • S2 writes a value written by S1
  • Loop carried dependence
    • Statements are part of the execution of a loop
    • e.g. a[i] = f(a[i-1]) + 1;
    • Prevents loop iteration parallelization
    • Level of loop carried dependence: nesting depth of loop that carries dependence
    • Contrast: a loop-independent dependence occurs within a single iteration and does not prevent parallelizing across iterations
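
A minimal sketch of the first three kinds (S1 executes before S2; the variables and statements are illustrative):

void dependence_sketch(void) {
    int x = 1, y = 2, a, b;
    // True dependence: S2 reads the value S1 wrote
    a = x + 1;   // S1
    b = a * 2;   // S2
    // Anti dependence: S2 overwrites the value S1 read
    b = a * 2;   // S1
    a = x + 1;   // S2
    // Output dependence: S2 overwrites the value S1 wrote
    a = x + 1;   // S1
    a = y + 2;   // S2
}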

Threads & Processes

  • Similarities
    • Own logical control flow
    • Run concurrently with others
  • Differences
    • Threads share code & data, processes don't
    • Process control & context switches more expensive than thread control & thread switches

Pthreads

  • User has to:
    • Decompose computation into parallel parts
    • Create/destroy threads
    • Add synchronization to ensure all dependences are covered
  • Functions
    • pthread_create()
    • pthread_exit()
    • pthread_join()
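
A minimal create/join sketch (the worker body, thread count, and id-passing convention are illustrative):

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

void *worker(void *arg) {
    long id = (long)arg;             /* decompose: each thread gets its part */
    printf("thread %ld running\n", id);
    return NULL;                     /* equivalent to pthread_exit(NULL) */
}

int main(void) {
    pthread_t tid[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);  /* synchronize: wait for each thread */
    return 0;
}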

Synchronization

Parallelization Options

  • Coarse-grained locking: 1 big lock for the whole critical section
    • Limited parallelism
    • Easy to implement
  • Fine-grained locking: 1 lock per variable used in the critical section (both granularities are sketched after this list)
    • Good parallelism: fewer dependences introduced by locking
    • Hard to implement correctly
      • Deadlock, fault tolerance, priority inversion, ...
  • Semaphores
  • Fork-Join
  • Barriers
  • Reductions
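
A minimal sketch of the coarse- vs fine-grained contrast on two shared counters (the variables, lock names, and increment operations are illustrative):

#include <pthread.h>

int x = 0, y = 0;

/* Coarse-grained: one lock serializes every access to x and y */
pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

void inc_both_coarse(void) {
    pthread_mutex_lock(&big_lock);
    x++;
    y++;
    pthread_mutex_unlock(&big_lock);
}

/* Fine-grained: one lock per variable; a thread touching only x
   never waits for a thread touching only y */
pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t y_lock = PTHREAD_MUTEX_INITIALIZER;

void inc_x_fine(void) {
    pthread_mutex_lock(&x_lock);
    x++;
    pthread_mutex_unlock(&x_lock);
}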

Data Parallelism

Processors do the same thing on different data.

  • Regular: linear indexing
    • All arrays accessed through linear expressions of loop indices, known at compile time
  • Irregular: non-linear indexing
    • Some arrays accessed through non-linear expressions of loop indices, some only known at runtime
    • Difficult for parallelism based on data distribution
    • Not difficult for parallelism based on iteration distribution

Matrix Multiplication (Regular)

for (i = 0; i < n; i++) 
    for (j = 0; j < n; j++)
        c[i][j] = 0.0; 
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];
  • No loop-carried dependences on the i- or j-loops
    • Iterations can run in parallel
  • Loop-carried dependence (a reduction into c[i][j]) on the k-loop
  • Data Distribution
    • Block distribution
    • Block distribution by row (sketched after this list)
    • Block distribution by column
    • Cyclic distribution by column
    • Block cyclic
    • Combinations
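
A minimal sketch of block distribution by row using pthreads (the sizes N and P, P dividing N, and folding the initialization into the same pass are my simplifying assumptions):

#include <pthread.h>

#define N 512
#define P 4                               /* assume P divides N */

double a[N][N], b[N][N], c[N][N];

void *mm_worker(void *arg) {
    long id = (long)arg;
    int lo = id * (N / P), hi = lo + N / P;   /* this thread's row block */
    for (int i = lo; i < hi; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    return NULL;
}

int main(void) {
    pthread_t tid[P];
    for (long t = 0; t < P; t++)
        pthread_create(&tid[t], NULL, mm_worker, (void *)t);
    for (long t = 0; t < P; t++)
        pthread_join(tid[t], NULL);       /* fork-join: wait for all row blocks */
    return 0;
}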

SOR (Regular)

for some number of timesteps/iterations {
    for (i = 1; i < n; i++)
        for (j = 1; j < n; j++)
            temp[i][j] = 0.25 *
                (grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1]);
    for (i = 1; i < n; i++)
        for (j = 1; j < n; j++)
            grid[i][j] = temp[i][j];
}
  • No dependences between iterations within the 1st (i,j) loop nest
    • Can run in parallel
  • No dependences between iterations within the 2nd (i,j) loop nest
    • Can run in parallel
  • Anti-dependence between the 1st & 2nd loop nests in the same timestep
    • Fork-join (or a barrier) between the nests
  • True dependence between the 2nd loop nest & the 1st loop nest of the next timestep
    • Fork-join (or a barrier) between timesteps; see the sketch below
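
A sketch of the two fork-join points expressed with a barrier instead of repeated create/join (pthread_barrier_t is POSIX; the row-block partition and fixed sizes are my assumptions):

#include <pthread.h>

#define N 256
#define P 4
#define STEPS 100

double grid[N + 1][N + 1], temp[N + 1][N + 1];
pthread_barrier_t bar;                /* initialized to P in main */

void *sor_worker(void *arg) {
    long id = (long)arg;
    int lo = 1 + id * (N - 1) / P, hi = 1 + (id + 1) * (N - 1) / P;
    for (int t = 0; t < STEPS; t++) {
        for (int i = lo; i < hi; i++)
            for (int j = 1; j < N; j++)
                temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j]
                                   + grid[i][j-1] + grid[i][j+1]);
        pthread_barrier_wait(&bar);   /* covers the anti-dependence */
        for (int i = lo; i < hi; i++)
            for (int j = 1; j < N; j++)
                grid[i][j] = temp[i][j];
        pthread_barrier_wait(&bar);   /* covers the true dependence */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[P];
    pthread_barrier_init(&bar, NULL, P);
    for (long t = 0; t < P; t++)
        pthread_create(&tid[t], NULL, sor_worker, (void *)t);
    for (long t = 0; t < P; t++)
        pthread_join(tid[t], NULL);
    pthread_barrier_destroy(&bar);
    return 0;
}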

Molecular Dynamics (Irregular)

for some number of timesteps { 
    for all molecules i
        for all nearby molecules j
            force[i] += f(loc[i], loc[j]);
    for all molecules i
        loc[i] = g(loc[i], force[i]);
}

for each molecule i
    number of nearby molecules: count[i]
    array of indices of nearby molecules: index[i][j] // 0 <= j < count[i]

for some number of timesteps { 
    for (i = 0; i < num_mol; i++)
        for (j = 0; j < count[i]; j++)
            force[i] += f(loc[i], loc[index[i][j]]);
    for (i = 0; i < num_mol; i++)
        loc[i] = g(loc[i], force[i]);
}
  • No loop-carried dependences in the 1st i-loop
    • Can run in parallel
    • May have a load-balancing problem, since count[i] varies per molecule (a dynamic-assignment sketch follows this list)
  • Loop-carried dependence (reduction) on the j-loop
  • No loop-carried dependences in the 2nd i-loop
    • Can run in parallel
  • True dependence between the 1st & 2nd i-loops
    • Fork-join between the loops
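
One way to attack the load imbalance with iteration distribution: threads grab molecule indices from a shared counter, so expensive molecules don't stall a fixed partition. The self-scheduling counter and all declarations below are illustrative, mirroring the pseudocode above (f is assumed to be defined elsewhere):

#include <pthread.h>

#define MAX_MOL 1024
#define MAX_NEAR 64

int num_mol;
int count[MAX_MOL];                 /* number of nearby molecules */
int index_[MAX_MOL][MAX_NEAR];      /* nearby-molecule indices; named index_
                                       to avoid the C library's index()   */
double loc[MAX_MOL], force[MAX_MOL];
double f(double a, double b);       /* force contribution, defined elsewhere */

int next_i = 0;                     /* next unclaimed molecule */
pthread_mutex_t next_lock = PTHREAD_MUTEX_INITIALIZER;

int grab_molecule(void) {           /* hand out molecules one at a time */
    pthread_mutex_lock(&next_lock);
    int i = (next_i < num_mol) ? next_i++ : -1;
    pthread_mutex_unlock(&next_lock);
    return i;
}

void *force_worker(void *arg) {
    int i;
    while ((i = grab_molecule()) != -1)   /* cheap molecules finish fast;
                                             busy ones don't stall others */
        for (int j = 0; j < count[i]; j++)
            force[i] += f(loc[i], loc[index_[i][j]]);
    return NULL;
}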

Task Parallelism

Processors do different tasks.

PIPE (Pipelines)

  1. Case 1

     for (i = 0; i < num_pics && read(in_pic[i]); i++) {
         int_pic_1[i] = trans1(in_pic[i]); 
         int_pic_2[i] = trans2(int_pic_1[i]); 
         int_pic_3[i] = trans3(int_pic_2[i]); 
         out_pic[i] = trans4(int_pic_3[i]);
     }
    
     // Assign each transformation to a processor
     // Processor 1
     for (i = 0; i < num_pics && read(in_pic[i]); i++) {
         int_pic_1[i] = trans1(in_pic[i]); 
         signal(event_1_2[i]);
     }
     // Processor 2
     for (i = 0; i < num_pics; i++) {
         wait(event_1_2[i]);
         int_pic_2[i] = trans2(int_pic_1[i]);
         signal(event_2_3[i]);
     }
     // Processor 3
     ...
     // Processor 4
     for (i = 0; i < num_pics; i++) {
         wait(event_3_4[i]);
         out_pic[i] = trans4(int_pic_3[i]);
     }
    
    • Problem
      • Stages take variable (and differing) amounts of time, so a processor may try to consume a result its upstream stage has not produced yet
    • Use semaphore wait/signal with pthreads: unlike a plain event, a semaphore remembers a signal posted before the corresponding wait
  2. Case 2

     for (i = 0; i < num_pics && read(in_pic); i++) {
         int_pic_1 = trans1(in_pic); 
         int_pic_2 = trans2(int_pic_1); 
         int_pic_3 = trans3(int_pic_2); 
         out_pic = trans4(int_pic_3);
     }
    
    • Anti-dependence between iterations (the scalar int_pic_* variables are reused) -> no parallelism as written
      • Privatization: give each iteration its own copies
      • Buffer between stages; block when a buffer is full/empty (sketched below)
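
A sketch of the blocking buffer between two adjacent stages using POSIX semaphores (the buffer size, the pic_t stand-in type, and the two-stage cut are illustrative):

#include <semaphore.h>

#define BUF 8
typedef int pic_t;              /* stand-in for the real picture type */

pic_t buffer[BUF];
sem_t slots, items;             /* init: sem_init(&slots, 0, BUF);
                                         sem_init(&items, 0, 0);    */

/* upstream stage: blocks while the buffer is full */
void put(pic_t p) {
    static int in = 0;          /* safe: only the producer calls put() */
    sem_wait(&slots);
    buffer[in] = p;
    in = (in + 1) % BUF;
    sem_post(&items);
}

/* downstream stage: blocks while the buffer is empty */
pic_t get(void) {
    static int out = 0;         /* safe: only the consumer calls get() */
    sem_wait(&items);
    pic_t p = buffer[out];
    out = (out + 1) % BUF;
    sem_post(&slots);
    return p;
}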

TSP (Task Queue)

init_q(); init_best();
while ((p = de_queue()) != NULL) {
    for each expansion by one city { 
        q = add_city(p);
        if (complete(q)) { update_best(q) } 
        else { en_queue(q) }
    }
}

Trade-off: finer task granularity gives better load balance but more synchronization overhead.

  1. Each process handles one expansion
  2. Each process handles the expansion of one partial path

     en_queue()/de_queue() {
         pthread_mutex_lock(&queue);
         ...;
         pthread_mutex_unlock(&queue);
     }
     update_best() {
         pthread_mutex_lock(&best);
         ...;
         pthread_mutex_unlock(&best);
     }
     de_queue() {
         while ((q is empty) and (not done)) {
             waiting++;
             if (waiting == p) {  // p = number of threads
                 done = true;
                 pthread_cond_broadcast(&empty);
             }
             else {
                 pthread_cond_wait(&empty, &queue);
                 waiting--;
             }
         }
         if (done) return NULL;
         else remove and return head of the queue;
     }
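
    • The waiting counter implements termination detection: once all p threads are waiting on an empty queue, no new work can ever be added, so the last thread to arrive sets done and broadcasts to wake the rest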
    
  3. Each process handles the expansion of multiple partial paths

Busy Waiting

initially: flag = 0;
P1: produce data; flag = 1;
P2: while (!flag); consume data;

Used when few threads compete for a core, or when critical sections are very short.
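
On real hardware a plain int flag is not guaranteed to work, since compilers and CPUs may reorder the accesses; a minimal sketch using C11 atomics (the acquire/release choice is mine, not from the notes):

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int data;
atomic_int flag = 0;

void *producer(void *arg) {
    data = 42;                                   /* produce data */
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

void *consumer(void *arg) {
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                                        /* spin: busy waiting */
    printf("consumed %d\n", data);               /* consume data */
    return NULL;
}

int main(void) {
    pthread_t p1, p2;
    pthread_create(&p1, NULL, producer, NULL);
    pthread_create(&p2, NULL, consumer, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    return 0;
}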
