Content
- Instruction Scheduling
- H/W Resources
- Data Dependencies
- Software Pipelining
- Example
- Modulo Scheduling
Instruction Scheduling
- Motivation
- RISC architecture philosophy: expose H/W features to compilers for optimizations
- Scheduling Constraints
- H/W resources
- Data dependencies
- Control dependencies
H/W Resources
- Conflict between register allocation & instruction scheduling
- Register allocation: minimize # of registers
- Instruction scheduling: reduce stalls by using more registers
- Cache
- e.g. loop reordering, cache blocking
Data Dependencies
Directed acyclic graph with edges indicating dependency. Find topological sort to produce better sequence.
- Heuristics
- Instructions that can run to completion without interference before those with interference
- Instructions with more dependents before those with less dependents
- Latency-weighted path
Software Pipelining
- Hiding instruction latency
- Increase distance between dependent operations by spreading them over more iterations; latency can be absorbed by other non-dependent operations
- Often used in combination with loop unrolling
- Variable latency of instructions?
- Unroll multiple times, loop body huge; bad for cache
Example
for i = 1 to bignumber
A(i) ; 3 cycle latency
B(i) ; 3
C(i) ; 12(perhaps a floating point operation)
D(i) ; 3
E(i) ; 3
F(i) ; 3
end
->
prologue
for i = 1 to (bignumber - 6)
A(i+6)
B(i+5)
C(i+4)
D(i+2) ; note that we skip i+3
E(i+1)
F(i)
end
epilogue
Distance of 12
between C(i)
and D(i)
.
; loop prologue (arranged on lines for clarity)
A(1)
A(2), B(1)
A(3), B(2), C(1)
A(4), B(3), C(2)
A(5), B(4), C(3), D(1)
A(6), B(5), C(4), D(2), E(1)
; loop epilogue (arranged on lines for clarity)
B(bignumber), C(bignumber-1), D(bignumber-3), E(bignumber-4), F(bignumber-5)
C(bignumber), D(bignumber-2), E(bignumber-3), F(bignumber-4)
D(bignumber-1), E(bignumber-2), F(bignumber-3)
D(bignumber), E(bignumber-1), F(bignumber-2)
E(bignumber), F(bignumber-1)
F(bignumber)
Modulo Scheduling
Interleaving instructions of an inner loop to achieve better parallelism of software pipelining.
Repeatedly schedule body of loop with different initiation intervals until perfect schedule reached.