Content
- Very Long Instruction Word (VLIW)
- Extracting ILP in Software
  - Loop Unrolling
  - Software Pipelining
  - Trace Scheduling
- Problems
- Vector Processor
  - Vector Instruction Parallelism
  - Vector Chaining
- Multimedia/SIMD Extensions
  - Multimedia vs. Vector
Very Long Instruction Word (VLIW)
- Packs multiple operations into one long instruction
- Each slot holds a fixed operation type
- Operation latencies are constant and specified by the ISA
- Requires guarantees of:
  - Parallelism within an instruction -> no RAW checks
  - No data use before ready -> no data interlocks

- H/W
  - Does little to extract ILP
  - Simple and deterministic
  - Exposes timing information to S/W
  - No hazard checks at runtime
- S/W
  - Extracts as much ILP as possible
  - Schedules parallel operations
  - Inserts NOPs to avoid hazards
 
Extracting ILP in Software
Loop Unrolling

- Benefits
  - Increased basic block size -> more scheduling opportunities for the compiler
  - Particularly effective when there are no loop-carried dependences
- Tradeoffs
  - Increased code size
  - Increased pressure on the instruction cache
  - Final cleanup code needed for leftover iterations
 
Software Pipelining

- Instructions from different iterations execute together in one parallel instruction
- Often combined with loop unrolling (modulo scheduling)
 
Loop Unrolling vs. Software Pipelining
With unrolling, startup/wind-down overhead is paid once per unrolled iteration; with software pipelining, only once per loop (in the prologue and epilogue).
Trace Scheduling
In code without loops, irregular control flow makes it difficult to find ILP within individual basic blocks.

- Pick the most frequent branch path as the trace
- Schedule the whole trace at once, as if it were one big basic block
- Add fixup (compensation) code for the off-trace paths
- Use profiling feedback or compiler heuristics to find the common branch paths
Problems
- Classic VLIW
  - Object-code compatibility
    - Must recompile all code for every machine generation
  - Object code size
    - Instruction padding (NOPs) wastes instruction memory/cache
    - Loop unrolling/software pipelining replicate code
  - Scheduling variable-latency memory operations
    - Caches and memory bank conflicts impose statically unpredictable variability
  - Knowing branch probabilities
    - Profiling adds a significant extra step to the build process
  - Scheduling for statically unpredictable branches
    - The optimal schedule varies with the branch path

- In short:
  - Object-code compatibility
  - Static scheduling
    - Unpredictable branches
    - Variable memory latency (unpredictable cache misses)
  - Code size explosion
  - Compiler complexity

- VLIW
  - Failed in the general-purpose computing market
  - Succeeded in the embedded DSP market
Vector Processor
- Scalar unit
  - Load/store architecture
- Vector extension
  - Vector registers
  - Vector instructions
- Implementation
  - H/W control
  - Highly pipelined functional units
  - Interleaved memory system
  - No data caches
  - No virtual memory

- Advantages
  - Compact: one instruction encodes N operations
  - Expressive: tells H/W that these N operations
    - are independent
    - use the same functional unit
    - access disjoint registers
    - access registers in the same pattern as previous instructions
    - access a contiguous block of memory (unit-stride load/store)
    - access memory in a known pattern (strided load/store)
  - Scalable: the same code can run on machines with more parallel pipelines (lanes)
Vector Instruction Parallelism
With 32 elements per vector register and 8 lanes, three chained functional units (e.g., load, multiply, add) can complete 24 operations/cycle while issuing only one vector instruction per cycle.

Vector Chaining
- Forward each element of a vector result to a dependent vector instruction as soon as it is produced, rather than waiting for the whole vector register to be written
- The vector analogue of register bypassing/forwarding

Multimedia/SIMD Extensions
Single Instruction, Multiple Data: one instruction operates on several narrow elements packed into a wide register.

Multimedia vs. Vector
- Limited instruction set
  - No vector length control
  - No strided load/store or scatter/gather
  - Unit-stride loads must be aligned to 64/128-bit boundaries
- Limited vector register length
  - Requires superscalar dispatch to keep functional units busy
  - Loop unrolling to hide latencies -> increases register pressure