Content
- Very Long Instruction Word (VLIW)
- Extracting ILP in Software
- Loop Unrolling
- Software Pipelining
- Trace Scheduling
- Problems
- Vector Processor
- Vector Instruction Parallelism
- Vector Chaining
- Multimedia/SIMD Extensions
- Multimedia vs. Vector
Very Long Instruction Word (VLIW)
- Pack multiple instructions into 1
- Each slot fixed operation
- Operation latencies are constant and specified by the architecture
- Requires guarantee of
- Parallelism within instruction -> no RAW check
- No data use before ready -> no data interlocks
- H/W
- Does little to extract ILP
- Simple and deterministic
- Expose timing information to S/W
- No hazard check during runtime
- S/W
- Extract ILP as much as possible
- Schedule parallel operations
- Insert NOPs to avoid hazards
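As a sketch, a hypothetical 3-slot VLIW (one ALU, one memory, one branch slot) might encode a load-use sequence like this, with the compiler inserting a NOP bundle to cover an assumed fixed 2-cycle load latency:

```
{ add  r1, r2, r3 | load r4, 0(r5) | nop }  ; independent ops packed into 1 instruction
{ nop             | nop            | nop }  ; NOPs cover the 2-cycle load latency
{ add  r6, r4, r1 | nop            | nop }  ; r4 guaranteed ready -> no H/W interlock
```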
Extracting ILP in Software
Loop Unrolling
- Benefits
- Increased basic block size -> better for compiler to schedule
- Particularly effective when there are no loop-carried dependences
- Tradeoffs
- Increased code size
- Increased pressure on cache
- Final cleanup needed
Software Pipelining
- Instructions from different iterations can execute in one parallel instruction
- Loop unrolling often in combination
- Modulo scheduling
Loop Unrolling vs. Software Pipelining
Loop unrolling pays startup/wind-down overhead once per unrolled iteration; software pipelining pays it only once per loop (prologue/epilogue).
Trace Scheduling
Without loops, irregular code offers little ILP within any individual basic block.
- Pick the most frequent branch path as the trace
- Schedule the trace at once
- Add fixup code
- Profiling feedback or compiler heuristics to find common branch paths
Problems
- Classic VLIW
- Object-code compatibility
- Recompile all code for every machine
- Object code size
- Instruction padding wastes instruction memory/cache
- Loop unrolling/software pipelining replicates code
- Scheduling variable latency memory operations
- Caches/memory bank conflicts impose statically unpredictable variability
- Knowing branch probabilities
- Profiling -> a significant extra step in build process
- Scheduling for statically unpredictable branches
- Optimal schedule varies with branch path
- Object-code compatibility
- Static Scheduling
- Unpredictable branches
- Variable memory latency (unpredictable cache misses)
- Code size explosion
- Compiler complexity
- VLIW
- Failed in general-purpose computing area
- Successful in embedded DSP market
Vector Processor
- Scalar unit
- Load/store structure
- Vector extension
- Vector registers
- Vector instructions
- Implementation
- H/W control
- Highly pipelined functional units
- Interleaved memory systems
- No data caches
- No virtual memory
- Advantages
- Compact
- 1 instruction encodes multiple operations
- Expressive, tells H/W that these operations are:
- independent
- use the same functional unit
- access disjoint registers
- access registers in same pattern as previous instructions
- access contiguous block of memory (unit-stride load/store)
- access memory in known pattern (strided load/store)
- Scalable
- Can run same code on more parallel pipelines (lanes)
Vector Instruction Parallelism
With 32 elements per vector register and 8 lanes, three overlapped vector instructions (e.g., load, multiply, add) complete 3 x 8 = 24 operations/cycle while issuing only 1 short instruction/cycle.
Vector Chaining
- Forward each result element from the producing vector functional unit to a dependent vector instruction as it is produced -> dependent vector instructions overlap instead of waiting for whole-vector completion
Multimedia/SIMD Extensions
Single instruction, multiple data: one instruction applies the same operation to several data elements packed in a wide register.
Multimedia vs. Vector
- Limited instruction set
- No vector length control
- No strided load/store or scatter/gather
- Unit-stride loads must be aligned to a 64/128-bit boundary
- Limited vector register length
- Requires superscalar dispatch to keep units busy
- Loop unrolling to hide latencies -> increases register pressure