Content
- Very Long Instruction Word (VLIW)
- Extracting ILP in Software
  - Loop Unrolling
  - Software Pipelining
  - Trace Scheduling
- Problems
- Vector Processor
  - Vector Instruction Parallelism
  - Vector Chaining
- Multimedia/SIMD Extensions
  - Multimedia vs. Vector
Very Long Instruction Word (VLIW)
- Packs multiple operations into one long instruction
- Each slot holds a fixed operation type
- Operation latencies are constant and specified by the ISA
- Requires guarantees of:
  - Parallelism within an instruction -> no RAW checks
  - No data use before ready -> no data interlocks

- H/W
  - Does little to extract ILP
  - Simple and deterministic
  - Exposes timing information to S/W
  - No hazard checks at runtime
- S/W
  - Extracts as much ILP as possible
  - Schedules parallel operations
  - Inserts NOPs to avoid hazards
 
Extracting ILP in Software
Loop Unrolling

- Benefits
  - Increased basic block size -> more scheduling opportunities for the compiler
  - Particularly effective when there are no loop-carried dependences
- Tradeoffs
  - Increased code size
  - Increased pressure on the instruction cache
  - Final cleanup code needed for leftover iterations
 
Software Pipelining

- Instructions from different iterations execute together in one parallel instruction
- Often combined with loop unrolling (modulo scheduling)
 
Loop Unrolling vs. Software Pipelining
With unrolling, startup/wind-down overhead is paid once per unrolled iteration; with software pipelining, only once per loop (in the prologue and epilogue).
Trace Scheduling
In code without loops, irregular control flow makes it difficult to find ILP within individual basic blocks.

- Pick the most frequent branch path as the trace
- Schedule the whole trace at once, as if it were one big basic block
- Add fixup (compensation) code for the off-trace paths
- Use profiling feedback or compiler heuristics to find the common branch paths
Problems
- Classic VLIW
  - Object-code compatibility
    - Must recompile all code for every machine generation
  - Object code size
    - Instruction padding (NOPs) wastes instruction memory/cache
    - Loop unrolling/software pipelining replicate code
  - Scheduling variable-latency memory operations
    - Caches and memory bank conflicts impose statically unpredictable variability
  - Knowing branch probabilities
    - Profiling adds a significant extra step to the build process
  - Scheduling for statically unpredictable branches
    - The optimal schedule varies with the branch path

- In short:
  - Object-code compatibility
  - Static scheduling
    - Unpredictable branches
    - Variable memory latency (unpredictable cache misses)
  - Code size explosion
  - Compiler complexity

- VLIW
  - Failed in the general-purpose computing market
  - Succeeded in the embedded DSP market
Vector Processor
- Scalar unit
  - Load/store architecture
- Vector extension
  - Vector registers
  - Vector instructions
- Implementation
  - H/W control
  - Highly pipelined functional units
  - Interleaved memory system
  - No data caches
  - No virtual memory

- Advantages
  - Compact: one instruction encodes N operations
  - Expressive: tells H/W that these N operations
    - are independent
    - use the same functional unit
    - access disjoint registers
    - access registers in the same pattern as previous instructions
    - access a contiguous block of memory (unit-stride load/store)
    - access memory in a known pattern (strided load/store)
  - Scalable: the same code can run on machines with more parallel pipelines (lanes)
Vector Instruction Parallelism
With 32 elements per vector register and 8 lanes, three chained functional units (e.g., load, multiply, add) can complete 24 operations/cycle while issuing only one vector instruction per cycle.

Vector Chaining
- Forward each element of a vector result to a dependent vector instruction as soon as it is produced, rather than waiting for the whole vector register to be written
- The vector analogue of register bypassing/forwarding

Multimedia/SIMD Extensions
Single Instruction, Multiple Data: one instruction operates on several narrow elements packed into a wide register.

Multimedia vs. Vector
- Limited instruction set
  - No vector length control
  - No strided load/store or scatter/gather
  - Unit-stride loads must be aligned to 64/128-bit boundaries
- Limited vector register length
  - Requires superscalar dispatch to keep functional units busy
  - Loop unrolling to hide latencies -> increases register pressure