Content

  1. Very Long Instruction Word (VLIW)
    1. Extracting ILP in Software
      1. Loop Unrolling
      2. Software Pipelining
      3. Trace Scheduling
    2. Problems
  2. Vector Processor
    1. Vector Instruction Parallelism
    2. Vector Chaining
  3. Multimedia/SIMD Extensions
    1. Multimedia vs. Vector

Very Long Instruction Word (VLIW)

  • Pack multiple independent operations into one long instruction
  • Each slot holds a fixed type of operation (e.g., ALU, memory, branch)
  • Operation latencies are constant and specified by the architecture
  • Requires a guarantee of
    • Parallelism within an instruction -> no RAW checks
    • No data use before ready -> no data interlocks

  • H/W
    • Does little to extract ILP
    • Simple and deterministic
    • Expose timing information to S/W
    • No hazard checks at runtime
  • S/W
    • Extract as much ILP as possible
    • Schedule parallel operations
    • Insert NOPs to avoid hazards
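
The split of responsibilities above can be sketched in C. This is a hypothetical 3-slot VLIW encoding (the enum, struct, and `count_nop_slots` helper are illustrative, not any real ISA): the compiler packs independent operations into slots and pads the rest with NOPs, which is why VLIW code tends to be large.

```c
/* Hypothetical 3-slot VLIW word: one ALU, one memory, and one branch
 * slot. Slot types are fixed; slots the compiler cannot fill get NOPs. */
typedef enum { NOP, ADD, MUL, LOAD, STORE, BRANCH } Op;

typedef struct {
    Op alu;    /* slot 0: ALU operation    */
    Op mem;    /* slot 1: memory operation */
    Op branch; /* slot 2: branch operation */
} VliwWord;

/* The compiler guarantees the slots of one word are independent, so
 * hardware can issue all three in parallel with no RAW hazard checks. */
static const VliwWord program[3] = {
    { ADD, LOAD, NOP    },   /* independent ADD and LOAD issue together */
    { MUL, NOP,  NOP    },   /* nothing to pair with the MUL -> padding */
    { NOP, NOP,  BRANCH },
};

/* Count wasted slots: NOP padding is the code-size cost of VLIW. */
int count_nop_slots(const VliwWord *p, int n) {
    int nops = 0;
    for (int i = 0; i < n; i++)
        nops += (p[i].alu == NOP) + (p[i].mem == NOP) + (p[i].branch == NOP);
    return nops;
}
```

In this tiny program, 5 of the 9 slots are NOPs, which previews the code-size problem discussed below.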

Extracting ILP in Software

Loop Unrolling

  • Benefits
    • Increased basic block size -> more scope for the compiler to schedule
    • Particularly effective when there are no loop-carried dependencies
  • Tradeoffs
    • Increased code size
    • Increased pressure on the instruction cache
    • Final cleanup code needed for leftover iterations
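
A minimal C sketch of 4-way unrolling (function names are illustrative). The unrolled body is one large basic block with four independent adds into separate accumulators, which a VLIW scheduler can pack together; the second loop is the cleanup code for leftover iterations.

```c
/* Original loop: one add per iteration, plus loop overhead each time. */
int sum_rolled(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}

/* Unrolled by 4: the bigger basic block gives the scheduler four
 * independent adds (separate accumulators -> no RAW chain) per trip. */
int sum_unrolled(const int *a, int n) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {    /* main unrolled body */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];  /* final cleanup for leftovers */
    return s0 + s1 + s2 + s3;
}
```
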
Software Pipelining

  • Operations from different loop iterations execute in one parallel instruction
  • Often combined with loop unrolling
    • Modulo scheduling
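
A rough C sketch of the transformation for `b[i] = a[i] * 2` (function names are illustrative). In the pipelined version, the steady-state loop body contains the load for iteration i+1 next to the multiply/store for iteration i, so a VLIW compiler can pack operations from two different iterations into one parallel instruction; the prologue and epilogue appear once per loop.

```c
#include <stddef.h>

/* Naive loop: load, multiply, store are serial within each iteration. */
void scale_naive(const int *a, int *b, size_t n) {
    for (size_t i = 0; i < n; i++) b[i] = a[i] * 2;
}

/* Software-pipelined version: operations from different iterations
 * sit side by side in the steady-state body. */
void scale_pipelined(const int *a, int *b, size_t n) {
    if (n == 0) return;
    int x = a[0];                     /* prologue: first load           */
    for (size_t i = 0; i + 1 < n; i++) {
        int next = a[i + 1];          /* load for iteration i+1         */
        b[i] = x * 2;                 /* multiply/store for iteration i */
        x = next;
    }
    b[n - 1] = x * 2;                 /* epilogue: last multiply/store  */
}
```
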
Loop Unrolling vs. Software Pipelining

Loop unrolling pays startup/wind-down overhead once per (unrolled) iteration; software pipelining pays it only once per loop, in the prologue and epilogue.

Trace Scheduling

Without loops, irregular code makes it difficult to find ILP within individual basic blocks.

  • Pick the most frequent branch path as the trace
  • Schedule the whole trace as one straight-line block
  • Add fixup code for the off-trace paths
  • Use profiling feedback or compiler heuristics to find the common branch paths

Problems

  • Classic VLIW
    • Object-code compatibility
      • Recompile all code for every machine
    • Object code size
      • Instruction padding wastes instruction memory/cache
      • Loop unrolling/software pipelining replicates code
    • Scheduling variable latency memory operations
      • Caches/memory bank conflicts impose statically unpredictable variability
    • Knowing branch probabilities
      • Profiling -> a significant extra step in the build process
    • Scheduling for statically unpredictable branches
      • Optimal schedule varies with branch path
  • Static Scheduling
    • Unpredictable branches
    • Variable memory latency (unpredictable cache misses)
    • Code size explosion
    • Compiler complexity
  • VLIW
    • Failed in the general-purpose computing market
    • Successful in the embedded DSP market

Vector Processor

  • Scalar unit
    • Load/store architecture
  • Vector extension
    • Vector registers
    • Vector instructions
  • Implementation
    • H/W control
    • Highly pipelined functional units
    • Interleaved memory systems
    • No data caches
    • No virtual memory

  • Advantages
    • Compact
      • 1 instruction encodes multiple operations
    • Expressive, tells H/W that these operations:
      • are independent
      • use the same functional unit
      • access disjoint registers
      • access registers in the same pattern as previous instructions
      • access a contiguous block of memory (unit-stride load/store)
      • access memory in a known pattern (strided load/store)
    • Scalable
      • Can run same code on more parallel pipelines (lanes)
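
A small C model of why one vector instruction is so expressive (the `VReg` type, `VLEN`, and the helper names are hypothetical, not a real ISA). Each helper stands in for a single instruction: the element loop it contains is a set of operations the instruction declares independent, so hardware can spread them across any number of lanes without dependence checks.

```c
#include <stddef.h>

#define VLEN 8  /* elements per (hypothetical) vector register */

typedef struct { int e[VLEN]; } VReg;

/* One vector add instruction: VLEN independent operations that access
 * disjoint register elements -- hardware needs no per-element checks. */
void vadd(VReg *vd, const VReg *vs1, const VReg *vs2) {
    for (int i = 0; i < VLEN; i++) vd->e[i] = vs1->e[i] + vs2->e[i];
}

/* Unit-stride load: tells hardware the accesses are contiguous. */
void vload(VReg *vd, const int *mem) {
    for (int i = 0; i < VLEN; i++) vd->e[i] = mem[i];
}

/* Strided load: the access pattern is known up front. */
void vload_strided(VReg *vd, const int *mem, size_t stride) {
    for (int i = 0; i < VLEN; i++) vd->e[i] = mem[i * stride];
}
```

Each call compactly encodes VLEN operations, which is the "1 instruction encodes multiple operations" advantage above.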

Vector Instruction Parallelism

Example: 32 elements per vector register and 8 lanes; with 3 pipelined functional units (load, multiply, add) running in parallel, the machine completes 3 x 8 = 24 operations/cycle while issuing only one instruction per cycle.

Vector Chaining

Forward each element of a vector result to a dependent vector instruction as soon as it is produced, instead of waiting for the entire vector register to be written; chaining is the vector analogue of register bypassing.
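A rough C sketch of chaining for `d = a*b + c` (the `vmul_vadd_chained` helper and `CVLEN` are hypothetical). Without chaining, the add unit would wait for the whole product vector; with chaining, each product element is forwarded to the adder as it leaves the multiplier pipeline, which the element-by-element loop below models.

```c
#define CVLEN 8  /* elements per (hypothetical) vector register */

/* Chained vmul -> vadd: the adder consumes each product element the
 * cycle after the multiplier produces it, instead of waiting for the
 * full product vector to be written back. */
void vmul_vadd_chained(int *d, const int *a, const int *b, const int *c) {
    for (int i = 0; i < CVLEN; i++) {
        int p = a[i] * b[i];  /* multiply unit produces element i        */
        d[i] = p + c[i];      /* add unit consumes it immediately (chain) */
    }
}
```
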
Multimedia/SIMD Extensions

Single instruction, multiple data: one instruction operates on several narrow data elements packed into a wide register (e.g., four 8-bit values in a 32-bit register).
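
The packed-narrow-element idea can even be emulated with ordinary word arithmetic ("SWAR"). The sketch below (the `paddb` name mimics packed-byte-add instructions but the code is plain C) adds four 8-bit lanes in one 32-bit addition; the masking keeps carries from crossing lane boundaries, so each byte wraps modulo 256 independently.

```c
#include <stdint.h>

/* Packed add of four 8-bit lanes inside one 32-bit word.
 * Clear the top bit of each byte so no carry can cross into the next
 * lane, add, then restore each lane's top bit with XOR. */
uint32_t paddb(uint32_t x, uint32_t y) {
    uint32_t sum = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu);
    return sum ^ ((x ^ y) & 0x80808080u);
}
```

Real multimedia extensions (e.g., MMX/SSE) provide this as a single hardware instruction, with saturating variants as well.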

Multimedia vs. Vector

  • Limited instruction set
    • No vector length control
    • No strided load/store or scatter/gather
    • Unit-stride loads must be aligned to a 64/128-bit boundary
  • Limited vector register length
    • Requires superscalar dispatch to keep functional units busy
    • Loop unrolling to hide latencies -> increased register pressure
