Content

  1. Overview
  2. Floating Point Unit (FPU)
  3. Complex Pipeline Control
    1. Complex In-Order Pipeline
    2. In-Order Superscalar Pipeline
  4. Reordering of Instructions
    1. Ensuring Correct Issues without Equalizing Pipeline Depths or Bypassing
    2. Simplifying Data Structure (Scoreboarding)
    3. Out-of-Order Issuing
    4. Tomasulo's Algorithm
  5. Superscalar Control Logic Scaling

Overview

  • Beyond simple pipeline
    • CPI >= 1
      • Out-of-order issue, execute, commit
      • Tomasulo Architecture
    • CPI < 1
      • Superscalar, dynamic OOO processor
      • Multi-scalar processor
      • VLIW
      • Vector Processor
      • SIMD
      • GPU acceleration
      • Multi-core processor
      • Reconfigurable Computer
  • Challenges of simple pipeline
    • Operations have variable or long latency
    • Partially pipelined
      • Memory operation
      • Floating-point operation
    • On cache miss/TLB miss/page fault:
      • H/W stall
      • Trap to OS

Floating Point Unit (FPU)

  • More H/W than integer units
  • Common to have several FPUs/different types (Fadd, Fmul, Fdiv, ...)
  • Pipelined/partially pipelined/not pipelined
  • To operate FPUs concurrently, the FP register file needs to have more r/w ports
  • Internal pipeline registers
  • Interaction between FP datapath & integer datapath is determined by ISA
    • RISC-V:
      • Separate register files, interaction via move/convert instructions
      • Separate load/store, but both use GPR for address calculation
      • FP compares write integer registers; integer branches then test the result

Complex Pipeline Control

  • Challenges
    • Structural conflicts at ex. stage: FPU/memory partially pipelined/not pipelined and takes > 1 cycle
    • Structural conflicts at wb. stage: variable latency
    • Out-of-order write hazards: variable latency
    • Exception handling

Complex In-Order Pipeline

  • Delay wb.: all operations have same latency to wb. stage
    • Write port never oversubscribed: 1 in & 1 out every cycle
    • Stall pipeline on long latency operations
    • Handle exceptions in-order at commit point
  • Bypassing prevents the increased wb. latency from slowing down single-cycle integer operations

In-Order Superscalar Pipeline

  • Fetch 2 instructions per cycle and issue them concurrently if 1 is integer/memory and 1 is FP
  • Wider issue may increase regfile ports & bypassing cost
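
The pairing rule above can be sketched as a small predicate. This is a minimal illustration, not a real issue stage: the class labels "int", "mem" and "fp" and the function name are hypothetical, and data dependences between the two instructions are ignored.

```python
# Hypothetical instruction classes; the notes only separate integer/memory from FP.
INT_MEM = {"int", "mem"}
FP = {"fp"}

def can_dual_issue(first, second):
    """Return True if the two fetched instructions may issue in the same cycle.

    Models only the structural rule (one integer/memory instruction plus
    one FP instruction); dependences between the pair are not checked.
    """
    classes = {first, second}
    return bool(classes & INT_MEM) and bool(classes & FP)
```

For example, an integer add paired with an FP multiply can issue together, while two FP instructions or an integer/memory pair must issue in separate cycles.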

Reordering of Instructions

  • Register vs. memory dependence
    • Register dependence determined at decode stage
    • Memory dependence determined after computing address

Instruction order can be swapped as long as no dependence arrow points backwards.

  • General categories
    • In-Order Issue + In-Order Completion
    • In-Order Issue + Out-of-Order Completion
    • Out-of-Order Issue + Out-of-Order Completion
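
To make the register dependences concrete, the sketch below classifies the hazards between two instructions and uses that to decide whether they may be swapped. It assumes a hypothetical `Instr` record with one destination and a tuple of sources, and it covers register dependences only; memory dependences, which need computed addresses, are not modeled.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination register, a tuple of sources.
Instr = namedtuple("Instr", ["dest", "srcs"])

def register_hazards(earlier, later):
    """Return the set of register hazards `later` has with respect to `earlier`."""
    hazards = set()
    if earlier.dest is not None and earlier.dest in later.srcs:
        hazards.add("RAW")   # later reads what earlier writes
    if later.dest is not None and later.dest in earlier.srcs:
        hazards.add("WAR")   # later overwrites what earlier reads
    if later.dest is not None and later.dest == earlier.dest:
        hazards.add("WAW")   # both write the same register
    return hazards

def can_swap(earlier, later):
    """Two instructions may be reordered only if no register hazard links them."""
    return not register_hazards(earlier, later)
```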

Ensuring Correct Issues without Equalizing Pipeline Depths or Bypassing

  • Considerations at instruction issuing
    • Is FU available?
      • Busy?
    • Is input data available? (RAW)
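      • Compare the source registers against Dest of in-flight instructions
    • Is it safe to write destination? (WAR, WAW)
      • WAR: compare the destination against Src1 & Src2 of in-flight instructions
      • WAW: compare the destination against Dest of in-flight instructions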
    • Structural conflict at wb. stage?

  • Entries are added when an instruction issues with no hazards
  • Entries are removed after the wb. stage
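
The issue-time checks can be sketched as a search over a table of in-flight instructions. This is a rough model under stated assumptions: the `Entry` fields and the `issue_stalls` function are hypothetical names, each FU is treated as busy while it has an entry in the table, and the structural conflict at the wb. stage is not modeled.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    """One in-flight instruction (hypothetical layout: FU, Dest, Src1, Src2)."""
    fu: str
    dest: str
    src1: str
    src2: str

def issue_stalls(in_flight, fu, dest, src1, src2):
    """Return the reasons the candidate instruction cannot issue yet (empty if it can)."""
    reasons = []
    if any(e.fu == fu for e in in_flight):
        reasons.append("FU busy")
    if any(e.dest in (src1, src2) for e in in_flight):
        reasons.append("RAW")   # a source is still being produced
    if any(dest in (e.src1, e.src2) for e in in_flight):
        reasons.append("WAR")   # an earlier instruction still reads our destination
    if any(e.dest == dest for e in in_flight):
        reasons.append("WAW")   # an earlier instruction still writes our destination
    return reasons
```

An instruction is added to the table only when the returned list is empty, and its entry is dropped once it has written back.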

Simplifying Data Structure (Scoreboarding)

Assuming in-order issue (with operands read at issue), WAR cannot occur, so the Src1 & Src2 columns are not needed.
Avoid WAW by ensuring each register appears at most once in the Dest column.

  • Busy[FU#]: availability
    • Bits hardwired to FU's
  • WP[reg#]: write pending
    • Bits set by issue stage & cleared by wb. stage
    • FUs must carry dest field & valid flag (we,ws)

  Check condition    Reference
  FU available?      Busy[FU]
  RAW?               WP[src]
  WAW?               WP[dest]
  WAR?               not possible

  • Limitations
    • A stalled instruction prevents later instructions with no dependencies from being issued
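
The Busy and WP bits can be sketched as two small bit tables. This is a simplified model rather than a hardware description: the `Scoreboard` class and its method names are hypothetical, and each FU is assumed to stay busy from issue until write back (that is, it is treated as not pipelined).

```python
class Scoreboard:
    """Minimal in-order scoreboard: one Busy bit per FU, one WP bit per register."""

    def __init__(self, fus, regs):
        self.busy = {fu: False for fu in fus}   # Busy[FU#]
        self.wp = {r: False for r in regs}      # WP[reg#]

    def can_issue(self, fu, dest, srcs):
        if self.busy[fu]:                       # FU available?
            return False
        if any(self.wp[s] for s in srcs):       # RAW?
            return False
        if self.wp[dest]:                       # WAW?
            return False
        return True                             # WAR is not possible with in-order issue

    def issue(self, fu, dest, srcs):
        """Set the bits at issue; return False (stall) if any check fails."""
        if not self.can_issue(fu, dest, srcs):
            return False
        self.busy[fu] = True
        self.wp[dest] = True
        return True

    def writeback(self, fu, dest):
        """Clear the bits at the wb. stage."""
        self.busy[fu] = False
        self.wp[dest] = False
```

Because issue is in-order, a `False` from `issue` stalls every younger instruction as well, which is the limitation noted above.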

Out-of-Order Issuing

  • Issue buffer holds multiple instructions
  • Decode adds instructions to buffer if no WAW or WAR
  • Instructions in buffer can be issued if no RAW
  • Limitations
    • No significant improvement, due to data hazards
    • Number of instructions in pipeline limited by number of registers
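
Selection from the issue buffer can be sketched as picking the oldest entry with no RAW hazard. This assumes, as in the bullets above, that decode has already screened out WAW and WAR; the `(dest, srcs)` tuple layout and the function name are hypothetical, and FU availability is ignored.

```python
def pick_ready(buffer, write_pending):
    """Return the index of the oldest buffered instruction with no RAW hazard, or None.

    buffer: list of (dest, srcs) tuples in program order.
    write_pending: set of registers whose values are still being produced.
    """
    for i, (dest, srcs) in enumerate(buffer):
        if not any(s in write_pending for s in srcs):
            return i
    return None
```

With `f4` still pending, an older instruction that reads `f4` waits in the buffer while a younger independent instruction is issued ahead of it.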

Tomasulo's Algorithm

WAW & WAR can be removed by register renaming.

  • FU buffers - reservation stations (RS)
    • Contains pending operands
    • Registers replaced by pointers to RS
      • On-the-fly register renaming
      • There can be more RSs than registers
  • Common data bus (data + source) broadcasts results to all FUs
    • Update pending data
  • Load/store treated as FUs with RSs
  • Branch prediction allows FP ops beyond basic blocks in FP queue
  • Stages
    1. Issue
      • If RS free
    2. Execute
      • Execute if both operands ready
      • Else wait for common data bus for result
    3. Write result (to reorder buffer)
      • Write on common data bus
      • Mark RS available
    4. Commit (to register/memory)
      • ROB stores instruction in original fetch order
      • Avoids WAW
      • Precise exception -> ensures we can restart program
      • Easy roll-back control for branch mispredictions
      • BUT added H/W
      • BUT CPI=1 barrier if not careful
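
How renaming removes WAW and WAR can be shown with a small sketch. It is not a model of the full algorithm: the `rename` function is hypothetical, tags are drawn from an unbounded counter instead of a finite pool of reservation stations, and execution, the common data bus and the ROB are left out.

```python
from itertools import count

def rename(program):
    """Give every destination a fresh tag; sources read the newest tag for their register.

    program: list of (dest, srcs) tuples in program order.
    A source with no pending producer keeps its architectural register name,
    meaning its value is read from the register file at issue.
    """
    latest = {}          # architectural register -> tag of its newest producer
    tags = count(1)
    renamed = []
    for dest, srcs in program:
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        tag = f"RS{next(tags)}"
        latest[dest] = tag
        renamed.append((tag, new_srcs))
    return renamed

program = [
    ("f2", ("f0", "f4")),    # reads f0
    ("f0", ("f6", "f8")),    # WAR on f0 with the first instruction
    ("f2", ("f0", "f10")),   # WAW on f2 with the first, RAW on f0 with the second
]
renamed = rename(program)
```

After renaming, the three destinations are the distinct tags RS1, RS2 and RS3, so the WAW and WAR hazards are gone; only the true RAW dependence remains, because the third instruction now reads RS2.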

Superscalar Control Logic Scaling

  • Each issued instruction must be checked against W * L in-flight instructions (W: issue width, L: instruction lifetime in cycles) -> growth in H/W ∝ W * (W * L)
  • In-order machines
    • L related to pipeline latencies
    • Check done during issue (interlocks, scoreboard)
  • Out-of-order machines
    • L also includes time in instruction buffers
    • Check done by broadcasting tags to waiting instructions at write back
  • W increases -> larger instruction window needed to find enough parallelism to keep machine busy -> greater L
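
The growth rule can be worked through numerically. The helper below is a hypothetical illustration that only evaluates W * (W * L); the constant of proportionality and the real comparator cost are not modeled.

```python
def dependence_checks(width, lifetime):
    """Checks per cycle, up to a constant: W issuing instructions, each
    compared against the W * L instructions that may be in flight."""
    in_flight = width * lifetime
    return width * in_flight
```

With L fixed at 5 cycles, W = 1, 2, 4 gives 5, 20 and 80 checks, so the cost grows quadratically in W. Since L itself tends to grow with W, the real growth is steeper.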
