Content

Overview
Floating Point Unit (FPU)
Complex Pipeline Control
1. Complex In-Order Pipeline
2. In-Order Superscalar Pipeline
Reordering of Instructions
1. Ensuring Correct Issues without Equalizing Pipeline Depths or Bypassing
2. Simplifying Data Structure (Scoreboarding)
3. Out-of-Order Issuing
4. Tomasulo's Algorithm
Superscalar Control Logic Scaling

Overview

Beyond simple pipeline
- CPI >= 1
  - Out-of-order issue, execute, commit
  - Tomasulo Architecture
- CPI < 1
  - Superscalar, dynamic OOO processor
  - Multi-scalar processor
  - VLIW
  - Vector Processor
  - SIMD
  - GPU acceleration
  - Multi-core processor
  - Reconfigurable Computer
Challenges of simple pipeline
- Operations have variable or long latency
- Partially pipelined
  - Memory operation
  - Floating-point operation
- On cache miss/TLB miss/page fault:
  - H/W stall
  - Trap to OS

Floating Point Unit (FPU)

More H/W than integer units
Common to have several FPUs/different types (Fadd, Fmul, Fdiv, ...)
Pipelined/partially pipelined/not pipelined
To operate FPUs concurrently, the FP register file needs to have more r/w ports
Internal pipeline registers
Interaction between FP datapath & integer datapath is determined by ISA
- RISC-V:
  - Separate register files, interaction via move/convert instructions
  - Separate load/store, but both use GPR for address calculation
  - FP compares write integer registers & use integer branches

Complex Pipeline Control

Challenges
- Structural conflicts at ex. stage: FPU/memory partially pipelined/not pipelined and takes > 1 cycle
- Structural conflicts at wb. stage: variable latency
- Out-of-order write hazards: variable latency
- Exception handling

Complex In-Order Pipeline

Delay wb.: all operations have same latency to wb. stage
- Write port never oversubscribed: 1 in & 1 out every cycle
- Stall pipeline on long latency operations
- Handle exceptions in-order at commit point
Bypassing keeps the increased wb. latency from slowing down single-cycle integer operations
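
A minimal sketch of the delayed wb. idea, assuming made-up per-unit latencies (the numbers and the names `LATENCY`, `padding_stages`, `writeback_cycles` are illustrative only). Each operation is padded with empty stages up to the longest latency, so an instruction issued in cycle t always writes back in cycle t + depth and the write port sees at most one result per cycle.

```python
# Hypothetical execute latencies in cycles (illustrative values only).
LATENCY = {"int": 1, "mem": 2, "fadd": 3, "fmul": 4}
DEPTH = max(LATENCY.values())  # common depth to the wb. stage


def padding_stages(op):
    """Empty stages appended so that `op` reaches wb. after DEPTH cycles."""
    return DEPTH - LATENCY[op]


def writeback_cycles(ops):
    """Issue one op per cycle, in order; return the wb. cycle of each op."""
    return [issue_cycle + LATENCY[op] + padding_stages(op)
            for issue_cycle, op in enumerate(ops)]


def undelayed_writeback_cycles(ops):
    """Same issue schedule without padding, for comparison."""
    return [issue_cycle + LATENCY[op] for issue_cycle, op in enumerate(ops)]
```

With the sequence fmul, fadd, mem, int the undelayed version sends all four results to wb. in the same cycle, while the padded version spreads them over four consecutive cycles.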

In-Order Superscalar Pipeline

Fetch 2 instructions per cycle and issue them concurrently if 1 is integer/memory and 1 is FP
Wider issue may increase regfile ports & bypassing cost
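
The pairing rule can be sketched as below. The opcode sets and the names `pipe_of` and `can_dual_issue` are assumptions for illustration, and the check deliberately ignores data dependences between the two instructions, which a real issue stage would also have to test.

```python
# Assumed opcode classes; a real decoder would derive these from the ISA.
INT_MEM_OPS = {"add", "sub", "lw", "sw", "beq"}
FP_OPS = {"fadd", "fmul", "fdiv"}


def pipe_of(opcode):
    """Return which pipeline an opcode uses: 'int' (integer/memory) or 'fp'."""
    if opcode in FP_OPS:
        return "fp"
    if opcode in INT_MEM_OPS:
        return "int"
    raise ValueError(f"unknown opcode: {opcode}")


def can_dual_issue(first, second):
    """True if one instruction is integer/memory and the other is FP."""
    return {pipe_of(first), pipe_of(second)} == {"int", "fp"}
```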

Reordering of Instructions

Register vs. memory dependence
- Register dependence determined at decode stage
- Memory dependence determined after computing address

Instructions can be reordered as long as no dependence arrow ends up pointing backwards.

General categories
- In-Order Issue + In-Order Completion
- In-Order Issue + Out-of-Order Completion
- Out-of-Order Issue + Out-of-Order Completion

Ensuring Correct Issues without Equalizing Pipeline Depths or Bypassing

Considerations at instruction issuing
- Is FU available?
  - Busy?
- Is input data available? (RAW)
  - Sources must not appear in the Dest column
- Is it safe to write destination? (WAR, WAW)
  - WAR: destination must not appear in the Src1 & Src2 columns
  - WAW: destination must not appear in the Dest column
- Structural conflict at wb. stage?

Entries added if no hazards
Entries removed after wb. stage
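
The issue-time checks above can be sketched as a lookup over a table of in-flight instructions. This is an illustration under assumptions: the `Entry` layout (FU, Dest, Src1, Src2) and all function names are hypothetical, every instruction is assumed to have one destination and two sources, and the structural conflict at the wb. stage is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One in-flight instruction (hypothetical table row)."""
    fu: str
    dest: str
    src1: str
    src2: str


def issue_hazards(table, fu, dest, src1, src2):
    """Return the set of hazards that block issuing the new instruction."""
    hazards = set()
    for e in table:
        if e.fu == fu:
            hazards.add("FU busy")
        if e.dest in (src1, src2):      # source is still being produced
            hazards.add("RAW")
        if dest in (e.src1, e.src2):    # older instruction still reads dest
            hazards.add("WAR")
        if dest == e.dest:              # older instruction still writes dest
            hazards.add("WAW")
    return hazards


def try_issue(table, fu, dest, src1, src2):
    """Add an entry if there are no hazards; report whether it was issued."""
    if issue_hazards(table, fu, dest, src1, src2):
        return False
    table.append(Entry(fu, dest, src1, src2))
    return True


def write_back(table, dest):
    """Remove the entry that writes `dest` once it passes the wb. stage."""
    table[:] = [e for e in table if e.dest != dest]
```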

Simplifying Data Structure (Scoreboarding)

Assuming in-order issue with operands read at issue, WAR cannot occur -> no need for Src1 & Src2.
Avoid WAW by ensuring each register appears at most once in the Dest column.

Busy[FU#]: availability
- Bits hardwired to FUs
WP[reg#]: write pending
- Bits set by issue stage & cleared by wb. stage
- FUs must carry dest field & valid flag (we,ws)

Check condition	Reference
FU available?	`Busy[FU]`
RAW?	`WP[src]`
WAW?	`WP[dest]`
WAR?	not possible

Limitations
- A stalled instruction blocks all later instructions from issuing, even those with no dependencies
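
A minimal sketch of the `Busy`/`WP` checks in the table above. The class and method names are hypothetical, registers are plain integer indices, and for simplicity an FU is assumed to stay busy until its result is written back (a pipelined FU would free up earlier).

```python
class Scoreboard:
    """Simplified scoreboard: one Busy bit per FU, one WP bit per register."""

    def __init__(self, fus, num_regs):
        self.busy = {fu: False for fu in fus}   # Busy[FU#]
        self.wp = [False] * num_regs            # WP[reg#]

    def can_issue(self, fu, dest, srcs):
        if self.busy[fu]:                       # FU available?
            return False
        if any(self.wp[s] for s in srcs):       # RAW: a source is pending
            return False
        if self.wp[dest]:                       # WAW: dest already pending
            return False
        return True                             # WAR not possible in-order

    def issue(self, fu, dest, srcs):
        """Set the bits at the issue stage; report whether issue happened."""
        if not self.can_issue(fu, dest, srcs):
            return False
        self.busy[fu] = True
        self.wp[dest] = True
        return True

    def write_back(self, fu, dest):
        """Clear the bits at the wb. stage (assumption: FU freed here too)."""
        self.busy[fu] = False
        self.wp[dest] = False
```

Because issue is in order, a `False` from `can_issue` stalls every later instruction as well, which is the limitation noted above.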

Out-of-Order Issuing

Issue buffer holds multiple instructions
Decode adds instructions to buffer if no WAW or WAR
Instructions in buffer can be issued if no RAW
Limitations
- No significant improvement due to data hazards
- Number of instructions in pipeline limited by number of registers
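
The decode and issue rules above can be sketched as follows. This is an illustration, not a hardware description: `Instr`, `IssueBuffer` and its methods are hypothetical names, operands are assumed to be read at issue (so WAR only matters against instructions still waiting in the buffer), and FU availability is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str
    srcs: tuple


class IssueBuffer:
    def __init__(self):
        self.waiting = []     # decoded, not yet issued (program order)
        self.executing = []   # issued, result not yet written back

    def decode(self, instr):
        """Accept instr only if it has no WAW or WAR with earlier ones."""
        if any(instr.dest == old.dest for old in self.waiting + self.executing):
            return False                              # WAW
        if any(instr.dest in old.srcs for old in self.waiting):
            return False                              # WAR
        self.waiting.append(instr)
        return True

    def ready(self):
        """Waiting instructions with no RAW on a pending result."""
        result = []
        for i, instr in enumerate(self.waiting):
            pending = {old.dest for old in self.waiting[:i] + self.executing}
            if not pending.intersection(instr.srcs):
                result.append(instr)
        return result

    def issue(self, instr):
        self.waiting.remove(instr)
        self.executing.append(instr)

    def write_back(self, instr):
        self.executing.remove(instr)
```

An independent instruction can issue ahead of an older one that waits on a RAW, but decode still refuses any instruction whose destination clashes with an earlier one, which is where the limited number of register names bites.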

Tomasulo's Algorithm

WAW & WAR can be removed by register renaming.

FU buffers - reservation stations (RS)
- Contains pending operands
- Registers replaced by pointers to RS
  - On-the-fly register renaming
  - More RSs than registers
Common data bus (data + source) broadcasts results to all FUs
- Update pending data
Load/store treated as FUs with RSs
Branch prediction allows FP ops beyond basic blocks in FP queue
Stages
1. Issue
  - If RS free
2. Execute
  - Execute if both operands ready
  - Else wait for common data bus for result
3. Write result (to reorder buffer)
  - Write on common data bus
  - Mark RS available
4. Commit (to register/memory)
  - ROB stores instruction in original fetch order
  - Avoids WAW
  - Precise exception -> ensures we can restart program
  - Easy roll-back control for branch mispredictions
  - BUT added H/W
  - BUT CPI=1 barrier if not careful
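
A rough sketch of the renaming and broadcast mechanism, under stated assumptions: the `Tomasulo` class, its fields and method names are hypothetical, execution takes no time, and results are written to the register file directly at broadcast. The reorder buffer and in-order commit from stage 4 are left out, so this sketch does not give precise exceptions.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


class Tomasulo:
    def __init__(self, regs, num_rs):
        self.regs = dict(regs)   # architectural register values
        self.producer = {}       # reg -> tag of the RS that will write it
        self.rs = {tag: None for tag in range(num_rs)}  # tag -> entry

    def issue(self, op, dest, src1, src2):
        """Allocate an RS; return its tag, or None if no RS is free."""
        free = [t for t, e in self.rs.items() if e is None]
        if not free:
            return None
        tag = free[0]
        entry = {"op": op, "v": [None, None], "q": [None, None]}
        for i, src in enumerate((src1, src2)):
            if src in self.producer:
                entry["q"][i] = self.producer[src]   # pointer to an RS
            else:
                entry["v"][i] = self.regs[src]       # value copied now
        self.rs[tag] = entry
        self.producer[dest] = tag                    # rename dest to this RS
        return tag

    def ready(self, tag):
        entry = self.rs[tag]
        return entry is not None and entry["q"] == [None, None]

    def execute_and_broadcast(self, tag):
        """Compute the result and broadcast (tag, value) on the CDB."""
        if not self.ready(tag):
            raise ValueError("operands not ready")
        entry = self.rs[tag]
        value = OPS[entry["op"]](*entry["v"])
        self.rs[tag] = None                          # mark RS available
        for other in self.rs.values():               # update pending data
            if other is None:
                continue
            for i in range(2):
                if other["q"][i] == tag:
                    other["q"][i], other["v"][i] = None, value
        for reg, t in list(self.producer.items()):
            if t == tag:                             # still the latest writer
                self.regs[reg] = value
                del self.producer[reg]
        return value
```

In this model, a later instruction that overwrites a register (WAW) or writes a register an earlier instruction still reads (WAR) can run first: the earlier reader already holds either the value or a tag, and a register only accepts the result from its most recent renamed writer.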

Superscalar Control Logic Scaling

Each instruction is checked against W * L in-flight instructions (W: issue width, L: instruction lifetime) -> growth in H/W ∝ W * (W * L)
In-order machines
- L related to pipeline latencies
- Check done during issue (interlocks, scoreboard)
Out-of-order machines
- L also includes time in instruction buffers
- Check done by broadcasting tags to waiting instructions at write back
W increases -> larger instruction window needed to find enough parallelism to keep machine busy -> greater L
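
A small numeric sketch of the scaling argument, reading W as issue width and L as instruction lifetime in cycles. The function names are hypothetical, and the linear relation between L and W in the second function is an assumption made only to show the trend, not something the notes state.

```python
def dependency_checks(width, lifetime):
    """Comparisons per cycle: W issuing instructions, each vs. W * L others."""
    return width * (width * lifetime)


def checks_if_lifetime_scales(width, cycles_per_issue_slot=2):
    """Same count under the assumption L = cycles_per_issue_slot * W."""
    return dependency_checks(width, cycles_per_issue_slot * width)
```

With L fixed, doubling W quadruples the count; if L also grows in proportion to W, doubling W multiplies it by eight, i.e. cubic growth in W.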
