Content
- Overview
- Floating Point Unit (FPU)
- Complex Pipeline Control
- Complex In-Order Pipeline
- In-Order Superscalar Pipeline
- Reordering of Instructions
- Ensuring Correct Issues without Equalizing Pipeline Depths or Bypassing
- Simplifying Data Structure (Scoreboarding)
- Out-of-Order Issuing
- Tomasulo's Algorithm
- Superscalar Control Logic Scaling
Overview
- Beyond simple pipeline
- CPI >= 1
- Out-of-order issue, execute, commit
- Tomasulo Architecture
- CPI < 1
- Superscalar, dynamic OOO processor
- Multi-scalar processor
- VLIW
- Vector Processor
- SIMD
- GPU acceleration
- Multi-core processor
- Reconfigurable Computer
- Challenges of simple pipeline
- Operations have variable or long latency
- Partially pipelined
- Memory operation
- Floating-point operation
- On cache miss/TLB miss/page fault:
- H/W stall
- Trap to OS
Floating Point Unit (FPU)

- More H/W than integer units
- Common to have several FPUs/different types (Fadd, Fmul, Fdiv, ...)
- Pipelined/partially pipelined/not pipelined
- To operate FPUs concurrently, the FP register file needs to have more r/w ports
- Internal pipeline registers
- Interaction between FP datapath & integer datapath is determined by ISA
- RISC-V:
- Separate register files, interaction via move/convert instructions
- Separate load/store, but both use GPR for address calculation
- FP compares write integer registers, so conditional control flow uses integer branches
Complex Pipeline Control
- Challenges
- Structural conflicts at ex. stage: FPU/memory are partially pipelined or not pipelined and take > 1 cycle
- Structural conflicts at wb. stage: variable latency
- Out-of-order write hazards: variable latency
- Exception handling
Complex In-Order Pipeline

- Delay wb.: all operations have same latency to wb. stage
- Write port never oversubscribed: 1 in & 1 out every cycle
- Stall pipeline on long latency operations
- Handle exceptions in-order at commit point
- Bypassing prevents the increased wb. latency from slowing down single-cycle integer operations
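
The effect of delaying write-back can be illustrated with a small sketch. The latencies and the one-issue-per-cycle model below are assumptions chosen for illustration, not values from any real pipeline. With variable latencies, several operations reach the single write port in the same cycle and writes complete out of order; padding every operation to the deepest latency gives one write per cycle, in order.

```python
from collections import Counter

# Hypothetical execute latencies in cycles (illustrative only).
LATENCY = {"int": 1, "fmul": 3, "fdiv": 4}


def writeback_cycles(ops, equalize):
    """Write-back cycle of each op, assuming one op issues per cycle."""
    depth = max(LATENCY.values())
    cycles = []
    for issue_cycle, op in enumerate(ops):
        latency = depth if equalize else LATENCY[op]
        cycles.append(issue_cycle + latency)
    return cycles


def port_conflicts(cycles):
    """Cycles in which more than one op wants the single write port."""
    return sorted(c for c, n in Counter(cycles).items() if n > 1)


ops = ["fdiv", "fmul", "int", "int"]
variable = writeback_cycles(ops, equalize=False)  # [4, 4, 3, 4]
delayed = writeback_cycles(ops, equalize=True)    # [4, 5, 6, 7]
print(port_conflicts(variable), port_conflicts(delayed))
```

In the variable-latency case the third op also writes before the first two, which is the out-of-order write hazard; the equalized case has neither problem, at the cost of longer integer latency unless bypassing is added.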
In-Order Superscalar Pipeline

- Fetch 2 instructions per cycle and issue them concurrently if 1 is integer/memory and 1 is FP
- Wider issue may increase regfile ports & bypassing cost
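
A minimal sketch of the pairing rule above. The opcode names are hypothetical, every non-FP opcode is assumed to be integer/memory, and dependences between the two instructions are ignored.

```python
# Hypothetical FP opcode set; anything else is treated as integer/memory.
FP_OPS = {"fadd", "fmul", "fdiv"}


def is_fp(op):
    return op in FP_OPS


def can_dual_issue(first, second):
    """True if exactly one of the two fetched instructions is FP."""
    return is_fp(first) != is_fp(second)


print(can_dual_issue("add", "fmul"))   # one integer, one FP
print(can_dual_issue("add", "lw"))     # both integer/memory
```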
Reordering of Instructions
- Register vs. memory dependence
- Register dependence determined at decode stage
- Memory dependence determined after computing address

Instructions can be reordered as long as no dependence arrow ends up pointing backwards.
- General categories
- In-Order Issue + In-Order Completion
- In-Order Issue + Out-of-Order Completion
- Out-of-Order Issue + Out-of-Order Completion
Ensuring Correct Issues without Equalizing Pipeline Depths or Bypassing
- Considerations at instruction issuing
- Is FU available?
- Check the Busy column
- Is input data available? (RAW)
- Search the Dest column for each source
- Is it safe to write destination? (WAR, WAW)
- WAR: search the Src1 & Src2 columns for the destination
- WAW: search the Dest column for the destination
- Structural conflict at wb. stage?

- Entries added if no hazards
- Entries removed after wb. stage
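
The issue-time checks can be sketched as a lookup over a table with one row per functional unit. The field names (busy, dest, src1, src2), register names, and FU names are assumptions for illustration, and the structural conflict at the wb. stage is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """One row of the issue table (one row per functional unit)."""
    busy: bool = False
    dest: Optional[str] = None
    src1: Optional[str] = None
    src2: Optional[str] = None


def hazards(table, fu, dest, src1, src2):
    """Return the reasons an instruction cannot issue; an empty set means it can."""
    found = set()
    active = [e for e in table.values() if e.busy]
    if table[fu].busy:
        found.add("FU busy")
    # RAW: a source will still be written by an in-flight instruction.
    if any(e.dest in (src1, src2) for e in active if e.dest is not None):
        found.add("RAW")
    # WAR: an in-flight instruction still has to read our destination.
    if dest is not None and any(dest in (e.src1, e.src2) for e in active):
        found.add("WAR")
    # WAW: an in-flight instruction writes the same destination.
    if dest is not None and any(e.dest == dest for e in active):
        found.add("WAW")
    return found


table = {
    "fadd": Entry(),
    "fmul": Entry(busy=True, dest="f4", src1="f2", src2="f6"),
}
raw = hazards(table, "fadd", dest="f8", src1="f4", src2="f0")
war = hazards(table, "fadd", dest="f2", src1="f0", src2="f1")
waw = hazards(table, "fmul", dest="f4", src1="f0", src2="f1")
clean = hazards(table, "fadd", dest="f8", src1="f0", src2="f1")
```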
Simplifying Data Structure (Scoreboarding)
Assuming in-order issue with operands read at issue, WAR cannot occur, so Src1 & Src2 are not needed.
Avoid WAW by ensuring each register appears at most once in the Dest column.
- Busy[FU#]: availability
- Bits hardwired to FUs
- WP[reg#]: write pending
- Bits set by issue stage & cleared by wb. stage
- FUs must carry dest field & valid flag (we, ws)
| Check condition | Reference |
|---|---|
| FU available? | Busy[FU] |
| RAW? | WP[src] |
| WAW? | WP[dest] |
| WAR? | not possible |

- Limitations
- A stalled instruction prevents later instructions with no dependencies from being issued
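
A minimal scoreboard sketch using only the Busy and WP bit vectors. The class and method names are hypothetical. As a simplification, Busy is set at issue and cleared at write-back here (as if each FU were unpipelined), rather than being hardwired to the FU state.

```python
class Scoreboard:
    """Issue checks with Busy[FU] and WP[reg] bits only."""

    def __init__(self, fus, regs):
        self.busy = {fu: False for fu in fus}
        self.wp = {reg: False for reg in regs}

    def can_issue(self, fu, dest, srcs):
        if self.busy[fu]:                    # FU available?
            return False
        if any(self.wp[s] for s in srcs):    # RAW: a source has a pending write
            return False
        if self.wp[dest]:                    # WAW: destination has a pending write
            return False
        return True                          # WAR is not possible with in-order issue

    def issue(self, fu, dest, srcs):
        if not self.can_issue(fu, dest, srcs):
            return False
        self.busy[fu] = True
        self.wp[dest] = True                 # set by the issue stage
        return True

    def writeback(self, fu, dest):
        self.busy[fu] = False
        self.wp[dest] = False                # cleared by the wb. stage


sb = Scoreboard(["fadd", "fmul"], [f"f{i}" for i in range(8)])
sb.issue("fmul", "f4", ["f0", "f1"])
raw_blocked = not sb.can_issue("fadd", "f5", ["f4", "f2"])
waw_blocked = not sb.can_issue("fadd", "f4", ["f0", "f1"])
independent_ok = sb.can_issue("fadd", "f6", ["f0", "f1"])
sb.writeback("fmul", "f4")
raw_cleared = sb.can_issue("fadd", "f5", ["f4", "f2"])
```

Because issue is in order, an instruction that fails `can_issue` would also hold back every instruction behind it, which is the limitation noted above.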
Out-of-Order Issuing
- Issue buffer holds multiple instructions
- Decode adds instructions to buffer if no WAW or WAR
- Instructions in buffer can be issued if no RAW
- Limitations
- No significant improvement, because WAW & WAR hazards still keep instructions out of the buffer
- Number of instructions in pipeline limited by number of registers
Tomasulo's Algorithm
WAW & WAR can be removed by register renaming.

- FU buffers - reservation stations (RS)
- Contains pending operands
- Registers replaced by pointers to RS
- On-the-fly register renaming
- More RSs than registers
- Common data bus (data + source) broadcasts results to all FUs
- Update pending data
- Load/store treated as FUs with RSs
- Branch prediction allows FP ops beyond basic blocks in FP queue
- Stages
- Issue
- If RS free
- Execute
- Execute if both operands ready
- Else wait for common data bus for result
- Write result (to reorder buffer)
- Write on common data bus
- Mark RS available
- Commit (to register/memory)
- ROB stores instruction in original fetch order
- Avoids WAW
- Precise exception -> ensures we can restart program
- Easy roll-back control for branch mispredictions
- BUT added H/W
- BUT CPI=1 barrier if not careful
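
A simplified sketch of renaming through reservation stations and a common data bus broadcast. All names are hypothetical. It assumes an unbounded pool of RSs (so the "If RS free" check is skipped), omits the reorder buffer (results go straight to the register file, as in the original algorithm), and takes result values from the caller instead of executing operations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RS:
    """Reservation station: each operand is a value (v) or a pending tag (q)."""
    op: str
    v1: Optional[float] = None
    q1: Optional[str] = None
    v2: Optional[float] = None
    q2: Optional[str] = None

    def ready(self):
        return self.q1 is None and self.q2 is None


class Tomasulo:
    def __init__(self, regs):
        self.regs = dict(regs)   # architectural register values
        self.rename = {}         # register -> tag of the RS that will produce it
        self.stations = {}       # tag -> RS
        self.next_id = 0

    def _read(self, reg):
        tag = self.rename.get(reg)
        return (None, tag) if tag is not None else (self.regs[reg], None)

    def issue(self, op, dest, src1, src2):
        tag = f"RS{self.next_id}"
        self.next_id += 1
        v1, q1 = self._read(src1)
        v2, q2 = self._read(src2)
        self.stations[tag] = RS(op, v1, q1, v2, q2)
        self.rename[dest] = tag  # later readers of dest now point at this RS
        return tag

    def broadcast(self, tag, value):
        """Common data bus: deliver (tag, value) to waiting RSs and registers."""
        del self.stations[tag]   # mark RS available
        for rs in self.stations.values():
            if rs.q1 == tag:
                rs.v1, rs.q1 = value, None
            if rs.q2 == tag:
                rs.v2, rs.q2 = value, None
        for reg, t in list(self.rename.items()):
            if t == tag:
                self.regs[reg] = value
                del self.rename[reg]


m = Tomasulo({"f0": 1.0, "f2": 2.0, "f4": 4.0, "f6": 0.0})
t1 = m.issue("fdiv", "f6", "f0", "f2")  # f6 = f0 / f2
t2 = m.issue("fadd", "f0", "f6", "f4")  # RAW on f6; WAR on f0 is harmless
t3 = m.issue("fmul", "f6", "f2", "f4")  # WAW on f6 with t1
waiting_on = m.stations[t2].q1          # t2 waits for t1's tag
m.broadcast(t3, 8.0)                    # t3 finishes first (out of order)
m.broadcast(t1, 0.5)                    # stale write to f6 is dropped
m.broadcast(t2, 0.5 + 4.0)
```

The WAR on f0 is harmless because t1 copied f0's value into its RS at issue, and the WAW on f6 is harmless because only the newest producer's tag remains in the rename table.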
Superscalar Control Logic Scaling

- Each instruction checks against W * L instructions (W: issue width, L: instruction lifetime in cycles) -> growth in H/W ∝ W * (W * L)
- In-order machines
- L related to pipeline latencies
- Check done during issue (interlocks, scoreboard)
- Out-of-order machines
- L also includes time in instruction buffers
- Check done by broadcasting tags to waiting instructions at write back
- W increases -> larger instruction window needed to find enough parallelism to keep machine busy -> greater L
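
The W * (W * L) growth can be tabulated with a short sketch, taking W as the issue width and L as the instruction lifetime in cycles. The function name and the sample values are illustrative only. With L held fixed, doubling W quadruples the number of checks; in practice L also grows with W, so the real growth is steeper.

```python
def dependence_checks(width, lifetime):
    """Each of `width` issuing instructions checks against width * lifetime in-flight ones."""
    in_flight = width * lifetime
    return width * in_flight


for w in (1, 2, 4, 8):
    print(w, dependence_checks(w, lifetime=5))
```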