Content
- Overview
- Floating Point Unit (FPU)
- Complex Pipeline Control
- Complex In-Order Pipeline
- In-Order Superscalar Pipeline
- Reordering of Instructions
- Ensuring Correct Issues without Equalizing Pipeline Depths or Bypassing
- Simplifying Data Structure (Scoreboarding)
- Out-of-Order Issuing
- Tomasulo's Algorithm
- Superscalar Control Logic Scaling
Overview
- Beyond simple pipeline
- CPI >= 1
- Out-of-order issue, execute, commit
- Tomasulo Architecture
- CPI < 1
- Superscalar, dynamic OOO processor
- Multi-scalar processor
- VLIW
- Vector Processor
- SIMD
- GPU acceleration
- Multi-core processor
- Reconfigurable Computer
- Challenges of simple pipeline
- Operations have variable or long latency
- Partially pipelined
- Memory operation
- Floating-point operation
- On cache miss/TLB miss/page fault:
- H/W stall
- Trap to OS
Floating Point Unit (FPU)
- More H/W than integer units
- Common to have several FPUs/different types (Fadd, Fmul, Fdiv, ...)
- Pipelined/partially pipelined/not pipelined
- To operate FPUs concurrently, the FP register file needs to have more r/w ports
- Internal pipeline registers
- Interaction between FP datapath & integer datapath is determined by ISA
- RISC-V:
- Separate register files, interaction via move/convert instructions
- Separate load/store, but both use GPR for address calculation
- FP compares write integer registers & use integer branches
Complex Pipeline Control
- Challenges
- Structural conflicts at ex. stage: FPU/memory are partially pipelined/not pipelined and take > 1 cycle
- Structural conflicts at wb. stage: variable latency
- Out-of-order write hazards: variable latency
- Exception handling
Complex In-Order Pipeline
- Delay wb.: all operations have same latency to wb. stage
- Write port never oversubscribed: 1 in & 1 out every cycle
- Stall pipeline on long latency operations
- Handle exceptions in-order at commit point
- Bypassing keeps the increased wb. latency from slowing down single-cycle integer operations
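A minimal sketch of why delaying write-back keeps the single write port from being oversubscribed. The latency values and the instruction mix are made up for illustration, and `wb_cycles` and `has_port_conflict` are hypothetical helper names, not part of any real design.

```python
# Hypothetical execute latencies in cycles (illustrative values only).
LATENCY = {"int": 1, "load": 2, "fmul": 4}

def wb_cycles(ops, equalize):
    """Return the write-back cycle of each op, issuing one op per cycle in order."""
    depth = max(LATENCY.values())  # the deepest unit sets the common depth
    cycles = []
    for issue_cycle, op in enumerate(ops):
        latency = depth if equalize else LATENCY[op]
        cycles.append(issue_cycle + latency)
    return cycles

def has_port_conflict(cycles):
    """True if two ops want the single write port in the same cycle."""
    return len(set(cycles)) != len(cycles)

ops = ["fmul", "load", "int", "int"]
variable = wb_cycles(ops, equalize=False)  # natural latencies
padded = wb_cycles(ops, equalize=True)     # every op padded to the same depth
print(variable, has_port_conflict(variable))
print(padded, has_port_conflict(padded))
```

With natural latencies several ops reach write-back in the same cycle; once every op is padded to the same depth, write-back order equals issue order and each cycle has at most one writer.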
In-Order Superscalar Pipeline
- Fetch 2 instructions per cycle and issue them concurrently if 1 is integer/memory and 1 is FP
- Wider issue may increase regfile ports & bypassing cost
Reordering of Instructions
- Register vs. memory dependence
- Register dependence determined at decode stage
- Memory dependence determined after computing address
Instructions can be reordered as long as no dependence arrow ends up pointing backwards.
- General categories
- In-Order Issue + In-Order Completion
- In-Order Issue + Out-of-Order Completion
- Out-of-Order Issue + Out-of-Order Completion
Ensuring Correct Issues without Equalizing Pipeline Depths or Bypassing
- Considerations at instruction issuing
- Is FU available? Check `Busy`
- Is input data available? (RAW) Check the `Dest` column for the source registers
- Is it safe to write destination? (WAR, WAW)
- WAR: check the `Src1` & `Src2` columns for the destination register
- WAW: check the `Dest` column for the destination register
- Structural conflict at wb. stage?
- Entries added if no hazards
- Entries removed after wb. stage
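The issue-time checks above can be sketched as a lookup over a table of in-flight instructions. This is only an illustration: the `Entry` class and `can_issue` function are hypothetical names, the columns mirror `Dest`, `Src1`, and `Src2` from the notes, a table row stands in for a busy FU, and the write-back structural check is left out.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    """One in-flight instruction; fields mirror the Dest/Src1/Src2 columns."""
    fu: str
    dest: Optional[str]
    src1: Optional[str]
    src2: Optional[str]

def can_issue(table, fu, dest, src1, src2):
    """Return (ok, reason) for an instruction that wants to issue."""
    srcs = {s for s in (src1, src2) if s is not None}
    for e in table:
        if e.fu == fu:
            return False, "FU busy"   # the unit is still occupied
        if e.dest is not None and e.dest in srcs:
            return False, "RAW"       # a source is still being produced
        if dest is not None and dest in (e.src1, e.src2):
            return False, "WAR"       # an older instruction still reads dest
        if dest is not None and dest == e.dest:
            return False, "WAW"       # an older instruction still writes dest
    return True, "ok"

# One multiply in flight: f2 = f0 * f1
table = [Entry("fmul", "f2", "f0", "f1")]
print(can_issue(table, "fadd", "f4", "f2", "f3"))  # reads f2
print(can_issue(table, "fadd", "f0", "f5", "f6"))  # overwrites a pending source
print(can_issue(table, "fadd", "f7", "f5", "f6"))  # independent
```

Every check scans the whole table, which is the cost the simplified scoreboard structure is meant to reduce.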
Simplifying Data Structure (Scoreboarding)
Assuming in-order issue (with operands read at issue), WAR cannot occur -> no need for `Src1` & `Src2`.
Avoid WAW by ensuring each register appears at most once in the `Dest` column.
- `Busy[FU#]`: availability
- Bits hardwired to FUs
- `WP[reg#]`: write pending
- Bits set by issue stage & cleared by wb. stage
- FUs must carry `dest` field & `valid` flag (`we`, `ws`)
Check condition | Reference |
---|---|
FU available? | Busy[FU] |
RAW? | WP[src] |
WAW? | WP[dest] |
WAR? | not possible |
- Limitations
- Prevents later instructions with no dependencies from being issued while an earlier instruction is stalled
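The table of checks can be sketched with two bit vectors. This is a simplified model, not a hardware description: the `Scoreboard` class and its method names are hypothetical, and `Busy` is set and cleared in software here, whereas the notes describe those bits as hardwired to the FUs.

```python
class Scoreboard:
    """Busy[FU] and WP[reg] bits for an in-order-issue machine."""

    def __init__(self, fus, regs):
        self.busy = {fu: False for fu in fus}  # Busy[FU#]: unit occupied
        self.wp = {r: False for r in regs}     # WP[reg#]: write pending

    def can_issue(self, fu, dest, srcs):
        if self.busy[fu]:
            return False                       # FU not available
        if any(self.wp[s] for s in srcs):
            return False                       # RAW: a source is pending
        if dest is not None and self.wp[dest]:
            return False                       # WAW: dest already pending
        return True                            # WAR not possible here

    def issue(self, fu, dest):
        self.busy[fu] = True
        if dest is not None:
            self.wp[dest] = True               # set by the issue stage

    def write_back(self, fu, dest):
        self.busy[fu] = False
        if dest is not None:
            self.wp[dest] = False              # cleared by the wb. stage

sb = Scoreboard(["fmul", "fadd"], ["f0", "f1", "f2", "f3", "f4"])
sb.issue("fmul", "f2")                            # f2 = f0 * f1 in flight
print(sb.can_issue("fadd", "f4", ["f2", "f3"]))   # blocked by WP[f2]
sb.write_back("fmul", "f2")
print(sb.can_issue("fadd", "f4", ["f2", "f3"]))   # now free to issue
```

Each check is a single bit lookup instead of a search over in-flight entries, which is the simplification this section describes.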
Out-of-Order Issuing
- Issue buffer holds multiple instructions
- Decode adds instructions to buffer if no WAW or WAR
- Instructions in buffer can be issued if no RAW
- Limitations
- No significant improvement, because of data hazards
- Number of instructions in pipeline limited by number of registers
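A minimal sketch of selecting an instruction from the issue buffer. It assumes decode has already rejected WAW and WAR, so only RAW is checked; `pick_issuable` and the `(dest, srcs)` tuple layout are hypothetical choices made for this example.

```python
def pick_issuable(buffer, pending_writes):
    """Return the index of the oldest buffered instruction with no RAW hazard.

    buffer holds (dest, srcs) tuples in program order; pending_writes holds
    registers whose results are still being computed by issued instructions.
    """
    blocked = set(pending_writes)
    for i, (dest, srcs) in enumerate(buffer):
        if not any(s in blocked for s in srcs):
            return i
        # This instruction must wait, so its own result is pending too and
        # younger instructions that read it must also wait.
        blocked.add(dest)
    return None

buffer = [
    ("f4", ["f2", "f3"]),  # waits for f2
    ("f6", ["f4", "f1"]),  # waits for f4 from the instruction above
    ("f8", ["f0", "f1"]),  # independent, can go ahead of both
]
print(pick_issuable(buffer, pending_writes={"f2"}))
```

The independent third instruction can issue past the two stalled ones, which an in-order scoreboard would not allow.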
Tomasulo's Algorithm
WAW & WAR can be removed by register renaming.
- FU buffers - reservation stations (RS)
- Contains pending operands
- Registers replaced by pointers to RS
- On-the-fly register renaming
- More RSs than registers
- Common data bus (data + source) broadcasts results to all FUs
- Update pending data
- Load/store treated as FUs with RSs
- Branch prediction allows FP ops beyond basic blocks in FP queue
- Stages
- Issue
- If RS free
- Execute
- Execute if both operands ready
- Else wait for common data bus for result
- Write result (to reorder buffer)
- Write on common data bus
- Mark RS available
- Commit (to register/memory)
- ROB stores instruction in original fetch order
- Avoids WAW
- Precise exception -> ensures we can restart program
- Easy roll-back control for branch mispredictions
- BUT added H/W
- BUT CPI=1 barrier if not careful
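The renaming idea can be sketched in a few lines. This is a toy model under stated assumptions: the `Renamer` and `ReservationStation` classes are hypothetical, every instruction has two source registers, results are written straight to the register file, and the reorder buffer and commit stage are omitted.

```python
class ReservationStation:
    def __init__(self, tag):
        self.tag = tag
        self.busy = False
        self.vals = [None, None]  # operand values, once known
        self.tags = [None, None]  # producing RS tags while a value is pending

    def ready(self):
        return self.busy and all(t is None for t in self.tags)

class Renamer:
    def __init__(self, regfile, n_rs):
        self.regs = dict(regfile)  # architectural register values
        self.producer = {}         # reg -> tag of the RS that will write it
        self.stations = [ReservationStation(f"RS{i}") for i in range(n_rs)]

    def issue(self, dest, srcs):
        rs = next((s for s in self.stations if not s.busy), None)
        if rs is None:
            return None            # stall: no free RS
        rs.busy = True
        for i, src in enumerate(srcs):
            tag = self.producer.get(src)
            if tag is None:
                rs.vals[i], rs.tags[i] = self.regs[src], None  # value ready
            else:
                rs.vals[i], rs.tags[i] = None, tag             # wait on an RS
        self.producer[dest] = rs.tag  # rename: later readers point at this RS
        return rs

    def broadcast(self, tag, value):
        """Common data bus: deliver (tag, value) to every waiting consumer."""
        for s in self.stations:
            for i in range(2):
                if s.busy and s.tags[i] == tag:
                    s.vals[i], s.tags[i] = value, None
        for reg, t in list(self.producer.items()):
            if t == tag:           # only the latest writer updates the register
                self.regs[reg] = value
                del self.producer[reg]
        for s in self.stations:
            if s.tag == tag:
                s.busy = False     # mark RS available

regs = {"f0": 2.0, "f1": 3.0, "f2": 0.0, "f3": 1.0, "f4": 0.0, "f5": 5.0, "f6": 6.0}
r = Renamer(regs, n_rs=3)
rs0 = r.issue("f2", ["f0", "f1"])  # f2 = f0 * f1
rs1 = r.issue("f4", ["f2", "f3"])  # f4 = f2 + f3, waits on RS0
rs2 = r.issue("f2", ["f5", "f6"])  # f2 = f5 + f6, WAW with the first op
r.broadcast("RS2", 11.0)           # the younger writer finishes first
r.broadcast("RS0", 6.0)            # the older result reaches only RS1
print(rs1.vals, r.regs["f2"])
```

The second instruction receives the older value through its RS tag, while the register file keeps the younger result, so the WAW and WAR hazards on `f2` never force a stall.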
Superscalar Control Logic Scaling
- Each instruction checks against `W * L` instructions (`W`: issue width, `L`: instruction lifetime in the machine) -> growth in H/W ∝ `W * (W * L)`
- In-order machines: `L` related to pipeline latencies
- Check done during issue (interlocks, scoreboard)
- Out-of-order machines: `L` also includes time in instruction buffers
- Check done by broadcasting tags to waiting instructions at write back
- `W` increases -> larger instruction window needed to find enough parallelism to keep machine busy -> greater `L`