Contents
- Measurements of Programs & Computers
- Measurements
- Amdahl's Law
- Tools for Measuring
- Compiler Optimizations
- Common Compiler Optimizations
- Basics of Computer Architecture
- Pipelines
- Branch Prediction
- Out-of-Order Execution
- Superscalar & Simultaneous Multithreading (SMT)/Hyperthreading
Measurements of Programs & Computers
Measurements
- IPS: instructions per second
- FLOPS: floating point operations per second
- IPC: instructions per cycle
- CPI: cycles per instruction
Amdahl's Law
speedup = OldTime/NewTime
NewTime = OldTime x [(1 - f) + f/s]
speedup = 1 / (1 - f + f/s)
f = fraction of execution time the optimization applies to
s = improvement factor
Tools for Measuring
- S/W timers
- C library: <sys/times.h>
- OS-level timers: /usr/bin/time
- H/W timers & performance counters
- Built into processor chip: low-level architecture events, e.g. cache misses, branch mispredictions, ...
- S/W packages to make them easier to use, e.g. perf
Instrumentation
- Self-implemented instrumentation code
- gprof
- Periodically interrupt program
- Measure time spent & # of calls made in each function
- gcov
- Profile of execution within a function e.g. # of times each line of code was executed, which loops/conditional statements are important
- Instrumentation disturbs the system and slows down execution
Comparison
|         | gprof | gcov |
|:--------|:------|:-----|
| Compile | faster (inserts a counter per function) | slower (inserts a counter per line) |
| Size    | smaller | bigger |
| Runtime | slower (frequent interrupts) | faster |
Compiler Optimizations
- Preserve correctness
- Improve performance
- Lower CPI: schedule instructions, improve cache/memory behavior
- Fewer instructions
- Worth the effort
HLL ---> Front End ---> IR ---> Optimizer ---> IR ---> Code Generator ---> LLL
- Limitations
- Cannot change program behavior
- Inter-procedural analysis too expensive -> most analysis performed within procedures
- Hard to anticipate run-time inputs -> most analysis based on static information
- Must be conservative
- Programmers should:
- Select best algorithm
- Write readable & maintainable code
- Eliminate optimization blockers
- Memory aliasing: 2 different memory references specify a single location
- Procedural side-effects
Common Compiler Optimizations
- Machine independent
- Constant propagation
- Constant folding
- Common subexpression elimination
- Dead code elimination
- Loop invariant code motion
- Function inlining
- Space-speed tradeoff
- Overloading issue
- Reduction in strength: replace a costly operation with a simpler one
- Shifting (machine-dependent)
- Make use of registers
- Direct access
- Avoid bounds checking
- Machine dependent
- Instruction scheduling
- Loop unrolling
- Enables more aggressive instruction scheduling
- Parallel unrolling
Basics of Computer Architecture
- RISC: simpler instructions -> easier to pipeline
- 64-bit: larger databases, counters that don't overflow, better math performance -> but code size increases
Pipelines
- Important factors
- Branch prediction
- Data dependency
- Deeper pipeline -> larger misprediction penalty
- Multiple instruction issue -> more instructions flushed & refetched on a misprediction
- Object-oriented programming -> more indirect branches to be predicted
Branch Prediction
- Predict future based on history
- Global predictor: predict based on global history (outcomes of recent branches), possibly combined with local (per-branch) history
- (m, n) predictor: the last m branch outcomes (global history) select among n-bit saturating counters
Out-of-Order Execution
- Tolerate data dependencies by executing independent instructions first
- Mask cache miss delays
Superscalar & Simultaneous Multithreading (SMT)/Hyperthreading
[Diagram: issue timelines for the same nine instructions. A single-issue pipeline executes one instruction per cycle (9 cycles); a superscalar issues several independent instructions per cycle; an out-of-order superscalar reorders instructions to fill issue slots. A second timeline compares running 2 applications by context switching (one thread at a time) versus hyperthreading/SMT, which is faster because both threads' instructions share issue slots in the same cycles.]