Levels of optimization
understand how programs are compiled and executed
understand how processors and the memory system operate
measure program performance and identify bottlenecks
improve performance without destroying code modularity
Compiler: generally does not improve asymptotic efficiency. What it does handle:
register allocation
code selection, ordering, and scheduling
dead code elimination
eliminating minor inefficiencies
arithmetic
Limitations:
the compiler must not change program behavior
the compiler does not understand contracts (e.g. when the data range is more restricted than the type allows)
most analysis is done only within a procedure (newer versions of GCC also analyze across procedures within a single file)
analysis is based only on static information
Because a function call may have side effects, the compiler treats it as a black box and does not optimize across it. In particular, a loop-invariant expression that involves a function call is not moved out of the loop.
Remedies: declare the function inline so that the compiler can see its body and optimize across the call
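As a rough sketch of this remedy, assuming a hypothetical vec type and helper vec_length (neither appears in these notes): because vec_length is static inline in the same file, the compiler can see that it only reads v->len and is free to evaluate it once instead of on every iteration. vec_sum_hoisted shows the same code motion done by hand, which also works when the callee cannot be inlined.

```c
#include <stddef.h>

/* Hypothetical vector type, introduced only for this illustration. */
typedef struct {
    size_t len;
    const long *data;
} vec;

/* Visible body, no side effects: the call is no longer a black box. */
static inline size_t vec_length(const vec *v)
{
    return v->len;
}

/* As written, the loop test calls vec_length on every iteration; with the
   body visible, the compiler may hoist the load of v->len out of the loop. */
long vec_sum(const vec *v)
{
    long sum = 0;
    for (size_t i = 0; i < vec_length(v); i++)
        sum += v->data[i];
    return sum;
}

/* Manual code motion: the length is read once, before the loop. */
long vec_sum_hoisted(const vec *v)
{
    long sum = 0;
    size_t n = vec_length(v);
    for (size_t i = 0; i < n; i++)
        sum += v->data[i];
    return sum;
}
```

Both functions return the same value; only the amount of work the loop test may cost differs.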
When two pointers may refer to the same location in memory (aliasing), optimization is hindered: the result of a sequence of memory operations can depend on whether the inputs overlap, so the compiler has to assume that they might.
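Remedies:
add the keyword restrict to the pointer parameters, promising the compiler that they do not alias
move the aliased memory access out of the loop by accumulating in a local variable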
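The two remedies can be sketched on a row-sum routine. The function names and the row-sum task are assumptions chosen for illustration, not taken from these notes. In the first version the compiler has to write b[i] back to memory on every inner iteration, because a store to b could change what a later load from a sees.

```c
/* Row sums of an n x n matrix a (row-major) into b.
   b[i] is read and written in memory on every inner iteration,
   since b might overlap a. */
void sum_rows_aliased(const double *a, double *b, long n)
{
    for (long i = 0; i < n; i++) {
        b[i] = 0.0;
        for (long j = 0; j < n; j++)
            b[i] += a[i * n + j];
    }
}

/* Remedy 1: restrict (C99) promises that a and b do not overlap,
   so the compiler may keep the running sum in a register. */
void sum_rows_restrict(const double *restrict a, double *restrict b, long n)
{
    for (long i = 0; i < n; i++) {
        b[i] = 0.0;
        for (long j = 0; j < n; j++)
            b[i] += a[i * n + j];
    }
}

/* Remedy 2: accumulate in a local variable and store once per row,
   which takes the access to b out of the inner loop. */
void sum_rows_local(const double *a, double *b, long n)
{
    for (long i = 0; i < n; i++) {
        double val = 0.0;
        for (long j = 0; j < n; j++)
            val += a[i * n + j];
        b[i] = val;
    }
}
```

When a and b do not overlap, all three produce the same result. Calling the restrict version with overlapping arguments is undefined behavior, so that promise has to be kept by the caller.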
Superscalar processor: can issue and execute multiple instructions in one cycle. Instructions are retrieved from a sequential instruction stream and are scheduled dynamically.
If an instruction takes more than one cycle, it can be pipelined: the functional unit can start a new instruction before the previous one has finished.
Typical Hardware capability:
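Loop unrolling: perform two operations per loop iteration instead of one, which reduces the loop overhead (index update and branch) per element.
original code: one operation per iteration
unrolled code: two operations per iteration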
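Since the listings are missing here, the following is only a sketch of what the two versions might look like, assuming the loop computes the product of an array of doubles (the function names are hypothetical).

```c
/* Baseline: one multiplication per iteration. */
double product(const double *d, long n)
{
    double acc = 1.0;
    for (long i = 0; i < n; i++)
        acc = acc * d[i];
    return acc;
}

/* Unrolled by two: two multiplications per iteration, with the same
   left-to-right grouping as the baseline, so the result is identical. */
double product_unroll2(const double *d, long n)
{
    double acc = 1.0;
    long i;
    for (i = 0; i + 1 < n; i += 2)
        acc = (acc * d[i]) * d[i + 1];
    for (; i < n; i++)          /* leftover element when n is odd */
        acc = acc * d[i];
    return acc;
}
```

Each multiplication still waits for the previous one through acc, so unrolling alone mainly saves loop overhead.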
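Loop reassociation: regroup the operations so that part of each iteration's work does not depend on the accumulator.
This may change the result for floating-point data, because floating-point arithmetic is not associative.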
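A minimal sketch of reassociation, again assuming a product over an array of doubles (product_reassoc is a hypothetical name). The only change from a plain two-way unrolled loop is where the parentheses sit: d[i] * d[i + 1] does not depend on acc, so the processor can compute it while the previous update of acc is still in flight.

```c
/* Unrolled by two with the multiplications regrouped:
   acc * (d[i] * d[i+1]) instead of (acc * d[i]) * d[i+1]. */
double product_reassoc(const double *d, long n)
{
    double acc = 1.0;
    long i;
    for (i = 0; i + 1 < n; i += 2)
        acc = acc * (d[i] * d[i + 1]);
    for (; i < n; i++)          /* leftover element when n is odd */
        acc = acc * d[i];
    return acc;
}
```

For integer-valued inputs that stay well within range the result matches the sequential order exactly. With values near the limits of double, the grouping can decide whether an intermediate product overflows or is rounded differently, which is why the compiler will not make this change on its own under default settings.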
Separate accumulators: keep several independent partial products inside the loop and combine them at the end, much like merging sub-results in merge sort.
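A sketch with two accumulators, assuming the same array-product loop as an example (product_2acc is a hypothetical name). Even-indexed and odd-indexed elements are multiplied into separate variables, so the two chains of multiplications are independent and can overlap in the hardware.

```c
/* Two independent accumulators, combined once after the loop. */
double product_2acc(const double *d, long n)
{
    double acc0 = 1.0;   /* product of even-indexed elements */
    double acc1 = 1.0;   /* product of odd-indexed elements  */
    long i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 = acc0 * d[i];
        acc1 = acc1 * d[i + 1];
    }
    for (; i < n; i++)   /* leftover element when n is odd */
        acc0 = acc0 * d[i];
    return acc0 * acc1;
}
```

As with reassociation, the order of operations differs from the sequential loop, so floating-point results are not guaranteed to be bit-identical.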
Programming with AVX2 (YMM Registers, SIMD Operations)
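A YMM register is 256 bits wide and can hold several values at once: 32 bytes, 16 shorts, 8 ints, 8 floats, or 4 doubles.
A single SIMD instruction applies the same operation to all of the values in the register in parallel.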
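One way to sketch this in C without committing to specific intrinsics is the GCC/Clang vector extension, which is an assumption here and not something these notes prescribe. The type v8sf below is a hypothetical name for eight floats packed into 32 bytes, the width of one YMM register. When the build enables AVX (for example with -mavx2), the compiler can map the vector multiply to a single YMM instruction; otherwise it falls back to narrower operations with the same result.

```c
#include <stddef.h>
#include <string.h>

/* Eight floats in 32 bytes: the width of one YMM register.
   vector_size is a GCC/Clang extension, not standard C. */
typedef float v8sf __attribute__((vector_size(32)));

/* Element-wise z[i] = x[i] * y[i], eight lanes per step. */
void vec_mul(const float *x, const float *y, float *z, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        v8sf a, b, c;
        memcpy(&a, x + i, sizeof a);   /* load 8 floats, no alignment needed */
        memcpy(&b, y + i, sizeof b);
        c = a * b;                     /* one multiply covers all 8 lanes */
        memcpy(z + i, &c, sizeof c);   /* store 8 results */
    }
    for (; i < n; i++)                 /* scalar tail for the remainder */
        z[i] = x[i] * y[i];
}
```

The memcpy calls are there to express unaligned loads and stores portably; compilers typically turn them into plain vector moves.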
Branch prediction default behavior (95% accuracy):
a backward branch is probably a loop, so predict taken
a forward branch is predicted not taken
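The default rule above can be written down as a tiny model. This is only an illustration of the heuristic, not how any particular processor implements prediction; predict_taken and its parameters are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Static guess: a branch whose target lies at a lower address than the
   branch itself jumps backward (typically a loop), so predict taken;
   a forward branch is predicted not taken. */
bool predict_taken(uint64_t branch_addr, uint64_t target_addr)
{
    return target_addr < branch_addr;
}
```

For a loop, this guess is wrong only on the final iteration, when the backward branch falls through.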