Each core has its own registers and L1 cache
Cores might have separate larger caches (e.g. L2) too
Instructions can be executed in parallel
(other instructions don't have to wait for a divide if their results don't depend on it)
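ARM: relaxed memory model, memory operations to different addresses may be reordered unless barriers are used
x86: strong memory model (total store order), close to sequential consistency, except that a store may be reordered with a later load from an unrelated address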
Sequentially consistent: each thread executes its operations in program order, and any interleaving of the threads is allowed
need proper cache/memory behavior
need proper intra-thread ordering constraints
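
A minimal sketch of the classic store-buffering test, to make the guarantee concrete. The function names are made up for illustration. Each thread stores to one variable and then loads the other. Under sequential consistency at least one thread must see the other's store, so both reads returning 0 is forbidden. With std::memory_order_relaxed instead of seq_cst that outcome would be permitted, even on x86, because of the store/load reordering.

```cpp
#include <atomic>
#include <thread>

// Runs the store-buffering test once.
// Returns true if both threads read 0, which sequential consistency forbids.
bool both_read_zero_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread a([&] {
        x.store(1, std::memory_order_seq_cst);   // store, then load the other variable
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    a.join();  // join makes r1 and r2 safe to read here
    b.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the test and counts forbidden outcomes (0 is expected with seq_cst).
int count_forbidden(int runs) {
    int forbidden = 0;
    for (int i = 0; i < runs; ++i) {
        if (both_read_zero_once()) ++forbidden;
    }
    return forbidden;
}
```

Calling count_forbidden(1000) should return 0 on any conforming implementation. A test like this cannot prove the guarantee, it can only fail to find a violation.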
More detail: Youtube
Task: sum 2^30 numbers together
Result:
Solution:
(no speedup, only adds overhead)
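
A sketch of one common reason a threaded sum gains nothing, assuming the work is split into equal chunks across threads. All names here are hypothetical, and this is not necessarily the approach these notes had in mind. In variant A every addition goes through one shared atomic, so the cores contend for the same cache line and the synchronization cost can outweigh the parallelism. In variant B each thread keeps a private accumulator and writes its result once.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

using Data = std::vector<std::uint32_t>;

// Variant A: all threads add every element to one shared atomic total.
// Correct, but every fetch_add contends for the same cache line.
// Assumes num_threads >= 1.
std::uint64_t sum_shared_atomic(const Data& data, unsigned num_threads) {
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                total.fetch_add(data[i], std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}

// Variant B: each thread sums its chunk into a private local variable
// and publishes it once; the partial sums are combined after joining.
// Assumes num_threads >= 1.
std::uint64_t sum_partial(const Data& data, unsigned num_threads) {
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&, t, begin, end] {
            std::uint64_t local = 0;  // no sharing inside the loop
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;       // one write per thread
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}
```

Both variants return the same sum. Only timing them on the actual 2^30-element input would show how much the shared counter costs on a given machine.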
T: total sequential time required
p: fraction of the total that can be sped up (0 <= p <= 1)
k: speedup factor
T_k = pT/k + (1-p)T
(maximum speedup as k -> inf: T_k -> (1-p)T, so T/T_k -> 1/(1-p))