Lecture 021

Parallel Computing Hardware

each core has its own registers and L1 cache
cores might have their own main cache too

instructions can be executed in parallel
(other instructions don't have to wait for a divide if their results don't depend on it)

multiple instruction control streams (hardware threads) can share the same functional units

Cache Synchronization

when a thread requests a value that another thread has changed, the other thread (the core holding the changed copy) is responsible for supplying it

Memory Consistency Models

arm: relaxed; essentially no ordering guaranteed between independent memory operations
x86: close to a strict sequential memory consistency model (except that a store may be reordered with a later load from an unrelated address)

Sequentially Consistent: each thread's operations happen in program order, and the overall result matches some interleaving of the threads

need proper cache/memory behavior
need proper intra-thread ordering constraints
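
As a rough illustration of the definition above, the following Python sketch enumerates every interleaving of two tiny threads and collects the outcomes that sequential consistency allows. The two-thread program (a store followed by a load of the other variable) and the names `T0`, `T1`, `interleavings`, and `run` are assumptions made for this example, not taken from the lecture.

```python
# Thread 0: x = 1; r0 = y      Thread 1: y = 1; r1 = x
# Each operation is (kind, variable, value-or-register-name).
T0 = [("store", "x", 1), ("load", "y", "r0")]
T1 = [("store", "y", 1), ("load", "x", "r1")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for kind, var, arg in schedule:
        if kind == "store":
            mem[var] = arg
        else:
            regs[arg] = mem[var]
    return regs["r0"], regs["r1"]

# Set of (r0, r1) results over all program-order-preserving interleavings.
outcomes = {run(s) for s in interleavings(T0, T1)}
print(sorted(outcomes))
```

The outcome (0, 0) never appears: in any interleaving, at least one store runs before both loads. On hardware that lets a store be reordered with a later load to a different address, (0, 0) becomes possible, which is why such hardware is not fully sequentially consistent.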

More detail: Youtube

Thread-Level Parallelism

Summation

Task: sum 2^30 numbers together

Result:

no sync
- wrong answer, running time improves up to 8 threads
sync
- correct answer, much worse running time
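
One way to see why the unsynchronized version gives a wrong answer: an update like `sum += x` is really a load, an add, and a store, and two threads can interleave those steps. The sketch below is a deterministic simulation of one such interleaving, not real threads; the `SimThread` class, the values 10 and 20, and the schedule are all hypothetical.

```python
shared = {"sum": 0}

class SimThread:
    """Simulated thread that adds `value` to shared['sum'] in three steps."""
    def __init__(self, value):
        self.value = value
        self.reg = None          # models a CPU register

    def load(self):
        self.reg = shared["sum"]

    def add(self):
        self.reg += self.value

    def store(self):
        shared["sum"] = self.reg

a, b = SimThread(10), SimThread(20)

# Bad interleaving: both threads load before either one stores.
for step in (a.load, b.load, a.add, b.add, a.store, b.store):
    step()

lost_update_result = shared["sum"]   # a's update was overwritten by b's store
print(lost_update_result)
```

The correct total would be 30, but this schedule leaves 20. A lock around the three steps rules out the interleaving, at the cost of serializing the threads on every update.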

Solutions (each thread gets its own accumulator, combined at the end):

Accumulate in contiguous array elements
Accumulate in spaced-apart array elements
Accumulate in registers

spacing the accumulators: ensures no two threads' accumulators sit in the same cache block, so there is no False Sharing (lower miss rate across cores)

(plot: accumulator spacing vs. performance, with a 64-byte block size)
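
To make the spacing argument concrete, this sketch computes which cache block each thread's accumulator lands in. It assumes 8-byte accumulators, a 64-byte block, and an array that starts on a block boundary; `block_of` and `blocks_used` are hypothetical helper names.

```python
BLOCK_BYTES = 64   # cache block size from the notes
ELEM_BYTES = 8     # assumed size of one accumulator (e.g. a 64-bit integer)

def block_of(thread_id, spacing):
    """Cache block index holding thread_id's accumulator.

    Thread t uses array element t * spacing; the array is assumed to
    start at a block-aligned address.
    """
    byte_offset = thread_id * spacing * ELEM_BYTES
    return byte_offset // BLOCK_BYTES

def blocks_used(num_threads, spacing):
    """Block index for each thread's accumulator."""
    return [block_of(t, spacing) for t in range(num_threads)]

for spacing in (1, 2, 4, 8):
    print(spacing, blocks_used(8, spacing))
```

With spacing 1 all eight accumulators share one block, so every write by one core invalidates the block in the others. At spacing 8 each accumulator has a block to itself, so under these assumptions no false sharing remains.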

Quick Sort

(parallelizing does not necessarily give a speedup; threads add overhead)

Amdahl’s Law

T: total sequential time required
p: fraction of total that can be sped up (0 <= p <= 1)
k: speedup factor

T_k = pT/k + (1-p)T

(maximum speedup is reached as k -> inf, where T_inf = (1-p)T)
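
A quick numeric check of the formula. The values T = 10 and p = 0.9 are made up for illustration, and `amdahl_time` is a hypothetical helper name.

```python
def amdahl_time(T, p, k):
    """Time after speeding up fraction p of total time T by factor k."""
    return p * T / k + (1 - p) * T

T, p = 10.0, 0.9   # illustrative values only
for k in (1, 3, 9, 1000):
    t_k = amdahl_time(T, p, k)
    print(f"k={k:5d}  T_k={t_k:.3f}  speedup={T / t_k:.2f}")

# As k grows, T_k approaches (1 - p) * T, so the speedup T / T_k
# is bounded by 1 / (1 - p) no matter how large k gets.
limit_speedup = 1 / (1 - p)
```

Even with 90% of the work sped up, the remaining sequential 10% caps the overall speedup at about 10x.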
