Lecture 021

Parallel Computing Hardware

Typical Multicore Processor

Out-of-Order Processor Structure

Hyperthreading Implementation

Cache Synchronization

Cache Synchronization Problem

Cache Synchronization Solution

Memory Consistency Models

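ARM: relaxed (weakly ordered) memory model; memory operations to different addresses may be reordered unless barriers are used.
x86: close to sequential consistency (total store order); the exception is that a store may be reordered after a later load from an unrelated address.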

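Sequentially consistent: each thread's operations take effect in program order, and the overall execution is some interleaving of the threads' operations. Any such interleaving is allowed.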
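To make the definition concrete, here is a minimal sketch that enumerates every interleaving of two tiny threads and collects the results a sequentially consistent machine could produce. The two-thread program (thread A stores x then loads y, thread B stores y then loads x) and all names in the code are illustrative assumptions, not taken from the lecture.

```python
from itertools import combinations

# Each thread is a list of operations executed in program order.
# ("store", var, value) writes shared memory;
# ("load", var, reg) reads shared memory into a register.
THREAD_A = [("store", "x", 1), ("load", "y", "r1")]
THREAD_B = [("store", "y", 1), ("load", "x", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each thread's own order."""
    n = len(a) + len(b)
    for a_slots in combinations(range(n), len(a)):
        a_iter, b_iter = iter(a), iter(b)
        yield [next(a_iter) if i in a_slots else next(b_iter)
               for i in range(n)]


def run(schedule):
    """Execute one interleaving against a single shared memory."""
    memory = {"x": 0, "y": 0}
    regs = {}
    for op, var, arg in schedule:
        if op == "store":
            memory[var] = arg
        else:
            regs[arg] = memory[var]
    return regs["r1"], regs["r2"]


def sc_outcomes():
    """Set of (r1, r2) results reachable under sequential consistency."""
    return {run(s) for s in interleavings(THREAD_A, THREAD_B)}


if __name__ == "__main__":
    print(sorted(sc_outcomes()))
```

In this sketch the result (r1, r2) = (0, 0) never appears, because in any interleaving at least one store precedes both loads. On x86, where a store can be reordered after a later load from a different address, that result can be observed on real hardware, which is exactly the exception noted above.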

More detail: Youtube

Thread-Level Parallelism

Summation

Task: sum 2^30 values.
Result:

Solution:

  1. Accumulate in contiguous array elements
  2. Accumulate in spaced-apart array elements
  3. Accumulate in registers
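The three strategies differ only in where each thread keeps its running total. The sketch below shows that structure in Python; the function name `parallel_sum`, the `mode` and `spacing` parameters, and the workload (summing 0..n-1) are assumptions made for illustration. CPython's global interpreter lock and boxed integers mean this code will not reproduce the timing differences measured in the lecture; it only illustrates the data layout.

```python
import threading


def parallel_sum(n, nthreads, mode="register", spacing=8):
    """Sum 0..n-1 with nthreads threads.

    mode = "contiguous": thread t accumulates into psum[t]
    mode = "spaced":     thread t accumulates into psum[t * spacing]
    mode = "register":   thread t accumulates in a local variable
                         and writes to psum once at the end
    """
    stride = spacing if mode == "spaced" else 1
    psum = [0] * (nthreads * stride)        # shared accumulator array
    chunk = (n + nthreads - 1) // nthreads  # elements per thread, rounded up

    def worker(tid):
        start, end = tid * chunk, min((tid + 1) * chunk, n)
        slot = tid * stride
        if mode == "register":
            total = 0                       # thread-private running total
            for i in range(start, end):
                total += i
            psum[slot] = total              # single write to shared memory
        else:
            for i in range(start, end):
                psum[slot] += i             # every step updates shared memory

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(psum)


if __name__ == "__main__":
    for mode in ("contiguous", "spaced", "register"):
        print(mode, parallel_sum(1000, 4, mode))
```

In a compiled language, the contiguous layout puts several threads' accumulators in the same cache block, so every update forces that block to move between cores. Spacing the slots apart, or keeping the total in a register, avoids that traffic.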

accumulation methods

Spacing vs. performance (64-byte cache block size)

register accumulation

Quick Sort

quick sort

quick sort code 1
(This version gives no speedup; it only adds overhead.)

quick sort code 2

quick sort code 3

quick sort performance

Amdahl’s Law

T: total sequential time required
p: fraction of the total that can be sped up (0 <= p <= 1)
k: speedup factor

T_k = pT/k + (1-p)T
Maximum speedup is approached as k -> infinity: T_inf = (1-p)T, so the speedup T/T_k can never exceed 1/(1-p).
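As a quick numeric check of the formula, the following sketch evaluates T_k and the resulting speedup. The function names and the sample values (T = 10, p = 0.9, k = 9) are illustrative assumptions.

```python
def amdahl_time(T, p, k):
    """Time after speeding up fraction p of the work by a factor k."""
    return p * T / k + (1 - p) * T


def amdahl_speedup(p, k):
    """Overall speedup T / T_k; independent of T."""
    return 1.0 / (p / k + (1 - p))


def max_speedup(p):
    """Limit of the speedup as k grows without bound."""
    return float("inf") if p == 1 else 1.0 / (1 - p)


if __name__ == "__main__":
    T, p, k = 10.0, 0.9, 9
    print(amdahl_time(T, p, k))   # about 2.0: 1.0 for the sped-up part, 1.0 for the rest
    print(amdahl_speedup(p, k))   # about 5
    print(max_speedup(p))         # about 10, the ceiling set by the serial 10%
```

Even with 90% of the work sped up ninefold, the overall gain is only about 5x, and no value of k can push it past about 10x.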
