Lecture 002 - Parallelism and Cost Models

Concurrency and Parallelism

Threads

Threads: thread of execution

spawn: takes an expression, creates a thread to execute that expression concurrently, and returns the thread
sync: takes a thread and blocks until that thread completes its execution

Multithreaded programs rely on a thread scheduler or scheduler for short.

Example 1

let t = spawn(lambda (). fib n)
    u = spawn(lambda (). fib (2 * n))
    ((), ()) = (sync t, sync u)
in
  ()
end

In the example above, the threads compute the desired Fibonacci numbers but have no way of communicating the results back.

Example 2

let (r, s) = (ref 0, ref 0)
    t = spawn(lambda (). r <- fib n)
    u = spawn(lambda (). s <- fib (2 * n))
    ((), ()) = (sync t, sync u)
in
  (!r, !s)
end

Now the threads can report their results back through references.
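For readers who want to run Example 2, here is a minimal sketch in Python. The helpers `spawn`, `sync`, and `fib_pair` are illustrative names, not part of SPARC, and one-element lists stand in for the references `r` and `s`.

```python
import threading

def fib(n):
    """Naive Fibonacci, standing in for the notes' fib."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def spawn(thunk):
    """Start a thread that runs thunk() and return the thread handle."""
    t = threading.Thread(target=thunk)
    t.start()
    return t

def sync(t):
    """Block until thread t has finished."""
    t.join()

def fib_pair(n):
    # One-element lists play the role of the references r and s.
    r, s = [0], [0]

    def write_r():
        r[0] = fib(n)

    def write_s():
        s[0] = fib(2 * n)

    t = spawn(write_r)
    u = spawn(write_s)
    sync(t)
    sync(u)
    return r[0], s[0]  # corresponds to (!r, !s)

print(fib_pair(10))  # (55, 6765)
```

Each thread writes to its own cell, so the two writes never touch the same location.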

// TODO: https://www.diderot.one/courses/136/books/578/chapter/8073

Performance:

$R^*$ : throughput of the fastest sequential baseline
$R_1$ : throughput on $1$ processor
$R_P$ : throughput on $P$ processors
$T^*$ : time of fastest sequential
$T_P$ : time of parallel on $P$ processors
Work-Efficiency: $R_1 / R^*$
Self-Speedup: $R_P / R_1$ , $T_1 / T_P$
Parallel throughput: $R_P$
Overhead: $T_1 / T^*$

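Concurrency and Parallelism

Concurrent Problem: a problem is concurrent if its specification involves multiple things happening at the same time.

Parallel Solution: an algorithm is parallel if it performs multiple tasks at the same time.

Concurrency is a property of a problem, and parallelism is a property of an implementation or a solution. Both concurrent and non-concurrent problems typically admit parallel algorithms as solutions.

Sequential elision: In SPARC, tuples (a, b) are evaluated sequentially, while (a || b) (pronounced "par") evaluates its two components in parallel. Replacing every (a || b) with (a, b) gives the sequential elision of a program. In the absence of side effects both versions compute the same result, so replacing (a, b) with (a || b) buys parallel performance without changing the meaning.

Side effects (mutable state): break sequential elision, because the result may then depend on the order in which the two components run.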

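You cannot assume that operations are atomic. Each update below first reads x and then writes x, and the reads and writes of the two threads can interleave, so the following code prints either 1 or 2.

let x = ref 0
  ((), ()) = (x <- !x + 1) || (x <- !x + 1)
in
  print (!x)
end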
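To see where the two outcomes come from, the sketch below enumerates every interleaving of the two threads by hand instead of relying on real thread timing. The model is an assumption of this sketch: each thread performs exactly two steps, a read of x followed by a write of the value it read plus one.

```python
from itertools import permutations

def run(schedule):
    """Execute one interleaving; each thread does a read step, then a write step."""
    x = 0
    local = {}  # value each thread read from x
    for tid in schedule:
        if tid not in local:
            local[tid] = x        # first step of this thread: read x
        else:
            x = local[tid] + 1    # second step: write back the stale copy + 1
    return x

def possible_outcomes():
    # Every ordering of the four steps A, A, B, B (duplicates removed by the set).
    schedules = set(permutations(["A", "A", "B", "B"]))
    return {run(s) for s in schedules}

print(sorted(possible_outcomes()))  # [1, 2]
```

The outcome 1 arises whenever both threads read 0 before either one writes.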

Data Race (determinacy race): occurs when multiple threads access the same piece of data concurrently and at least one of them writes to it, producing a non-deterministic outcome.

Benign Data Race: a data race that does not impact the correctness of the program.

Why use mutable state?
- it is impossible to avoid mutable state at the hardware level
- mutable state enables more efficient use of memory

Critical Section: a part of the code that cannot be executed by more than one thread at the same time. It must be executed in mutual exclusion.

Mutual Exclusion Problem (Critical Section Problem): the problem of designing algorithms or protocols for ensuring mutual exclusion.

Solving Mutual Exclusion Problem:

Spin Locks: the thread "busy waits" (a while loop repeatedly checking a condition) until the critical section is clear of other threads
Blocking Locks: the waiting thread is suspended. When the critical section is clear, the blocked thread receives a signal allowing it to proceed. (The term mutex mostly refers to blocking locks.)
Atomic read-modify-write instructions (nonblocking operations): read and modify the contents of a memory location atomically, allowing a thread to operate safely on shared data. They are implemented directly in hardware.

Nonblocking instructions can be used to implement more complex nonblocking concurrent data structures.

Compare-and-swap (cas : word ref -> word * word -> word): an atomic read-modify-write instruction. A call cas input (equal, output) performs the following steps atomically:

check whether the content of input (following the pointer) is equal to equal
if so, write output to input
otherwise, leave input unchanged
return the content of input (following the pointer)
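The steps above can be sketched in Python as follows. Two things are assumptions of this sketch rather than facts from the notes: a lock stands in for the atomicity that hardware provides, and `cas` returns the content read before any write. The names `Ref`, `atomic_increment`, and `count_with_threads` are hypothetical.

```python
import threading

class Ref:
    """A mutable cell. The lock only models the atomicity of the hardware instruction."""
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

def cas(ref, expected, new):
    """Atomically: if ref holds expected, store new. Returns the content read first."""
    with ref._lock:
        old = ref.value
        if old == expected:
            ref.value = new
        return old

def atomic_increment(ref):
    """Retry until no other thread changed ref between our read and our cas."""
    while True:
        old = ref.value
        if cas(ref, old, old + 1) == old:
            return

def count_with_threads(num_threads, increments_each):
    counter = Ref(0)

    def worker():
        for _ in range(increments_each):
            atomic_increment(counter)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

The retry loop is the usual way to build a safe update from compare-and-swap: a failed `cas` means another thread won the race, so the thread rereads and tries again, and no increment is lost.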

Fetch-and-add (ffa: word ref -> word -> word): an atomic read-modify-write instruction that updates a memory location and returns its original content. A call ffa r delta performs the following atomically (where + is the addition operation on machine words):

let v = !r
  r <- !r + delta
in
  v
end

Algorithm Analysis

There are two levels of abstraction

Asymptotic Analysis: how the cost of an algorithm grows with the input size, ignoring constant factors
Cost Model: machine-based and language-based cost models

Asymptotic Analysis

We use numeric functions to capture the cost of an algorithm, and we are only interested in the growth rate of these functions.

Asymptotic Dominance: let $f(\cdot)$ and $g(\cdot)$ be two numeric functions. $f$ asymptotically dominates $g$ if there exist constants $c > 0$ and $n_0 > 0$ such that for all $n > n_0$ , $g(n) \leq c \cdot f(n)$ . (A sufficient condition is that $\lim_{n\to \infty} \frac{g(n)}{f(n)}$ exists and is finite.)

Example: prove $f(n) = n$ asymptotically dominates $g(n) = \ln^k n$ for all $k$ .

Solution: using L'Hopital's rule

$$ \begin{align} \lim_{n \to \infty} \frac{g(n)}{f(n)} &= \lim_{n \to \infty} \frac{\ln^k n}{n}\\ &= \left(\lim_{n \to \infty}\frac{\ln n}{n^{1/k}}\right)^k\\ &= \left(\lim_{n \to \infty}\frac{1/n}{(1/k)n^{1/k - 1}}\right)^k\\ &= \left(\lim_{n \to \infty}\frac{k}{n^{1/k}}\right)^k\\ &= 0\\ \end{align} $$
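As a numeric sanity check (not a proof), the ratio $\ln^k n / n$ can be evaluated at growing $n$. The sample sizes and the helper names `ratio` and `ratios` are choices made for this sketch.

```python
import math

def ratio(n, k):
    """g(n) / f(n) for g(n) = ln(n)**k and f(n) = n."""
    return math.log(n) ** k / n

def ratios(k, sizes=(10**3, 10**6, 10**9, 10**12)):
    """The ratio at a few increasingly large input sizes."""
    return [ratio(n, k) for n in sizes]

for n, r in zip((10**3, 10**6, 10**9, 10**12), ratios(3)):
    print(n, r)
```

For $k = 3$ the printed ratios shrink toward zero as $n$ grows, which is what the limit predicts. For small $n$ the ratio can still be increasing, which is why the definition only constrains $n > n_0$.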

The dominance relation is a preorder: it is reflexive and transitive.

Note that it is not total: for $f(n) = n \sin (n)$ and $g(n) = n \cos(n)$ , neither function dominates the other.

Common Costs:

linear: $O(n)$
sublinear: $o(n)$
quadratic: $O(n^2)$
polynomial: $O(n^k)$ , for some constant $k$
superpolynomial: $\omega(n^k)$ , for every constant $k$
logarithmic: $O(\lg n)$
polylogarithmic: $O(\lg^k n)$ , for some constant $k$
exponential: $O(a^n)$ for some constant $a > 1$

Cost Model

Machine Based Model

Analyze cost by machine hardware execution.

Random Access Machine (RAM) model: a single processor reads and writes memory through instructions

we count each instruction as one unit of cost
it is easy to translate code into this model
it is unrealistic, since different instructions can have drastically different running times (especially once cache effects are considered)

Parallel Random Access Machine (PRAM): consists of $p$ sequential random access machines sharing the same memory (with processor ids $\{0, ..., p-1\}$ )

different processors can execute different instructions
SIMD Model: the special case where all processors execute the same instruction on different data
algorithms are very hard to analyze when $p$ is not exactly the input size of the algorithm

Language Based Model

Analyze cost by programming language.

Work-Span Model: analyze work and span that can be defined for any language.

Work $W(e)$ for expressions $e$
Span $S(e)$ for expressions $e$

SPARC Cost Model: `Eval(e)` evaluates expression $e$ and returns the result. Notation `[v/x]e` indicates that all free occurrences of $x$ in expression $e$ are replaced by value $v$ .

Note that all rules for work and span are the same except for $(e_1 || e_2)$ . In this course, we use the convention that parallelism has to be stated explicitly using $||$ , even though it would be safe to evaluate the operands of an expression such as $e_1 + e_2$ in parallel.
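For reference, the tuple rules this remark refers to can be sketched as follows. The constant $1$ charged for forming the pair is an assumption of this sketch, and the course's rule table may present it differently.

$\begin{align*} W(e_1, e_2) &= 1 + W(e_1) + W(e_2) & S(e_1, e_2) &= 1 + S(e_1) + S(e_2)\\ W(e_1 || e_2) &= 1 + W(e_1) + W(e_2) & S(e_1 || e_2) &= 1 + \max(S(e_1), S(e_2))\\ \end{align*}$

Work adds up in both cases. Only the span of the parallel tuple differs, because it takes the longer of the two branches instead of their sum.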

Average Parallelism: work over span. This tells us how many processors we can use efficiently.

$\bar{P} = \frac{W}{S}$

Work efficient: a parallel algorithm is work efficient if it performs asymptotically the same work as the best sequential algorithm.

Observably Work Efficient: a parallel algorithm is observably work efficient if, on a single core/processor, it performs similarly to the best sequential algorithm.

We know that

$\begin{align*} T_p \geq S \land T_p \geq \lceil\frac{W}{P}\rceil \implies T_p \geq \max\left( S, \lceil\frac{W}{P}\rceil \right) \end{align*}$

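$T_p \geq \lceil\frac{W}{P}\rceil$ because $P$ processors complete at most $P$ units of work per step. This bound is met only if dependencies never leave a processor idle.

$T_p \geq S$ because the span is a chain of dependent operations that must run one after another, no matter how many processors are available.

In addition, for a greedy scheduler, we have

$T_p < \lceil\frac{W}{P}\rceil + S$

(this bound is proved in the Scheduling section below)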

and therefore

$\max \left(\lceil\frac{W}{P}\rceil, S\right) \leq T_p < \lceil\frac{W}{P}\rceil + S$

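We can also express this upper bound on the parallel time $T_p$ in terms of the average parallelism (ignoring the ceiling):

$\begin{align*} T_p <& \frac{W}{P} + S\\ =& \frac{W}{P} + \frac{W}{\bar{P}} \tag{by definition}\\ =& \frac{W}{P} \left(1 + \frac{P}{\bar{P}}\right)\\ \end{align*}$

Observe that when $\bar{P} \gg P$ , the factor $1 + \frac{P}{\bar{P}}$ is close to $1$ , so $\frac{W}{P} \leq T_P < \frac{W}{P}\left(1 + \frac{P}{\bar{P}}\right) \approx \frac{W}{P}$ : the span contributes almost nothing to the running time and the speedup is close to perfect. Therefore $\bar{P}$ is a good measure of parallelism.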
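The bounds above are easy to evaluate for concrete numbers. The sketch below does so for hypothetical values of work, span, and processor count. The numbers are made up for illustration, and the function names are not from the course.

```python
def time_bounds(work, span, procs):
    """Return (lower, upper) with lower <= T_P < upper for a greedy scheduler."""
    per_proc = -(-work // procs)  # integer ceiling of work / procs
    return max(span, per_proc), per_proc + span

def average_parallelism(work, span):
    """Work divided by span."""
    return work / span

# Hypothetical computation: W = 10^6, S = 20, P = 8.
print(time_bounds(10**6, 20, 8))       # (125000, 125020)
print(average_parallelism(10**6, 20))  # 50000.0
```

Here the average parallelism (50000) is far larger than the 8 processors, so the lower and upper bounds on $T_P$ differ by only the span.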

Speedup $S_p$ : ratio of sequential time $T_s$ to parallel time $T_p$ .

$S_p = \frac{T_s}{T_p}$

Perfect Speedup: $S_p = P$ (speedup equal to the number of processors)

Scheduling itself requires work, which the analysis above does not account for.

Scheduling

Symbols:

$T$ : number of steps
$P$ : number of processors
$W$ : work
$S$ : span

Scheduling Algorithm: an algorithm for mapping parallel tasks to available processors.

How can we schedule with an arbitrary dependency graph? We do not want the scheduler to analyze the whole dependency graph, since that analysis is itself costly.

Greedy Scheduler: whenever a processor is free and a task is ready to run, the scheduler immediately assigns the task to that processor.

Greedy Scheduling Principle: total time is bounded by: (where $W$ is work and $S$ is span)

$T_P < \lceil\frac{W}{P}\rceil + S \leq 2 \times \text{Optimal}$

Proof: $T_p < \lceil\frac{W}{P}\rceil + S$

With a greedy scheduler, consider the dependence graph of tasks and observe that every time step satisfies at least one of the following two cases:
- all processors are busy
- every ready task is running, so the step completes a level of the dependence graph

Now, $T_p$ is the number of time steps, which is at most the number of steps of the first kind plus the number of steps of the second kind. There are at most $\lfloor\frac{W}{P}\rfloor$ steps in which all processors are busy, since each such step performs $P$ units of work, and at most $S$ steps that complete a level, since each such step shortens the longest remaining dependency chain by one.

Therefore we have $T_p \leq \lfloor\frac{W}{P}\rfloor + S$ . The looser bound $T_p < \lceil\frac{W}{P}\rceil + S$ follows: if $P$ does not divide $W$ then $\lfloor\frac{W}{P}\rfloor < \lceil\frac{W}{P}\rceil$ , and otherwise the two counts cannot both reach their maximum, because the steps of the second kind also perform some of the $W$ units of work.
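The counting argument can be checked on a small example. The sketch below simulates a greedy scheduler on unit-cost tasks. The fork-join graph `example_dag`, the unit-cost assumption, and the function names are all choices made for this illustration.

```python
def greedy_schedule(deps, procs):
    """Simulate a greedy scheduler on unit-cost tasks and return the number of steps.

    deps maps each task to the list of tasks it depends on (must be acyclic).
    """
    done = set()
    steps = 0
    while len(done) < len(deps):
        ready = [t for t in deps
                 if t not in done and all(d in done for d in deps[t])]
        if not ready:
            raise ValueError("dependency cycle")
        # Greedy: run as many ready tasks as there are processors.
        done.update(ready[:procs])
        steps += 1
    return steps

def span(deps):
    """Number of tasks on the longest dependency chain."""
    memo = {}

    def depth(t):
        if t not in memo:
            memo[t] = 1 + max((depth(d) for d in deps[t]), default=0)
        return memo[t]

    return max(depth(t) for t in deps)

# Fork-join graph: a, then b1..b5 in parallel, then z.  W = 7, S = 3.
example_dag = {
    "a": [],
    **{f"b{i}": ["a"] for i in range(1, 6)},
    "z": [f"b{i}" for i in range(1, 6)],
}

print(greedy_schedule(example_dag, 2))  # 5
```

With $P = 2$ the schedule takes 5 steps, which lies between $\max(S, \lceil W/P \rceil) = 4$ and $\lfloor W/P \rfloor + S = 6$.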

Complexity and Recurrence

Notational Convention

When we write the notation on the left, we mean the notation on the right.

$\begin{align*} n = O(n^2) &\to n \in O(n^2)\\ f(n) = g(n) + O(n^2) &\to f(n) \in \{g(n) + h(n) : h(n) \in O(n^2)\}\\ O(n) = O(n^2) &\to O(n) \subseteq O(n^2)\\ \end{align*}$

It is fine to abuse notation in this way.

We also assume that the letters $l, m, n$ are arguments to the cost function and not constants.

Recurrence Conventions

We sometimes drop trivial base cases.

We usually assume that the input size is a power of 2. We typically ignore floors and ceilings because they change the size of the input by at most one.

Finding Closed-Form Solution for Recurrence

There are three methods:

Tree Method
Brick Method
Substitution Method

Tree Method

Method

draw the recursion tree: how many calls there are at each level, and the input size of each call
determine the cost of each node
determine the cost of one concrete level
generalize to the cost of level $i$
sum the costs of all levels
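As a worked instance of these steps, consider the recurrence $W(n) = 2W(n/2) + n$ (chosen here for illustration, taking the additive term to be exactly $n$ , $W(1) = 1$ , and $n$ a power of 2). Level $i$ of the tree has $2^i$ calls, each on an input of size $n/2^i$ and each costing $n/2^i$ , so every level costs $n$ , and there are $\lg n + 1$ levels:

$\begin{align*} W(n) =& \sum_{i = 0}^{\lg n} 2^i \cdot \frac{n}{2^i}\\ =& \sum_{i = 0}^{\lg n} n\\ =& n (\lg n + 1)\\ \in& O(n \lg n)\\ \end{align*}$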

You have learned this method in courses such as 15-150

Brick Method

Brick method is good for geometric growth and decay by a constant factor. It is a special case of tree method.

If it is leaf dominated then it is important to optimize the base case, while if it is root dominated it is important to optimize the calls to other functions used in conjunction with the recursive calls. If it is balanced, then, unfortunately, both need to be optimized.

Introduction

Method: for "root dominated" / "geometric decay"

determine the cost of each level
check that the level costs shrink by at least a constant factor from one level to the next
use a geometric series to bound the sum over all levels

There are three types of dominance:

root dominated: can use Brick Method
leaf dominated: can use Brick Method
balanced: can use Brick Method
other cases: cannot use Brick Method

Summary

Generally, your recurrence looks like this:

$W(n) = \alpha W\left(\frac{n}{\beta}\right) + O\left(f(n)\right)$

The parent layer has work: $f(n)$ . The children layer has work: $\alpha f(\frac{n}{\beta})$ . We compare them to find out whether it is root dominated or leaf dominated or balanced.

For leaf-dominated: there are $\alpha^{\log_\beta n}$ leaves, giving $O((\alpha^{\log_\beta n})\times f(1)) = O(n^{\log_\beta \alpha})$ cost of the tree.

For balanced: level $i$ has $\alpha^{i}$ nodes, each of input size $n/\beta^i$ , giving $O\left(\sum_{i = 0}^{\log_\beta n} \alpha^{i} f(n/\beta^i)\right)$ cost of the tree. Since every level costs about the same as the root, this is $O(f(n) \log n)$ .
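The parent-versus-children comparison can be sketched as a small helper. Comparing the two levels at a single input size is a heuristic that assumes the ratio between consecutive levels stays roughly constant (true for polynomial $f$ ). The name `classify` and the default size are assumptions of this sketch.

```python
import math

def classify(alpha, beta, f, n=2**20):
    """Classify W(n) = alpha * W(n / beta) + f(n) by comparing two adjacent levels."""
    parent = f(n)                    # cost of the root level
    children = alpha * f(n / beta)   # cost of the level below it
    if math.isclose(parent, children):
        return "balanced"
    return "root dominated" if children < parent else "leaf dominated"

print(classify(2, 2, lambda m: m))      # balanced
print(classify(2, 2, lambda m: 1))      # leaf dominated
print(classify(1, 2, lambda m: m))      # root dominated
```

These correspond to $W(n) = 2W(n/2) + n$ , $W(n) = 2W(n/2) + 1$ , and $W(n) = W(n/2) + n$ respectively.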

Hacks:

the parent layer's cost is just what comes after the $+$
the children layer's cost is obtained by substituting the argument of $W(\cdot)$ into what comes after the $+$ and multiplying by the coefficient $\alpha$
if there is no coefficient ( $\alpha = 1$ ) and the additive cost shrinks by a constant factor as the input shrinks (e.g. $+ n$ ), then it is root dominated
if the additive cost is $+ 1$ and there is a coefficient $\alpha > 1$ , then it is leaf dominated

If the recurrence involves $\sqrt{n}$ , as in $T(n) = T(\sqrt{n}) + 1$ , solving directly for the number of layers $k$ is awkward: solving $1=n^{\left(\frac{1}{2}\right)^{k}}$ gives $k = \log_{1/2}\left(\log_{n}1\right)$ , which is undefined because $\log_n 1 = 0$ (repeated square roots never reach $1$ ). Instead, set $n = 2^m$ , define $S(m) = T(2^m)$ , and write everything in terms of $S(\cdot)$ .

Here is an example: we want to solve $S(n) = S(\sqrt{n}) + 1$ . Write $S(2^m) = S(2^{m/2}) + 1$ . Define $W(m) = S(2^m)$ ; then we have the equation $W(m) = W(m/2) + 1$ . We solve to get $W(m) \in O(\log m)$ , therefore we know that $S(2^m) \in O(\log m)$ . Let $n = 2^m \implies m = \log n$ . Substituting, we get $S(n) \in O(\log \log n)$ .

The solution for $T(n) = T(\sqrt{n}) + 1$ is $O(\log \log n)$ (balanced).
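The $\log \log n$ depth can be observed directly by counting how often a square root can be taken. The base case $n \leq 2$ is an assumption of this sketch, needed because repeated square roots never reach $1$ .

```python
import math

def sqrt_depth(n):
    """Number of times n is replaced by its integer square root before n <= 2."""
    steps = 0
    while n > 2:
        n = math.isqrt(n)
        steps += 1
    return steps

# For n = 2^(2^k), each square root halves the exponent, so the depth is k.
for k in range(1, 7):
    print(k, sqrt_depth(2 ** (2 ** k)))
```

For $n = 2^{2^k}$ the depth is exactly $k = \log_2 \log_2 n$ , matching the closed form.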

Root Dominated

In some cases we can assume that, at every level, the cost of a parent is at least a constant factor $\alpha > 1$ greater than the total cost of its children. Then the total cost is dominated by the root. (We use $cost(L_x)$ to denote the cost of the entire level $x$ of the tree.)

That is, when $v$ is a node and $D(v)$ is the set of children of $v$ , we have $cost(v) \geq \alpha \sum_{u \in D(v)} cost(u)$ where $\alpha > 1$ . Summing over a level gives $cost(L_{i+1}) \leq \frac{1}{\alpha} cost(L_i)$ , and therefore

$\begin{align*} cost(L) =& cost(L_0) + cost(L_1) + ... + cost(L_d)\\ \leq& cost(L_0) + \frac{1}{\alpha} cost(L_0) + ... + \frac{1}{\alpha^d} cost(L_0)\\ =& \left(1 + \frac{1}{\alpha} + \frac{1}{\alpha^2} + ... + \frac{1}{\alpha^d}\right) cost(L_0)\\ <& \frac{1}{1 - 1/\alpha} cost(L_0) = \frac{\alpha}{\alpha - 1} cost(L_0)\\ \in& O(cost(L_0))\\ \end{align*}$

Leaf Dominated

Suppose the cost of a vertex $cost(v)$ is always at most $\frac{1}{\alpha}$ times the total cost of its children, for some $\alpha > 1$ , so that each level costs more than the level above it. Writing $D(v)$ for the children of $v$ :

$cost(v) \leq \frac{1}{\alpha} \sum_{u \in D(v)} cost(u)$

In this case, the total cost is bounded by $\frac{\alpha}{\alpha - 1}$ times the total cost of the leaves.

The condition only needs to hold for nodes $v$ with input size greater than some constant $n_0$ ; nodes of size at most $n_0$ are the base cases. The overall cost is then $O(cost(base))$ . We need this constraint because the inequality only relates the recursive levels to one another and does not account for the base cases themselves. To prove the bound formally, we need to add the cost of the base cases.

The overall proof is about the same as for the root-dominated case, but with more technical details: you need to count the exact number of leaves in order to obtain the cost of the leaf level.

Counting leaves for $W(n) = \alpha W(\frac{n}{b}) + ...$

the branching factor is $\alpha$ (an $\alpha$ -ary tree)
there are $\log_{b}n$ levels
there are $\alpha^{\log_{b}n} = n^{\log_b \alpha}$ leaves in the bottom level

Counting leaves can be tricky. Consider $W(n) = W(\frac{n}{2}) + W(\frac{n}{3}) + \sqrt{n}$ . We can count the number of leaves using another recurrence:

$L(n) = \text{\# of leaves} = \begin{cases} 1 & \text{if } n \leq 1\\ L(\frac{n}{2}) + L(\frac{n}{3}) & \text{otherwise}\\ \end{cases}$

Since this is itself a recurrence, and its leaves are not all at the same depth, we solve it with the substitution method described below.

Balanced

For any recurrence, the overall cost is at most the number of levels times the maximum cost of any level.

This gives a good bound in the balanced case because the cost of each level is about the same, so we just multiply the maximum level cost by the number of levels.

Substitution Method

In some leaf-dominated recurrences, not all leaves are at the same level, for example $W(n) = W(n/2) + W(n/3) + \sqrt{n}$ . In this case, we need to guess a function that satisfies $L(n) = L(\frac{n}{2}) + L(\frac{n}{3})$ (the leaf-counting recurrence above), where $L(n)$ denotes the number of leaves for input size $n$ . If we assume (a.k.a. guess) that it has the form $L(n) = n^b$ , we can calculate $b \approx 0.788$ . Therefore $W(n) \in O(n^{0.788})$ .

Base Case: $(\forall b) (L(1) = 1^b = 1)$

Inductive Case:

$\begin{align*} L(n) =& L\left(\frac{n}{2}\right) + L\left(\frac{n}{3}\right)\\ n^b =& \left(\frac{n}{2}\right)^b + \left(\frac{n}{3}\right)^b \quad \text{(by induction hypothesis)}\\ n^b =& n^b\left(\left(\frac{1}{2}\right)^b + \left(\frac{1}{3}\right)^b\right)\\ 1 =& \left(\frac{1}{2}\right)^b + \left(\frac{1}{3}\right)^b\\ b \approx& 0.788\\ \end{align*}$
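The last step has no closed form, so the exponent has to be found numerically. A minimal sketch using bisection follows. The function name, bracket, and tolerance are choices made for this illustration.

```python
def solve_exponent(lo=0.0, hi=1.0, tol=1e-9):
    """Find b in [lo, hi] with (1/2)**b + (1/3)**b = 1 by bisection."""
    def g(b):
        # Strictly decreasing in b; g(0) = 1 > 0 and g(1) = -1/6 < 0.
        return 0.5 ** b + (1 / 3) ** b - 1

    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies at or to the left of mid
    return (lo + hi) / 2

print(round(solve_exponent(), 3))  # 0.788
```

Bisection works here because the left-hand side decreases monotonically in $b$ , so the equation has exactly one root between $0$ and $1$ .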

Note that when guessing the function, the additive term $\sqrt{n}$ does not really matter, since it does not affect the number of leaves.
