Lecture 002 - Parallelism and Cost Models

Concurrency and Parallelism

Threads

Threads: a thread is a thread of execution, i.e. a sequence of instructions that can be scheduled and run independently.

Multithreaded programs rely on a thread scheduler, or scheduler for short, to map threads onto processors.

Example 1 (SPARC):

let
  t = spawn (lambda (). fib n)
  u = spawn (lambda (). fib (2 * n))
  ((), ()) = (sync t, sync u)
in
  ()
end

In the example above, the threads compute the desired Fibonacci numbers but have no way of communicating the results back.

Example 2 (SPARC):

let
  (r, s) = (ref 0, ref 0)
  t = spawn (lambda (). r <- fib n)
  u = spawn (lambda (). s <- fib (2 * n))
  ((), ()) = (sync t, sync u)
in
  (!r, !s)
end

Now we can report the results back using references.

// TODO: https://www.diderot.one/courses/136/books/578/chapter/8073

Performance:

Concurrency vs. Parallelism

Concurrency (problem): a problem is concurrent if its specification involves multiple things happening at the same time.

Parallelism (solution): an algorithm is parallel if it performs multiple tasks at the same time.

Concurrency is a property of a problem, and parallelism is a property of an implementation or a solution. Both concurrent and non-concurrent problems typically accept parallel algorithms as solutions.

Sequential elision: in SPARC, tuple evaluation (a, b) is sequential, but you can replace (a, b) with (a || b) (pronounced "par") to evaluate the components in parallel. Replacing every || with a plain tuple again (the sequential elision) yields the same result, provided there are no side effects.

Side effects (mutable state) break sequential elision.

You cannot assume operations are atomic. The following code may output either 1 or 2.

let x = ref 0
  ((), ()) = (x <- !x + 1) || (x <- !x + 1)
in
  print (!x)
end
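
To see why both outcomes are possible, here is a minimal sketch (in Python rather than SPARC; the run function and the explicit interleavings are purely illustrative) in which each increment is split into a read step and a write step:

# Two "threads" each increment a shared cell x, but the increment is not atomic:
# it is a read of x followed by a write of the incremented value.
def run(schedule):
    x = 0
    local = {}                     # each thread's privately read value
    for thread, step in schedule:
        if step == "read":
            local[thread] = x
        else:                      # "write"
            x = local[thread] + 1
    return x

# No interleaving: both updates survive.
print(run([("t1", "read"), ("t1", "write"), ("t2", "read"), ("t2", "write")]))  # 2
# Both threads read before either writes: one update is lost.
print(run([("t1", "read"), ("t2", "read"), ("t1", "write"), ("t2", "write")]))  # 1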

Data race (determinacy race): when multiple threads access the same piece of data concurrently (with at least one write), producing a non-deterministic outcome.

Why use mutable state?

  - it is impossible to avoid mutable state at the hardware level
  - mutable state enables more efficient use of memory

Critical section: a part of the code that cannot be executed by more than one thread at the same time; it must execute in mutual exclusion.

Mutual Exclusion Problem (Critical Section Problem): the problem of designing algorithms or protocols that ensure mutual exclusion.

Solving the Mutual Exclusion Problem:

Nonblocking instructions can be used to implement more complex concurrent nonblocking data structures.

Compare-and-swap (cas : word ref -> word * word -> word): an atomic read-modify-write instruction. A call cas r (old, new) atomically performs the following:

  1. check whether the contents of r (following the pointer) are equal to old
  2. if so, write new into r
  3. otherwise, leave r unchanged
  4. return the contents of r that were read (i.e. the original value)
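
To make the semantics concrete, here is a minimal sketch in Python that models a cas cell, with a lock standing in for hardware atomicity (the Word class, atomic_increment, and all other names are illustrative, not part of SPARC):

import threading

class Word:
    """A mutable memory cell; the lock stands in for hardware atomicity."""
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def cas(self, expected, new):
        """Atomically: if the cell equals expected, write new; return the value read."""
        with self._lock:
            old = self.value
            if old == expected:
                self.value = new
            return old

# A cas-based atomic increment: retry until no other thread interferes.
def atomic_increment(cell):
    while True:
        old = cell.value
        if cell.cas(old, old + 1) == old:
            return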

Fetch-and-add (faa : word ref -> word -> word): an atomic read-modify-write instruction that updates a memory location and returns its original contents. A call faa r delta atomically performs the following (where + is addition on machine words):

let v = !r
  r <- !r + delta
in
  v
end
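
Assuming the illustrative Word class from the cas sketch above, faa can be built from cas by retrying until the update succeeds:

def fetch_and_add(cell, delta):
    """Atomically add delta to the cell and return its original contents."""
    while True:
        old = cell.value
        if cell.cas(old, old + delta) == old:   # succeeds only if nobody raced us
            return old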

Algorithm Analysis

There are two levels of abstraction: asymptotic analysis and cost models.

Asymptotic Analysis

We use numeric functions to capture the cost of an algorithm, and we are only interested in the growth rate of these functions.

Asymptotic Dominance: let f(\cdot) and g(\cdot) be two numeric functions. f asymptotically dominates g if there exist c > 0 and n_0 > 0 such that for all n > n_0, g(n) \leq c \cdot f(n) (equivalently, \lim_{n\to \infty} \frac{g(n)}{f(n)} \leq c when the limit exists).

Example: prove f(n) = n asymptotically dominates g(n) = \ln^k n for all k. Solution: using L'Hopital's rule,

$$ \begin{align*} \lim_{n \to \infty} \frac{g(n)}{f(n)} &= \lim_{n \to \infty} \frac{\ln^k n}{n}\\ &= \left(\lim_{n \to \infty}\frac{\ln n}{n^{1/k}}\right)^k\\ &= \left(\lim_{n \to \infty}\frac{1/n}{(1/k)\,n^{1/k - 1}}\right)^k\\ &= \left(\lim_{n \to \infty}\frac{k}{n^{1/k}}\right)^k\\ &= 0 \end{align*} $$

Notations


The dominance relation is a preorder, and in particular it is transitive.

Note that for f(n) = n \sin(n) and g(n) = n \cos(n), neither function dominates the other.

Common Costs:

Cost Model

Machine Based Model

Analyze cost in terms of execution on machine hardware.

Random Access Machine (RAM) model: a single processor reads and writes memory via instructions.

Parallel Random Access Machine (PRAM): consists of p sequential random access machines sharing the same memory, with processor ids \{0, \ldots, p-1\}.

Language Based Model

Analyze cost in terms of the programming language itself.

Work-Span Model: analyze work and span, which can be defined for any language.

SPARC Cost Model: Eval(e) evaluates expression e and returns the result. The notation [v/x]e indicates that all free occurrences of x in expression e are replaced by the value v.

Note that all rules for work and span are the same except for (e_1 || e_2). In this course, we use the convention that parallelism has to be stated explicitly using ||, even though it would be safe to assume the two operands of e_1 + e_2 are evaluated in parallel.
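
As a reminder of the one place where the rules differ, here is a sketch of the standard work and span rules for parallel pairs (the exact constants may differ from the course notes):

\begin{align*} W(e_1 || e_2) &= 1 + W(e_1) + W(e_2)\\ S(e_1 || e_2) &= 1 + \max\left(S(e_1), S(e_2)\right)\\ \end{align*}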

Average parallelism: work divided by span. This tells us how many processors we can use efficiently.

\bar{P} = \frac{W}{S}

Work efficient: a parallel algorithm that performs asymptotically the same work as the best sequential algorithm.

Observably work efficient: a parallel algorithm that performs comparably to the best sequential algorithm when run on a single core/processor.

We know that

\begin{align*} T_p \geq S \land T_p \geq \lceil\frac{W}{P}\rceil \implies T_p \geq \max\left( S, \lceil\frac{W}{P}\rceil \right) \end{align*}

T_p \geq \lceil\frac{W}{P}\rceil because P processors can perform at most P units of work per time step, even when there are no dependencies.

T_p \geq S because the span is the length of the longest chain of dependencies, which must be executed sequentially no matter how many processors we have.

In addition, for greedy scheduler, we have

T_p < \lceil\frac{W}{P}\rceil + S

The proof of this bound is given below, in the Scheduling section.

and therefore

\max \left(\lceil\frac{W}{P}\rceil, S\right) \leq T_p < \lceil\frac{W}{P}\rceil + S
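
For a concrete (made-up) example: if W = 100, S = 10, and P = 4, then \max\left(\lceil\frac{100}{4}\rceil, 10\right) = 25 \leq T_P < \lceil\frac{100}{4}\rceil + 10 = 35, so a greedy schedule finishes in somewhere between 25 and 34 time steps.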

and we also know that the parallel time T_p satisfies

\begin{align*} T_p &< \frac{W}{P} + S\\ &= \frac{W}{P} + \frac{W}{\bar{P}} \tag{by definition of $\bar{P}$}\\ &= \frac{W}{P}\left(1 + \frac{P}{\bar{P}}\right)\\ \end{align*}

Observe that when \bar{P} \gg P, the term \frac{P}{\bar{P}} becomes negligible, so S is small compared to \frac{W}{P} and \max\left(\frac{W}{P}, S\right) \leq T_P \lesssim \frac{W}{P}; the running time is essentially \frac{W}{P}. Therefore \bar{P} is a good measure of parallelism.

Speedup S_p: the ratio of sequential time to parallel time.

S_p = \frac{T_s}{T_p}

Scheduling itself requires work, but the above analysis does not account for that.

Scheduling

Symbols:

Scheduling algorithm: an algorithm for mapping parallel tasks to available processors.

How can we schedule an arbitrary dependency graph? We do not want the scheduler to analyze the whole dependency graph up front, since that is itself costly.

Greedy scheduler: whenever a task is ready and a processor is idle, assign the task to the processor and start running it immediately.

Greedy Scheduling Principle: the total time is bounded as follows (where W is work and S is span):

T_P < \lceil\frac{W}{P}\rceil + S \leq 2 \times \text{Optimal}

Proof: T_p < \lceil\frac{W}{P}\rceil + S

With a greedy scheduler, we can build the dependence tree and observe that at each time step at least one of the following two cases holds:

  - all processors are busy
  - we complete a level of the dependence tree

Now, T_p equals the number of time steps. This is bounded by the number of steps in which all processors are busy plus the number of steps in which we complete a level of the dependence tree. There are at most \lfloor\frac{W}{P}\rfloor steps in which all processors are busy (each such step finishes P units of work), and at most S steps in which we complete a level.

Therefore we have T_p \leq \lfloor\frac{W}{P}\rfloor + S. Loosening the bound slightly, it immediately follows that T_p < \lceil\frac{W}{P}\rceil + S.
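
To see the bound in action, here is a minimal sketch in Python that simulates a greedy scheduler on a small unit-cost task DAG and checks the bound (the DAG, greedy_schedule, and all names are illustrative, not part of the course):

import math

def greedy_schedule(deps, P):
    """Simulate a greedy scheduler: each time step, run up to P ready unit-cost tasks.
    deps maps each task to the set of tasks it depends on."""
    done, time = set(), 0
    while len(done) < len(deps):
        ready = [t for t, d in deps.items() if t not in done and d <= done]
        done.update(ready[:P])     # greedily assign up to P ready tasks to processors
        time += 1
    return time

# A small diamond-shaped DAG: work W = 6 unit-cost tasks, span S = 4 (a -> b -> e -> f).
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"a"}, "e": {"b", "c", "d"}, "f": {"e"}}
W, S, P = 6, 4, 2
T = greedy_schedule(deps, P)
assert max(math.ceil(W / P), S) <= T < math.ceil(W / P) + S
print(T)  # 5, which lies within the greedy bound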

Complexity and Recurrence

Notational Convention

When we use the notations on the left, we mean the notations on the right.

\begin{align*} n = O(n^2) &\to n \in O(n^2)\\ f(n) = g(n) + O(n^2) &\to f(n) \in \{g(n) + h(n) : h(n) \in O(n^2)\}\\ O(n) = O(n^2) &\to O(n) \subseteq O(n^2)\\ \end{align*}

It is fine to abuse notation in this way.

We also assume the letters l, m, n are arguments to the cost functions, not constants.

Recurrence Conventions

We sometimes drop trivial base cases.

We usually assume the input size is a power of 2. We typically ignore floors and ceilings because they change the input size by at most one.

Finding Closed-Form Solution for Recurrence

There are three methods: the tree method, the brick method, and the substitution method.

Tree Method

Method

  1. draw the recursion tree: how many elements are in each recursive call
  2. determine the cost of each element
  3. determine the cost of one layer
  4. determine the cost of a general layer
  5. sum all layers together

You have learned this method in courses such as 15-150.
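
As a quick illustration (a standard textbook example, not taken from the lecture), applying these steps to W(n) = 2 W\left(\frac{n}{2}\right) + c \cdot n: level i has 2^i nodes, each of cost c \cdot \frac{n}{2^i}, so every level costs c \cdot n; there are about \log_2 n levels, giving W(n) \in O(n \log n).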

Brick Method

The brick method is good when the level costs grow or decay geometrically by a constant factor. It is a special case of the tree method.

If it is leaf dominated then it is important to optimize the base case, while if it is root dominated it is important to optimize the calls to other functions used in conjunction with the recursive calls. If it is balanced, then, unfortunately, both need to be optimized.

Introduction

Method: for "root dominated" / "geometric decay"

  1. determine cost of each level
  2. the sum of all level is bounded by something
  3. use Geometric Series to give an upper bound

There are three types of dominance: root dominated, leaf dominated, and balanced.

Summary

Generally, the recurrence looks like this:

W(n) = \alpha W\left(\frac{n}{\beta}\right) + O\left(f(n)\right)

The parent level has work f(n); the children together have work \alpha f\left(\frac{n}{\beta}\right). We compare the two to determine whether the recurrence is root dominated, leaf dominated, or balanced.

For leaf dominated: there are \alpha^{\log_\beta n} leaves, giving a total tree cost of O\left(\alpha^{\log_\beta n} \cdot f(1)\right) = O\left(n^{\log_\beta \alpha}\right).

For balanced: there are \sum_{i = 0}^{\log_\beta n} \alpha^{i} nodes in total, giving a tree cost of O\left(\sum_{i = 0}^{\log_\beta n} \alpha^{i} f\left(\frac{n}{\beta^i}\right)\right).
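
For instance (standard examples, not from the lecture): W(n) = 2 W\left(\frac{n}{2}\right) + O(1) is leaf dominated, giving O(n^{\log_2 2}) = O(n); W(n) = W\left(\frac{n}{2}\right) + O(n) is root dominated, giving O(n); and W(n) = 2 W\left(\frac{n}{2}\right) + O(n) is balanced, giving O(n \log n).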

Hacks:

If the recurrence involves \sqrt{n}, you could try to solve directly for the number of layers k (from the equation n^{\left(\frac{1}{2}\right)^{k}} = \text{constant}), but this gets awkward and does not lead anywhere useful. Instead, set n = 2^m, define S(m) = T(2^m), and write everything in terms of S(\cdot).

Here is an example: we want to solve S(n) = S(\sqrt{n}) + 1. Write S(2^m) = S(2^{m/2}) + 1. Define W(m) = S(2^m); then we have the equation W(m) = W(m/2) + 1. Solving gives W(m) \in O(\log m), therefore S(2^m) \in O(\log m). Let n = 2^m \implies m = \log n. Substituting, we get S(n) \in O(\log \log n).

The solution for T(n) = T(\sqrt{n}) + 1 is O(\log \log n) (balanced).

Root Dominated

In some cases, we can assume that at every level, the cost of a parent is at least a constant factor \alpha > 1 times the total cost of its children. Then the total cost is dominated by the root. (We use cost(L_x) to denote the cost of the entire level x of the tree.)

i.e. when v is a node and children(v) is the set of children of v, we have cost(v) \geq \alpha \sum_{u \in children(v)} cost(u) where \alpha > 1.

\begin{align*} cost(L) &= cost(L_0) + cost(L_1) + \ldots + cost(L_d)\\ &\leq cost(L_0) + \frac{1}{\alpha} cost(L_0) + \ldots + \frac{1}{\alpha^d} cost(L_0)\\ &\leq \left(1 + \frac{1}{\alpha} + \frac{1}{\alpha^2} + \ldots\right) cost(L_0)\\ &\leq \frac{\alpha}{\alpha - 1} cost(L_0)\\ &\in O(cost(L_0))\\ \end{align*}
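
For example (a standard instance, not from the lecture), W(n) = W\left(\frac{n}{2}\right) + n is root dominated with \alpha = 2: the level costs are n, \frac{n}{2}, \frac{n}{4}, \ldots, so the total cost is at most 2n \in O(n) = O(cost(L_0)).
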
Leaf Dominated

Suppose the cost of a vertex cost(v) is always at most \frac{1}{\alpha} times the total cost of its children, for some \alpha > 1. Then each layer down costs at least \alpha times the layer above it.

cost(v) \leq \frac{1}{\alpha} \sum_{u \in children(v)} cost(u)

In this case, the total cost is bounded by \frac{\alpha}{\alpha - 1} times the total cost of the leaves.

We only require this condition to hold for nodes v whose input size is greater than some constant n_0; the overall cost is then O(cost(\text{leaves})). The reason we need this constraint is that the analysis accounts for the bottom level of the recursion but not for the base cases themselves; to prove the bound formally, we also need to add the cost of the base cases.

The overall proof is about the same as for the root-dominated case, but there are more technical details: you need to consider the exact number of elements in the leaves in order to obtain the cost of the leaf layer.

Counting leaves for W(n) = \alpha W(\frac{n}{\beta}) + \ldots

Counting leaves can be tricky. Consider W(n) = W(\frac{n}{2}) + W(\frac{n}{3}) + \sqrt{n}. Then we can count the number of leaves using another recurrence:

L(n) = \text{\# of leaves} = \begin{cases} 1 & \text{if } n \leq 1\\ L(\frac{n}{2}) + L(\frac{n}{3}) & \text{otherwise}\\ \end{cases}

Since this gives us yet another recurrence, we continue with the substitution method.

Balanced

For any recurrence, the overall cost is at most the number of layers times the maximum cost over all layers.

This gives a good bound in the balanced case because the cost of each level is about the same, so we just multiply the maximum level cost by the number of levels.

Substitution Method

In some leaf-dominated recurrences, not all leaves are at the same level: for example W(n) = W(n/2) + W(n/3) + \sqrt{n}. In this case, we need to guess a function that satisfies L(n) = L(\frac{n}{2}) + L(\frac{n}{3}) (the equation above), where L(n) denotes the number of leaves of the recursion tree for input size n. If we assume (a.k.a. guess) it to be of the form L(n) = n^b, we can calculate b \simeq 0.788. Therefore W(n) \in O(n^{0.788}).

Base Case: (\forall b) (L(1) = 1^b = 1)

Inductive Case:

\begin{align*} L(n) &= L\left(\frac{n}{2}\right) + L\left(\frac{n}{3}\right)\\ n^b &= \left(\frac{n}{2}\right)^b + \left(\frac{n}{3}\right)^b \tag{by induction hypothesis}\\ n^b &= n^b\left(\left(\frac{1}{2}\right)^b + \left(\frac{1}{3}\right)^b\right)\\ 1 &= \left(\frac{1}{2}\right)^b + \left(\frac{1}{3}\right)^b\\ b &\approx 0.788\\ \end{align*}

Note that when guessing the function for L(n), the additive \sqrt{n} term does not really matter.
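
As a sanity check on the exponent (illustrative Python, not from the lecture), we can solve \left(\frac{1}{2}\right)^b + \left(\frac{1}{3}\right)^b = 1 numerically:

# Solve (1/2)**b + (1/3)**b = 1 by bisection; the left-hand side is decreasing in b.
def f(b):
    return 0.5 ** b + (1 / 3) ** b - 1

lo, hi = 0.0, 2.0
for _ in range(60):
    mid = (lo + hi) / 2
    if f(mid) > 0:      # still above 1, so the root lies at a larger exponent
        lo = mid
    else:
        hi = mid
print(round((lo + hi) / 2, 3))  # ~0.788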
