Threads: thread of execution
spawn
: takes an expression, creates a thread to concurrently execute that expression, and returns the thread
sync
: takes a thread and waits until that thread completes its execution (blocking)
Multithreaded programs rely on a thread scheduler or scheduler for short.
Example 1
let
  t = spawn (lambda (). fib n)
  u = spawn (lambda (). fib 2n)
  ((), ()) = (sync t, sync u)
in
  ()
end
In the example above, the threads compute the desired Fibonacci numbers but have no way of communicating the results back.
Example 2
let
  (r, s) = (ref 0, ref 0)
  t = spawn (lambda (). r <- fib n)
  u = spawn (lambda (). s <- fib 2n)
  ((), ()) = (sync t, sync u)
in
  (!r, !s)
end
Now we can report the results back using references.
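For comparison, here is a rough C++ analogue of Example 2 (a sketch only; the naive fib, the thread lambdas, and the use of std::thread/join are my own illustration of spawn/sync, not part of SPARC):

```cpp
#include <cstdint>
#include <iostream>
#include <thread>

// Naive Fibonacci, standing in for SPARC's `fib`.
std::uint64_t fib(std::uint64_t n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

int main() {
    std::uint64_t n = 20;
    std::uint64_t r = 0, s = 0;              // play the role of the two refs

    std::thread t([&] { r = fib(n); });      // spawn
    std::thread u([&] { s = fib(2 * n); });  // spawn
    t.join();                                // sync
    u.join();                                // sync

    std::cout << r << " " << s << "\n";      // corresponds to (!r, !s)
    return 0;
}
```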
// TODO: https://www.diderot.one/courses/136/books/578/chapter/8073
Performance:
R^*: throughput of the fastest sequential baseline
R_1: throughput on 1 processor
R_P: throughput on P processors
T^*: time of the fastest sequential algorithm
T_P: time of parallel on P processors
Work-Efficiency: R_1 / R^*
Self-Speedup: R_P / R_1, T_1 / T_P
Parallel throughput: R_P
Overhead: T_1 / T^*
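A hypothetical worked example (the numbers are made up purely for illustration): suppose a tuned sequential baseline achieves R^* = 100 Mops/s, and the parallel algorithm achieves R_1 = 50 Mops/s on one processor and R_P = 300 Mops/s on P = 8 processors. Then

$$ \text{work-efficiency} = \frac{R_1}{R^*} = \frac{50}{100} = 0.5, \qquad \text{self-speedup} = \frac{R_P}{R_1} = \frac{300}{50} = 6. $$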
Concurrency Problem: a problem is concurrent if its specification involves multiple things happening at the same time.
Parallelism Solution: an algorithm is parallel if it performs multiple tasks at the same time.
Concurrency is a property of a problem, and parallelism is a property of an implementation or a solution. Both concurrent and non-concurrent problems typically admit parallel algorithms as solutions.
Sequential elision: In SPARC, tuples are sequential, but you can replace (a, b) with (a || b) (pronounced "par") to get parallel performance.
Side effects (mutable state): will break sequential elision.
You cannot assume operations are atomic. The following code will output either 1 or 2.
let x = ref 0
((), ()) = (x <- !x + 1) || (x <- !x + 1)
in
print !x
end
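A rough C++ analogue of this program (my own sketch using std::thread): the two unsynchronized increments race on x just as in the SPARC version.

```cpp
#include <iostream>
#include <thread>

int main() {
    int x = 0;  // plays the role of `ref 0`

    // Each increment is a separate read and write, so depending on the
    // interleaving one of the updates can be lost.
    std::thread a([&] { x = x + 1; });
    std::thread b([&] { x = x + 1; });
    a.join();
    b.join();

    std::cout << x << "\n";  // may print 1 or 2 (this is a data race in C++)
    return 0;
}
```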
Data Race (determinacy race): when multiple threads access the same piece of data (with at least one of them writing), producing a non-deterministic outcome.
Why use mutable state?
- it is impossible to avoid mutable state at the hardware level
- mutable state enables more efficient use of memory
Critical Section: a part of the code that cannot be executed by more than one thread at the same time; it must be executed in mutual exclusion.
Mutual Exclusion Problem (Critical Section Problem): the problem of designing algorithms or protocols that ensure mutual exclusion.
Solving Mutual Exclusion Problem:
Spin Locks: "busy wait" (a while loop checking whether the condition is true) until the critical section is "clear" of other threads (a minimal C++ spin-lock sketch appears after the fetch-and-add code below)
Blocking Locks: when the critical section is clear, the blocked thread receives a signal, allowing it to proceed (the term mutex mostly refers to blocking locks)
Atomic read-modify-write instructions (nonblocking operations): can read and modify the contents of a memory location atomically, allowing a thread to operate safely on shared data. (implemented directly in hardware)
Nonblocking instructions can be used to implement more complex concurrent nonblocking data structures
Compare-and-swap (cas : word ref -> word * word -> word): an atomic read-modify-write instruction. A call cas input (equal, output) atomically performs the following:
- if the contents of input (following the pointer) are equal to equal, it writes output to input; otherwise it leaves input unchanged
- it returns the original contents of input (following the pointer)

Fetch-and-add (ffa: word ref -> word -> word): an atomic read-modify-write instruction that updates a memory location and returns its original contents. A call ffa r delta atomically performs the following (where + is the addition operation on machine words):
let v = !r
r <- !r + delta
in
v
end
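A sketch of these primitives in C++ (assuming std::atomic; the mapping to cas/ffa and the tiny spin lock at the end are my own illustration, not the course's code):

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>

using word = std::uint64_t;

// cas r (eq, out): if *r == eq, write out into r; returns the value r held
// just before the operation, so the caller can tell whether it succeeded.
word cas(std::atomic<word>& r, word eq, word out) {
    word expected = eq;
    r.compare_exchange_strong(expected, out);
    return expected;  // compare_exchange leaves the observed value here
}

// ffa r delta: atomically add delta to *r and return the original contents.
word ffa(std::atomic<word>& r, word delta) {
    return r.fetch_add(delta);
}

// A minimal spin lock built on cas: busy-wait until we are the thread
// that flips the flag from 0 to 1.
struct SpinLock {
    std::atomic<word> locked{0};
    void lock()   { while (cas(locked, 0, 1) != 0) { /* busy wait */ } }
    void unlock() { locked.store(0); }
};

int main() {
    std::atomic<word> r{0};
    std::cout << ffa(r, 5) << "\n";     // prints 0; r is now 5
    std::cout << cas(r, 5, 9) << "\n";  // prints 5 (the cas succeeded); r is now 9
    std::cout << r.load() << "\n";      // prints 9
    return 0;
}
```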
There are two levels of abstraction for cost analysis:
Asymptotic Analysis: the general growth rate of the time an operation requires
Cost Model: machine-based and language-based cost models
We use numeric functions to capture the cost of an algorithm, and we are only interested in the growth rate of these functions.
Asymptotic Dominance: let f(\cdot) and g(\cdot) be two numeric functions. f asymptotically dominates g if there exist c > 0 and n_0 > 0 such that for all n > n_0, g(n) \leq c \cdot f(n) (equivalently, \lim_{n\to \infty} \frac{g(n)}{f(n)} \leq c when the limit exists).
Example: prove f(n) = n asymptotically dominates g(n) = \ln^k n for all k. Solution: using L'Hopital's rule,
$$
\begin{align}
\lim_{n \to \infty} \frac{g(n)}{f(n)} &= \lim_{n \to \infty} \frac{\ln^k n}{n}\\
&= \left(\lim_{n \to \infty}\frac{\ln n}{n^{1/k}}\right)^k\\
&= \left(\lim_{n \to \infty}\frac{1/n}{(1/k)\,n^{1/k - 1}}\right)^k\\
&= \left(\lim_{n \to \infty}\frac{k}{n^{1/k}}\right)^k\\
&= 0
\end{align}
$$
The dominance relation is a preorder: it is reflexive and transitive.
Note that for f(n) = n \sin (n) and g(n) = n \cos(n), neither dominates the other.
Common Costs:
linear: O(n)
sublinear: o(n)
quadratic: O(n^2)
polynomial: O(n^k), for any constant k
superpolynomial: \omega(n^k), for any constant k
logarithmic: O(\lg n)
polylogarithmic: O(\lg^k n), for any constant k
exponential: O(a^n) for a > 1
Machine-based cost models: analyze cost in terms of execution on the machine hardware.
Random Access Machine (RAM) model: a single processor reads and writes memory via instructions
we count each instruction as one unit of cost
easy to translate code to
unrealistic since different instructions can have drastically different runtimes (especially when considering cache collisions)
Parallel Random Access Machine (PRAM): consists of p sequential random access machines sharing the same memory (with processor ids \{0, ..., p-1\})
different processors can execute different instructions
SIMD Model: when all processors execute the same instruction on different data
Very hard to analyze when the number of processors p does not exactly match the input size of the algorithm.
Language-based cost models: analyze cost in terms of the programming language.
Work-Span Model: analyze work and span, which can be defined for any language.
Work W(e) for expressions e
Span S(e) for expressions e
Eval(e) evaluates expression e and returns the result. The notation [v/x]e indicates that all free occurrences of x in expression e are replaced by the value v.
Note that all the rules for work and span are the same except the rule for (e_1 || e_2). In this course, we use the convention that parallelism has to be stated explicitly using ||, even though it would be safe to evaluate the subexpressions of a pure expression such as e_1 + e_2 in parallel.
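For reference, the rules for pairs are typically stated as follows (my reconstruction of the standard rules; the work is identical for the two forms, and only the span differs):

$$
\begin{align}
W((e_1, e_2)) &= 1 + W(e_1) + W(e_2) & S((e_1, e_2)) &= 1 + S(e_1) + S(e_2)\\
W((e_1 || e_2)) &= 1 + W(e_1) + W(e_2) & S((e_1 || e_2)) &= 1 + \max(S(e_1), S(e_2))
\end{align}
$$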
Average Parallelism \bar{P}: work over span, \bar{P} = \frac{W}{S}. This tells us how many processors we can use efficiently.
Work efficient: a parallel algorithm that performs asymptotically the same work as the best sequential algorithm.
Observably Work Efficient: a parallel algorithm that performs similarly to the best sequential algorithm when run on a single core/processor.
We know that
T_P \geq \lceil\frac{W}{P}\rceil because even with all P processors busy at every step (i.e., ignoring dependencies), W units of work take at least \lceil\frac{W}{P}\rceil steps.
T_P \geq S because the span is the longest chain of sequential dependencies, which cannot be shortened no matter how many processors we have.
In addition, for a greedy scheduler, we have the Greedy Scheduling Principle (proved below in the scheduling section): T_P < \lceil\frac{W}{P}\rceil + S, and therefore the parallel time T_P is bracketed as \max\left(\frac{W}{P}, S\right) \leq T_P < \lceil\frac{W}{P}\rceil + S. Observe that when \bar{P} \gg P, the span S is negligible compared to \frac{W}{P}, so both bounds are approximately \frac{W}{P} and T_P \approx \frac{W}{P}. Therefore \bar{P} is a good measurement of parallelism.
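A hypothetical worked example (numbers made up for illustration): suppose W = 10^8, S = 10^4, and P = 10. Then \bar{P} = \frac{W}{S} = 10^4 \gg P, and the bounds give

$$ 10^7 = \max\left(\frac{W}{P}, S\right) \leq T_P < \frac{W}{P} + S = 1.001 \times 10^7, $$

so T_P is essentially \frac{W}{P} and the speedup is essentially P = 10.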
Speedup S_P: the ratio of sequential time to parallel time.
Scheduling itself requires work, but the above analysis does not account for it.
Symbols:
T: number of steps
P: number of processors
W: work
S: span
Scheduling Algorithm: an algorithm for mapping parallel tasks to available processors.
How can we schedule an arbitrary dependency graph? We do not want the scheduler to analyze the whole dependency graph up front, since doing so is itself costly.
Greedy Scheduler: whenever a task is ready and a processor is idle, assign the task to the processor and start running it immediately.
Greedy Scheduling Principle: the total time is bounded by T_P < \lceil\frac{W}{P}\rceil + S (where W is work and S is span).
Proof: T_p < \lceil\frac{W}{P}\rceil + S
With a greedy scheduler, we can build the dependence tree and observe that at each time step at least one of the following two cases holds:
- all processors are busy
- a level of the dependence tree is completed

Now, T_P is the number of time steps. This is bounded by the number of steps in which all processors are busy plus the number of steps that complete a level of the dependence tree. There are at most \lfloor\frac{W}{P}\rfloor steps in which all processors are busy (each such step completes P units of work, and there are only W units in total), and at most S steps that complete a level of the dependence tree.

Therefore we have T_P \leq \lfloor\frac{W}{P}\rfloor + S. Loosening the bound slightly, it follows that T_P < \lceil\frac{W}{P}\rceil + S.
It is fine to abuse notation: when we write the shorthand forms of a recurrence, we mean the corresponding precise asymptotic statements.
We also assume the letters l, m, and n are arguments to the cost function, not constants.
We sometimes drop trivial base cases.
We usually assume the input size is a power of 2. We typically ignore floors and ceilings because they change the size of the input by at most one.
There are three methods:
Tree Method
Brick Method
Substitution Method
Tree Method
You have learned this method in courses such as 15-150
The brick method is good for costs that grow or decay geometrically by a constant factor from level to level. It is a special case of the tree method.
If it is leaf dominated then it is important to optimize the base case, while if it is root dominated it is important to optimize the calls to other functions used in conjunction with the recursive calls. If it is balanced, then, unfortunately, both need to be optimized.
Brick Method: for "root dominated" (geometric decay), "leaf dominated" (geometric growth), and balanced recurrences
There are three types of dominance:
root dominated: can use Brick Method
leaf dominated: can use Brick Method
balanced: can use Brick Method
other cases: cannot use Brick Method
Generally, your recurrence looks like this: W(n) = \alpha\, W\left(\frac{n}{\beta}\right) + f(n).
The parent layer has work: f(n). The children layer has work: \alpha f(\frac{n}{\beta}). We compare them to find out whether it is root dominated or leaf dominated or balanced.
For leaf-dominated: there are \alpha^{\log_\beta n} leaves, giving O((\alpha^{\log_\beta n})\times f(1)) = O(n^{\log_\beta \alpha}) cost of the tree.
For balanced: there are \sum_{i = 0}^{\log_\beta n} \alpha^{i} nodes, giving O\left(\sum_{i = 0}^{\log_\beta n} \alpha^{i} f\left(\frac{n}{\beta^i}\right)\right) cost for the tree.
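As a sanity check, here are three standard recurrences worked with these rules (the particular examples are mine, but the results are the classic ones):

$$
\begin{align}
W(n) &= 2\,W(n/2) + n^2: & 2\left(\tfrac{n}{2}\right)^2 &= \tfrac{n^2}{2} < n^2 &&\Rightarrow \text{root dominated} \Rightarrow W(n) \in O(n^2)\\
W(n) &= 2\,W(n/2) + \sqrt{n}: & 2\sqrt{n/2} &= \sqrt{2}\,\sqrt{n} > \sqrt{n} &&\Rightarrow \text{leaf dominated} \Rightarrow W(n) \in O(n^{\log_2 2}) = O(n)\\
W(n) &= 2\,W(n/2) + n: & 2\left(\tfrac{n}{2}\right) &= n &&\Rightarrow \text{balanced} \Rightarrow W(n) \in O(n \log n)
\end{align}
$$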
Hacks:
the parent layer's cost is just the term after the +
the children layer's cost is obtained by substituting the argument of W(\cdot) into the term after the + and multiplying by the coefficient \alpha
if the recursive call has no coefficient (\alpha = 1) and the problem size is decreasing, then the recurrence is root dominated
if the additive term is a constant (+ 1) and the recursive call has a coefficient \alpha > 1, then the recurrence is leaf dominated
If the recurrence involves \sqrt{n}, you might try to solve for the number of layers k from 1 = n^{\left(\frac{1}{2}\right)^{k}}, which gives k = \log_{1/2}\left(\log_{n}1\right); but this won't get us any further. Instead, set n = 2^m and rewrite the recurrence in terms of a new function of m.
Here is an example: we want to solve S(n) = S(\sqrt{n}) + 1. Write S(2^m) = S(2^{m/2}) + 1 and define W(m) = S(2^m); then W(m) = W(m/2) + 1, which solves to W(m) \in O(\log m). Therefore S(2^m) \in O(\log m). Let n = 2^m \implies m = \log n; substituting, we get S(n) \in O(\log \log n).
The solution for T(n) = T(\sqrt{n}) + 1 is O(\log \log n) (balanced).
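A quick numerical sanity check (my own snippet, not from the course): repeatedly taking square roots until the value drops to 2 takes about \log_2 \log_2 n steps, matching the O(\log \log n) bound.

```cpp
#include <cmath>
#include <initializer_list>
#include <iostream>

// Count how many times we can apply sqrt before n drops to 2 or below;
// this mirrors the recursion depth of T(n) = T(sqrt(n)) + 1.
int depth(double n) {
    int steps = 0;
    while (n > 2.0) {
        n = std::sqrt(n);
        ++steps;
    }
    return steps;
}

int main() {
    for (double n : {16.0, 256.0, 65536.0, 4294967296.0}) {
        std::cout << "n = " << n
                  << ", depth = " << depth(n)
                  << ", log2(log2 n) = " << std::log2(std::log2(n)) << "\n";
    }
    return 0;
}
```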
In some cases, we can assume that for every layer the cost of the parent level is at least a constant factor \alpha > 1 greater than the cost of the children level. Then the total cost is dominated by the root (we use cost(L_x) to denote the cost of the entire level x of the tree): \sum_{x} cost(L_x) \leq \frac{\alpha}{\alpha - 1} \cdot cost(L_0).
i.e., when v is a node and D(v) is the set of children of node v, we have cost(v) \geq \alpha \sum_{u \in D(v)} cost(u) where \alpha > 1.
Conversely (the leaf-dominated case), suppose the cost of a vertex cost(v) is always at most \frac{1}{\alpha} times the total cost of its children children(v), for some \alpha > 1. Then each level down the tree costs at least \alpha times the level above it.
cost(v) \leq \frac{1}{\alpha} \sum_{u \in D(v)} cost(u)
In this case, the total cost is bounded by \frac{\alpha}{\alpha - 1} times the total cost of the leaves.
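The factor \frac{\alpha}{\alpha - 1} in both cases is just a geometric series. For the root-dominated case, for instance, level x + 1 costs at most \frac{1}{\alpha} of level x, so

$$ \sum_{x \geq 0} cost(L_x) \;\leq\; cost(L_0) \sum_{x \geq 0} \left(\frac{1}{\alpha}\right)^x \;=\; \frac{\alpha}{\alpha - 1}\, cost(L_0). $$

The leaf-dominated case is symmetric, summing upward from the leaf level.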
If this leaf-dominated condition holds for all nodes v with input size greater than some constant n_0, the overall cost is O(cost(base)), i.e., dominated by the base level. The reason we need this constraint is that the argument accounts only for the base level of the recursion and not for the base cases themselves; to prove the bound formally, we must also add the cost of the base cases.
The overall proof is about the same as for the root-dominated case, but there are more technical details: you need to consider the exact number of elements in the leaves in order to obtain the cost of the leaf layer.
Counting leaves for W(n) = \alpha W(\frac{n}{b}) + ...:
the branching factor is \alpha (\alpha-ary tree)
there are \log_{b}n levels
there are \alpha^{\log_{b}n} = n^{\log_b \alpha} leaves in the bottom level.
Counting leaves can be tricky. Consider W(n) = W(\frac{n}{2}) + W(\frac{n}{3}) + \sqrt{n}. Then we can count the number of leaves using another recurrence: L(n) = L(\frac{n}{2}) + L(\frac{n}{3}).
Since counting leaves produces yet another recurrence, we continue by solving it with the substitution method.
For any recurrence, the overall cost is at most the number of levels times the maximum cost over all levels.
This gives a good bound in the balanced case, because the cost of each level is about the same: we simply multiply the maximum level cost by the number of levels.
In some leaf-dominated recurrences, not all leaves are at the same level, e.g., W(n) = W(n/2) + W(n/3) + \sqrt{n}. In this case, we need to guess a function satisfying L(n) = L(\frac{n}{2}) + L(\frac{n}{3}) (the leaf-counting recurrence above), where L(n) denotes the number of leaves of the recursion tree on an input of size n. If we assume (a.k.a. guess) it to be of the form L(n) = n^\beta, we can calculate \beta \approx 0.788. Therefore W(n) \in O(n^{0.788}).
Base Case: L(1) = 1^\beta = 1 for any \beta.
Inductive Case: assuming the guess holds for smaller inputs, L(n) = L(\frac{n}{2}) + L(\frac{n}{3}) = \left(\frac{n}{2}\right)^\beta + \left(\frac{n}{3}\right)^\beta = n^\beta \left(\left(\frac{1}{2}\right)^\beta + \left(\frac{1}{3}\right)^\beta\right) = n^\beta, provided \beta satisfies \left(\frac{1}{2}\right)^\beta + \left(\frac{1}{3}\right)^\beta = 1.
Note that when guessing the form of L(n), the additive \sqrt{n} term does not matter: it affects the cost per node but not the number of leaves.
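To check the guessed exponent numerically (my own sketch; the function names are arbitrary), solve \left(\frac{1}{2}\right)^\beta + \left(\frac{1}{3}\right)^\beta = 1 by bisection:

```cpp
#include <cmath>
#include <iostream>

// f(beta) = (1/2)^beta + (1/3)^beta - 1 is decreasing in beta,
// so its unique positive root can be found by bisection.
double f(double beta) {
    return std::pow(0.5, beta) + std::pow(1.0 / 3.0, beta) - 1.0;
}

int main() {
    double lo = 0.0, hi = 2.0;  // f(0) = 1 > 0 and f(2) < 0
    for (int i = 0; i < 60; ++i) {
        double mid = 0.5 * (lo + hi);
        (f(mid) > 0 ? lo : hi) = mid;  // root lies above mid iff f(mid) > 0
    }
    std::cout << "beta ~= " << lo << "\n";  // about 0.788
    return 0;
}
```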