Lecture 011 - Graphs, PFS, BFS, DFS

Graphs

Undirected Graph: an undirected graph can be represented as a directed graph in which each edge is replaced by two directed edges, one in each direction.

Transpose: flip the direction of every edge.
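A minimal sketch of both definitions, assuming a graph is given as a set of (from, to) pairs. The function names symmetrize and transpose are chosen here for illustration and are not from the lecture.

```python
def symmetrize(edges):
    """Represent undirected edges as directed edges in both directions."""
    return {(u, v) for u, v in edges} | {(v, u) for u, v in edges}


def transpose(edges):
    """Flip the direction of every directed edge."""
    return {(v, u) for u, v in edges}


undirected = {(1, 2), (2, 3)}
directed = symmetrize(undirected)   # contains both (1, 2) and (2, 1)
flipped = transpose({(1, 2)})       # contains only (2, 1)
```

Note that the transpose of a symmetrized graph is the same edge set, which is why transposing only matters for directed graphs.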

Example: enumerable graph representation

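Representation of Visited Set: because the visited set is used sequentially (single-threaded), we can represent it with an ephemeral sequence, for which update and nth take constant work and span.

For MST, only the relative order of edge weights matters. So adding a constant to every edge weight, multiplying every weight by a positive constant, or raising every (non-negative) weight to a positive power yields the same MST. For shortest paths, only multiplying every edge weight by a positive constant is safe: adding a constant penalizes paths with more edges, and taking powers changes how weights sum along a path.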
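A small worked check of the shortest-path claim, assuming a hypothetical graph with a two-edge route s -> a -> t (weights 1 and 1) and a direct edge s -> t (weight 3). The weights and names are made up for illustration.

```python
def path_weight(path, weights):
    """Total weight of a path given as a list of edges."""
    return sum(weights[e] for e in path)


weights = {("s", "a"): 1, ("a", "t"): 1, ("s", "t"): 3}
two_hop = [("s", "a"), ("a", "t")]
direct = [("s", "t")]

# Adding 2 to every edge: the two-hop path pays the constant twice.
shifted = {e: w + 2 for e, w in weights.items()}
# Multiplying every edge by 2: every path weight doubles, so order is kept.
scaled = {e: w * 2 for e, w in weights.items()}

two_hop_wins_originally = path_weight(two_hop, weights) < path_weight(direct, weights)
two_hop_wins_after_shift = path_weight(two_hop, shifted) < path_weight(direct, shifted)
two_hop_wins_after_scale = path_weight(two_hop, scaled) < path_weight(direct, scaled)
```

Originally the two-hop path costs 2 against 3. After the shift it costs 6 against 5, so the shortest path changes; after scaling it costs 4 against 6, so it does not.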

Ephemeral and Single-Threaded Sequences

Persistent Data Structure: a data structure whose operations all produce new data without modifying their inputs. Think of the const keyword in C++, where all inputs are immutable.

safe for parallelism, since no data can be modified and therefore only concurrent reads can happen
update and inject require $\Omega(|a|)$ work with an array sequence implementation or $\Omega(\log |a|)$ with a tree sequence implementation, so they are slow.

Ephemeral Data Structure: a data structure whose operations either modify the original structure in place without returning anything, or return a new structure but invalidate the original.

a sequential algorithm can use ephemeral data structures safely.
you should not reuse a data structure after it has been passed to an ephemeral data structure's operation.

Ephemeral Sequences: for a sequence of length $n$

update: $O(1)$ work and span
inject: $O(n)$ work, $O(\log n)$ span
ninject: $O(n)$ work, $O(1)$ span

STSequence is persistent, but it uses benign effects internally. How is that possible? The cost bounds above hold only if you use the latest version of the data structure. The cost bounds change when you access an earlier version.

Graph Search

There are Priority-First Search (PFS), Breadth-First Search (BFS), and Depth-First Search (DFS).

Source: starting vertex of search

Out-neighbors: the out-neighbors of vertex $v$ in graph $G$ are $N_G^+(v)$.

Frontier Set: the set of unvisited out-neighbors $N_{G}^+(X) - X$, where $X$ is the visited set.

Generic graph search from a single source: initialize the frontier $F$ with the source vertex $s$. When a vertex is visited, add it to $X$. $U$ is a (possibly singleton) subset of $F$, chosen according to the specific algorithm.

graphSearch(G, s) =
  let
    explore X F =
      if |F| = 0 then X
      else let
        choose U in F (* choose a non-empty subset U of the frontier *)
        visit U
        X = append(X, U) (* add U to the visited set *)
        F = N_G^+(X) - X (* recompute the frontier *)
      in
        explore X F
      end
  in
    explore {} {s}
  end

Note that the above algorithm visits only the vertices reachable from $s$; vertices with no path from $s$ are never visited.

The graph search above is generic: depending on how you choose $U$, we get BFS, DFS, PFS, and so on. Because $U$ may be a set of several vertices, BFS can visit a whole frontier in parallel, but DFS must be sequential.

Graph Reachability Problem: for a graph $G = (V, E)$ and a vertex $v \in V$, return all vertices $U \subseteq V$ that are reachable from $v$. Graph search solves reachability.

For an undirected graph, graph reachability is the same as finding the connected component that contains $v$. This algorithm is sequential, but we can solve the problem in parallel using graph contraction.
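A sketch of the generic search in Python, assuming the graph is a dict from each vertex to a list of its out-neighbors. The parameter choose is a hypothetical stand-in for "choose U in F": it must return a non-empty subset of the frontier. Like the pseudocode, this version recomputes the frontier from the whole visited set each round, which is simple but not efficient.

```python
def graph_search(graph, source, choose):
    """Return the set of vertices visited when searching from source."""
    visited = set()
    frontier = {source}
    while frontier:
        batch = choose(frontier)          # non-empty subset U of F
        visited |= batch                  # X = X + U
        # F = N+(X) - X: unvisited out-neighbors of visited vertices
        frontier = {v for u in visited for v in graph.get(u, ())} - visited
    return visited


example = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}
whole_frontier = graph_search(example, "a", lambda F: set(F))   # BFS-like
one_at_a_time = graph_search(example, "a", lambda F: {min(F)})  # one vertex per round
```

Both choices visit the same set of vertices; "e" is never visited because it is not reachable from "a".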

Priority-First Search (PFS)

Used to implement Breadth-First Search, Dijkstra's algorithm and Prim's algorithm.

Options for picking the set of vertices $U$ to visit:

the highest priority vertex, breaking ties arbitrarily
all highest priority vertices, or
all vertices close to being the highest priority, perhaps the top $k$ (this is beam search)

PFS is a greedy algorithm

Breadth-First Search (BFS)

BFS can be used for:

finding shortest (unweighted) paths
determining whether a graph is bipartite
bounding the diameter of an undirected graph
partitioning graphs
finding augmenting paths in the Ford-Fulkerson method (this variant is the Edmonds-Karp algorithm)

Distance: the distance $\delta_G(s, v)$ from $s$ to $v$ is the length of a shortest path from $s$ to $v$.

Here is a sequential BFS:

BFSReach (G = (V, E)) s = let
  explore X F i =
    if |F| = 0 then (X, i)
    else let
      (u, j) = argmin_{(u, k) in F} (k) (* choose the frontier vertex u with the smallest depth j *)
      X = append(X, u) (* mark vertex u as visited *)
      F = remove(F, (u, j))
      F = append(F, (v, j+1) | v in N_G^+(u) and v not in X and (v, _) not in F) (* add the unvisited out-neighbors of u to the frontier *)
    in explore X F j end
  in explore {} [(s, 0)] 0 end

Unlike DFS, to keep the BFS data structure simple (i.e. to merge the visited set $X$ and the frontier $F$), we need a queue that supports push and pop in $O(1)$. A plain FIFO queue suffices, because vertices are pushed in non-decreasing order of depth.

Cost: since we do at most $|V|$ pushes and check whether a neighbor is visited at most $|E|$ times, the sequential BFS takes $O(|V| + |E|)$ work.
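A sketch of sequential BFS in Python, assuming the graph is a dict from each vertex to a list of its out-neighbors. It uses collections.deque as the $O(1)$ push/pop queue and a single dict dist that serves as both the visited set and the depth record; this merged representation is a choice made here, not necessarily the lecture's.

```python
from collections import deque


def bfs_distances(graph, source):
    """Return a dict mapping each reachable vertex to its distance from source."""
    dist = {source: 0}            # doubles as the visited set
    queue = deque([source])
    while queue:
        u = queue.popleft()       # vertex with the smallest depth
        for v in graph.get(u, ()):
            if v not in dist:     # each edge is checked once
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


example = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
distances = bfs_distances(example, "a")
```

Each vertex is pushed at most once and each edge is examined once, which matches the $O(|V| + |E|)$ bound.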

Parallel BFS:

BFSReach (G = (V, E), s)= let
  explore X F i =
    if |F| = 0 then (X, i - 1)
    else let
      X = append(X, F) (* visit the whole frontier at once *)
      F = N_G^+(F) - X (* next frontier: unvisited out-neighbors of F *)
    in explore X F (i + 1) end
  in explore {} {s} 0 end

There is no difference for directed graphs. In both the directed and undirected case, when we visit vertex $v$ we do not add the parent of $v$ or $v$ itself to $F$, because both are already in $X$.

Cost: the algorithm requires $O(m \log n)$ work and $O(d \log^2 n)$ span, where $d$ is the largest distance from the source to any reachable vertex.
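A sketch of the level-by-level structure of parallel BFS, assuming the graph is a dict from each vertex to a list of its out-neighbors. Python executes this sequentially; the set comprehension marks the step that a parallel implementation would perform over the whole frontier at once. Returning the list of levels is an addition made here for illustration.

```python
def bfs_levels(graph, source):
    """Return (visited, levels), where levels[i] is the set of vertices at distance i."""
    visited = set()
    frontier = {source}
    levels = []
    while frontier:
        levels.append(frontier)
        visited |= frontier                                  # X = X + F
        # F = N+(F) - X, conceptually computed in parallel over F
        frontier = {v for u in frontier for v in graph.get(u, ())} - visited
    return visited, levels


example = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
reached, layers = bfs_levels(example, "a")
depth = len(layers) - 1   # corresponds to the i - 1 returned by the pseudocode
```

The loop runs once per level, so the number of rounds is $d + 1$, which is where the factor $d$ in the span comes from.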

// TODO: Bounding Cost using Aggregation: https://www.diderot.one/courses/136/books/580/chapter/8115#atom-590479

We can also store a distance in $X$ :

To compute a shortest-path tree (from which we can recover the distance from $s$ to $v$ by following the path in the tree), we can use either of these algorithms:

BFS Tree with Sequence
Unweighted Shortest Paths

BFS Tree with Sequence: computes shortest paths from the source $v$ to every vertex $u$ in graph $G$. The result is a flattened representation of many paths: each vertex stores its parent, so a path is read off as a sequence of vertices from $u$ back to $v$.

visit the frontier layer.
get all the edges between the frontier layer and the next layer, stored as (to, from) pairs.
flatten those edges.
for each target vertex to, select at most one (to, from) edge using Seq.inject and write it to the output.
update the frontier layer to contain the unvisited vertices.
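The steps above can be sketched sequentially in Python, assuming the graph is a dict from each vertex to a list of its out-neighbors. Two assumptions are made here for illustration: the source is recorded as its own parent, and when several edges reach the same vertex the first one wins (with Seq.inject the winner may differ). The helper path_to is hypothetical.

```python
def bfs_tree(graph, source):
    """Return a dict mapping each reachable vertex to its parent in a BFS tree."""
    parent = {source: source}     # assumption: the source is its own parent
    frontier = [source]
    while frontier:
        # Flattened (to, from) edges leaving the current frontier.
        edges = [(v, u) for u in frontier for v in graph.get(u, ())]
        next_frontier = []
        for to, frm in edges:
            if to not in parent:  # keep at most one (to, from) edge per vertex
                parent[to] = frm
                next_frontier.append(to)
        frontier = next_frontier
    return parent


def path_to(parent, v):
    """Follow parent pointers from v back to the source."""
    path = [v]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


example = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
tree = bfs_tree(example, "a")
route = path_to(tree, "d")        # vertices from "d" back to "a"
```

The length of the path minus one is the unweighted distance from the source.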

BFS Tree Trace: $X_i[v] = u$ means vertex $v$'s parent is $u$.

// TODO: https://www.diderot.one/courses/136/books/582/chapter/8151#atom-591585

Depth-First Search (DFS)

DFS can be used to:

find cycles in a graph
topologically sort a DAG
find strongly connected components
test whether a graph is bi-connected

DFSReach (G, s) = let
  DFS (X, v) =
    if v in X then X
    else iterate DFS (append(X, v)) N_G^+(v)
  in DFS ({}, s) end

Generic DFS

Here we present a generic DFS:

In generic DFS, visit, finish, and revisit are user-defined functions that modify a user-defined structure $\Sigma$. $X$ is a boolean sequence denoting whether each vertex has been visited. visit, finish, and revisit must take $O(1)$ work to keep the total work at $O(m+n)$.

If your out-neighbors include your parent, you might revisit your parent. But since the parent is already marked as visited, it will not be expanded again.

DFSAll (G = (V, E)) User = Seq.iterate (DFS G) (User, {}) V
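A simplified sketch of generic DFS and DFSAll in Python, assuming the graph is a dict from each vertex to a list of its out-neighbors. The callback signatures (state, vertex) are an assumption made here; the lecture's versions may take more arguments, such as the parent. The variable state plays the role of $\Sigma$, and a mutable set stands in for the ephemeral visited sequence $X$.

```python
def dfs_all(graph, state, visit, finish, revisit):
    """Run generic DFS from every vertex and return the final user state."""
    visited = set()

    def dfs(state, v):
        if v in visited:
            return revisit(state, v)      # v was seen before: do not expand
        visited.add(v)
        state = visit(state, v)           # first time we reach v
        for w in graph.get(v, ()):
            state = dfs(state, w)
        return finish(state, v)           # all out-neighbors of v are done

    for v in graph:                       # DFSAll: try every vertex as a source
        state = dfs(state, v)
    return state


example = {"a": ["b", "c"], "b": ["c"], "c": []}
log = dfs_all(
    example,
    [],
    visit=lambda s, v: s + [("visit", v)],
    finish=lambda s, v: s + [("finish", v)],
    revisit=lambda s, v: s + [("revisit", v)],
)
```

In this example visit and finish each fire once per vertex, and revisit fires once per edge, counting the calls made by the outer loop in place of the tree edges.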

DFS Numbers and Edge Classification

DFS Numbers: while running the algorithm, we can record for each vertex the time when we first visit it (visit) and the time when we finish visiting it (finish). These two numbers associated with a vertex are called its DFS numbers.

Figure: tree, back, forward, and cross edges, in unordered view.

Figure: tree, back, forward, and cross edges. All edges appear in the original graph; we just classify them. Unlabeled edges in black are tree edges.

In addition, we can classify each edge (from, to) of the original graph as one of:

tree edge: the edge we traverse when we first visit to
back edge: to is already visited and is an ancestor of from
forward edge: to is already visited and is a descendant of from
cross edge: to is already visited and is neither an ancestor nor a descendant of from

In an undirected graph, there are no forward edges and no cross edges. Only tree edges and back edges are possible.

In addition, we can just look at DFS numbers to classify edges:
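A sketch of this classification in Python, assuming the graph is a dict from each vertex to a list of its out-neighbors and that a single clock is incremented at every visit and finish. The standard rule is interval nesting: if the interval of to lies inside the interval of from, the edge is a tree or forward edge; if the interval of from lies inside the interval of to, it is a back edge; if the intervals are disjoint, it is a cross edge. The numbers alone do not distinguish tree edges from forward edges.

```python
def dfs_numbers(graph):
    """Return (visit_time, finish_time) dicts for a DFS over all vertices."""
    visit_time, finish_time = {}, {}
    clock = 0

    def dfs(v):
        nonlocal clock
        visit_time[v] = clock
        clock += 1
        for w in graph.get(v, ()):
            if w not in visit_time:
                dfs(w)
        finish_time[v] = clock
        clock += 1

    for v in graph:
        if v not in visit_time:
            dfs(v)
    return visit_time, finish_time


def classify(u, v, visit_time, finish_time):
    """Classify edge (u, v) using only DFS numbers."""
    if visit_time[u] < visit_time[v] and finish_time[v] < finish_time[u]:
        return "tree or forward"   # v's interval is nested inside u's
    if visit_time[v] <= visit_time[u] and finish_time[u] <= finish_time[v]:
        return "back"              # u's interval is nested inside v's (or u == v)
    return "cross"                 # intervals are disjoint


example = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
vt, ft = dfs_numbers(example)
kinds = {(u, v): classify(u, v, vt, ft) for u in example for v in example[u]}
```

In this example (c, a) is a back edge, (d, c) is a cross edge, and (a, c) is reported as "tree or forward" (it is a forward edge, since c is first reached through b).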

Costs

Costs:

we make $|E| + |V| = m + n$ calls to DFS, where $|E|$ come from calls made by DFS itself and $|V|$ come from calls made by DFSAll.
visit and finish are each called $|V|$ times
revisit is called $(|E| + |V|) - |V| = |E|$ times
we check whether $v$ is in $X$ $|E| + |V|$ times
we insert $v$ into $X$ $|V|$ times
With a tree-based implementation of sets and an adjacency table representation of graphs, each operation takes $O(\log |V|)$ work. Assuming the user-defined functions take $O(\log n)$ work, the total work is $O((m+n)\log n)$.
With ephemeral array sequences for $X$ and adjacency sequences for the graph, each operation takes $O(1)$ work. Assuming the user-defined functions take $O(1)$ work, the total work is $O(m+n)$.

Cycle Detection

From classification, we know:

forward edge: does not create a cycle
cross edge: does not create a cycle
back edge: creates a cycle

To find a cycle, there are two methods:

generate DFS numbers and check for a back edge
check for a back edge directly using a flag and an ancestor stack

Directed Graph Only: Cycle Detection using Generic DFS
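A sketch of the second method for directed graphs, assuming the graph is a dict from each vertex to a list of its out-neighbors. Instead of the generic DFS interface, this version keeps an explicit set of ancestors (the vertices on the current recursion path); an edge into that set is a back edge. The name has_cycle is chosen here for illustration.

```python
def has_cycle(graph):
    """Return True if the directed graph contains a cycle."""
    visited = set()
    ancestors = set()              # vertices on the current DFS path

    def dfs(v):
        visited.add(v)
        ancestors.add(v)
        for w in graph.get(v, ()):
            if w in ancestors:     # back edge: w is an ancestor of v
                return True
            if w not in visited and dfs(w):
                return True
        ancestors.discard(v)       # v is finished, so it is no longer an ancestor
        return False

    return any(dfs(v) for v in graph if v not in visited)


cyclic = has_cycle({"a": ["b"], "b": ["c"], "c": ["a"]})
acyclic = has_cycle({"a": ["b", "c"], "b": ["c"], "c": []})
```

The second graph contains a forward edge (a, c) but no back edge, so no cycle is reported.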

Applying the cycle-detection algorithm directly to undirected graphs does not work, because it would report every edge as a cycle of length two. So we additionally need to make sure that a back edge does not go to the direct ancestor (i.e. the parent).

Topological Sort

The reachability relation of a Directed Acyclic Graph (DAG) obeys:

transitivity: $a \leq b \land b \leq c \implies a \leq c$ (reachability)
antisymmetry: $\lnot (a \leq b \land b \leq a)$ for $a \neq b$; this guarantees there are no cycles

A partial order allows unordered pairs (neither $a \leq b$ nor $b \leq a$ holds); this is the case when neither of two nodes can reach the other.

To reach a total order, you have two choices:

Pick a subset of the vertices on which the order is already total.
Assign an arbitrary order to unordered elements (this is what we use for topological sort).

Topological Sort: a total ordering $v_1, \dots, v_n$ of the vertices of a DAG such that $(\forall (v_i, v_j) \in E)(i < j)$. A graph may have many valid topological sorts.

If we order vertices from highest finish time to lowest finish time (decreasingFinish), we obtain a topologically sorted sequence.
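A sketch of decreasingFinish in Python, assuming the input is a DAG given as a dict from each vertex to a list of its out-neighbors. Rather than storing finish times, it appends each vertex to a list when it finishes and reverses the list at the end, which gives the same order. The example graph is made up for illustration.

```python
def decreasing_finish(graph):
    """Return the vertices ordered from highest to lowest DFS finish time."""
    visited = set()
    finished = []                      # vertices in increasing finish order

    def dfs(v):
        visited.add(v)
        for w in graph.get(v, ()):
            if w not in visited:
                dfs(w)
        finished.append(v)             # v finishes after everything it reaches

    for v in graph:
        if v not in visited:
            dfs(v)
    return finished[::-1]


dag = {
    "shirt": ["tie"],
    "tie": ["jacket"],
    "pants": ["jacket", "shoes"],
    "jacket": [],
    "shoes": [],
}
order = decreasing_finish(dag)
position = {v: i for i, v in enumerate(order)}
is_topological = all(position[u] < position[v] for u in dag for v in dag[u])
```

The check at the end verifies the definition directly: every edge goes from an earlier position to a later one.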

Strongly Connected Components (SCC)

Strongly Connected Graph: a directed graph is strongly connected if all vertices can reach (not necessarily directly connect to) each other.

Strongly Connected Component: a subgraph $H$ of $G$ that is strongly connected and maximal (i.e. adding any more vertices and edges from $G$ to $H$ would break the strong connectivity of $H$).

Component DAG: the graph obtained by contracting each strongly connected component into a single vertex and eliminating duplicate edges between components.

Strongly Connected Components (left) and Component DAG (right)

Strongly Connected Components (SCC) problem: find the strongly connected components of a graph and return them in topological order.

For example, we need to return $[\{c, f, d\}, \{a\}, \{e, b\}]$ for above graph.

Algorithm:

We first order all vertices with decreasingFinish, using DFSAll. If the graph has cycles this is not a topological order of the vertices, but the first vertex of each component appears in the topological order of the component DAG.
We transpose the graph (flip every edge).
We run DFSReach on the transposed graph from each vertex in that order, skipping vertices already visited, until the whole graph is traversed. Each run that visits new vertices yields one component.

SCC (G = (V, E)) = let
  F = decreasingFinish G
  G^T = transpose G

  SCCOne ((X, comps), v) = let
    (X', comp) = DFSReach G^T (X, v)
  in
    (X', comp::comps) (* here: you can check for empty comp if you want *)
  end
in
  iterate SCCOne ({}, []) F
end
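A self-contained sketch of this two-pass algorithm in Python, assuming the graph is a dict from each vertex to a list of its out-neighbors. Unlike the pseudocode, which conses each new component onto the front of the list, this version appends, so the returned list is in topological order of the component DAG. The example graph is made up and is not the one from the figure.

```python
def scc(graph):
    """Return the strongly connected components in topological order."""
    # Pass 1: record vertices in increasing finish order on the original graph.
    visited, finished = set(), []

    def finish_order(v):
        visited.add(v)
        for w in graph.get(v, ()):
            if w not in visited:
                finish_order(w)
        finished.append(v)

    for v in graph:
        if v not in visited:
            finish_order(v)

    # Transpose: flip every edge.
    transposed = {v: [] for v in graph}
    for u in graph:
        for v in graph[u]:
            transposed.setdefault(v, []).append(u)

    # Pass 2: search the transposed graph in decreasing finish order.
    assigned, components = set(), []

    def collect(v, component):
        assigned.add(v)
        component.add(v)
        for w in transposed[v]:
            if w not in assigned:
                collect(w, component)

    for v in reversed(finished):
        if v not in assigned:          # skip vertices already in a component
            component = set()
            collect(v, component)
            components.append(component)
    return components


example = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"]}
components = scc(example)
```

Here the component containing a, b, and c comes first because it has an edge into the component containing d and e.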

Parallel DFS

Making DFS parallel is hard. Depth-first search is known to be P-complete, a class of computations that can be done in polynomial work but are widely believed not to admit a polylogarithmic span algorithm. A detailed discussion of this topic is beyond the scope of this book, but it provides evidence that DFS is unlikely to be highly parallel.

Why DFS is good:

the frontier of BFS is memory-consuming
for some large graphs (that cannot fit into memory, e.g. in robot motion planning), computing the frontier is infeasible
DFS has better data locality
