# Lecture 009

## Binary Search Tree

Besides insert, delete, and search, we want to support:

• union

• intersection

• difference

• filter

• map

• reduce

Abstract Datatype (ADT): interface definition

```sml
datatype 'a tree = Leaf | Node of 'a tree * 'a * 'a tree
```


Here we only consider storing values on the internal nodes and assume leaves have no values associated with them.

In-order traversal: left subtree, node, right subtree (the keys appear from left to right along the x-axis when you draw the tree diagram)

• BST Invariant: $(\forall (L, k, R) \in T)(L < k < R)$

Pre-order traversal: node, left tree, right tree
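
The two traversal orders can be sketched on a simple tree datatype. This is only an illustration, not part of the lecture code: the names `inorder`, `preorder`, and `example` are assumptions made for this sketch.

```sml
(* Illustrative sketch; these names are not taken from the lecture code. *)
datatype 'a tree = Leaf | Node of 'a tree * 'a * 'a tree

(* In-order: left subtree, then the key, then the right subtree. *)
fun inorder Leaf = []
  | inorder (Node (l, k, r)) = inorder l @ (k :: inorder r)

(* Pre-order: the key first, then the left and right subtrees. *)
fun preorder Leaf = []
  | preorder (Node (l, k, r)) = k :: (preorder l @ preorder r)

(* A BST holding the keys 1, 2, 3 with 2 at the root. *)
val example = Node (Node (Leaf, 1, Leaf), 2, Node (Leaf, 3, Leaf))

val sorted = inorder example      (* [1, 2, 3] *)
val rootFirst = preorder example  (* [2, 1, 3] *)
```

For a tree that satisfies the BST invariant, the in-order traversal lists the keys in sorted order.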

Some basic operations on a binary search tree $T$:

• dom(T): the set of all keys

• size(T): the number of keys

• height(T): the empty tree (a single leaf) has height 0, and a tree with one key has height 1. Equivalently, count the layers of internal nodes without counting the leaf layer.

• depth(N/L): the depth of a node or leaf, counted from the root, with depth(root) = 0

A leaf does not hold a key.
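
As a rough sketch, these operations can be written directly on a plain tree datatype. The function names below are assumptions chosen for illustration. Unlike the parametric code later in these notes, this `size` recomputes the count on every call instead of reading a stored field.

```sml
(* Illustrative sketch; not the lecture's implementation. *)
datatype 'a tree = Leaf | Node of 'a tree * 'a * 'a tree

(* dom: all keys, listed from left to right. *)
fun dom Leaf = []
  | dom (Node (l, k, r)) = dom l @ (k :: dom r)

(* size: number of keys, i.e. number of internal nodes. *)
fun size Leaf = 0
  | size (Node (l, _, r)) = size l + 1 + size r

(* height: a Leaf has height 0, a tree with one key has height 1. *)
fun height Leaf = 0
  | height (Node (l, _, r)) = 1 + Int.max (height l, height r)

(* Root 2 with left child 1 and no right child. *)
val t = Node (Node (Leaf, 1, Leaf), 2, Leaf)
val n = size t    (* 2 *)
val h = height t  (* 2 *)
```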

Perfect Balance: A binary tree is perfectly balanced if it has the minimum possible height $\lceil\log_2 (n+1)\rceil$ for $n$ keys.
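
A short derivation of this bound, using the height convention above (the leaf layer is not counted): a binary tree of height $h$ has at most $2^h - 1$ keys, so

$$
n \le 2^h - 1 \;\Longrightarrow\; h \ge \log_2 (n+1) \;\Longrightarrow\; h \ge \lceil \log_2 (n+1) \rceil .
$$

For example, $n = 7$ keys fit in height $\lceil \log_2 8 \rceil = 3$, while $n = 8$ keys need height $\lceil \log_2 9 \rceil = 4$.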

(Nearly) Balanced: the height is $O(\log n)$ for $n$ keys.

It is impossible to keep a tree perfectly balanced while keeping insertion at $O(\log n)$ work.

There are many balancing schemes for BSTs. Most either try to maintain height balance (the children of a node are about the same height) or weight balance (the children of a node are about the same size):

• AVL: two children of each node differ in height by at most one

• Red-Black: all leaves have a depth that is within a factor of 2 of each other

• Weight Balanced (BB[$\alpha$]): the left and right subtrees of a node of size $n$ each have size at least $\alpha n$, for $0 < \alpha \leq 1 - \frac{1}{\sqrt{2}}$.

• Treaps: associate a random priority with every key and maintain the invariant that keys are stored in heap order with respect to their priorities

• Splay Tree: an amortized data structure that does not guarantee near balance, but instead guarantees that in any sequence of $m$ insert, find, and delete operations, each operation does $O(\log n)$ amortized work.

• Scapegoat Tree: ...

• AA Trees: ...

• Brother Trees: ...

• B Trees: ...

Treaps are well suited to parallel implementations. Some amortized trees are hard to use in a parallel setting.

## Parametric Tree

The word "parametric" means that we can substitute different implementations of joinMid to get different cost behavior for the entire data structure while keeping the rest of the code the same.

### BST Interface Functionality

Datatype of BST

```sml
empty : T
singleton : K -> T
size : T -> N
find : T -> K -> B
delete : (T * K) -> T
insert : (T * K) -> T
union : (T * T) -> T
intersection : (T * T) -> T
difference : (T * T) -> T
split : (T * K) -> (T * B * T)
joinPair : (T * T) -> T
joinM : (T * K * T) -> T
filter : (K -> B) -> T -> T
reduce : (K * K -> K) -> K -> T -> K
```


size: There is a difference between an internal node and a user-facing node. Internal nodes are never exposed to the user; when the user wants to look inside a node, we strip the extra information, such as the size stored in every node.

union: Parallel insert. union is identical to joinPair except that it does not require all keys in $T_1$ to be smaller than all keys in $T_2$, and it removes duplicates between the two trees (each tree must still have unique keys within itself).

intersection: Parallel find. Keeps only the keys that occur in both trees.

difference: Parallel delete. Removes from $T_1$ every key that occurs in $T_2$.

split: Splits the tree into the subtree of keys to the left of the given key and the subtree of keys to its right; the boolean indicates whether the key exists. The exact structure of the trees returned by split can differ from one implementation to another.

joinPair: Assuming all keys in $T_1$ are smaller than all keys in $T_2$, merges the two trees. This is useful for implementing delete. The exact structure of the tree returned by joinPair can differ from one implementation to another: the specification only requires that the resulting tree is a valid BST.

joinM: Assuming $T_1 < k < T_2$, merges the two trees and adds the key $k$. As with joinPair, the exact structure of the tree returned can differ from one implementation to another.

joinMid: Exactly the same as joinM, except that it takes a user's node instead of a tuple. (There is a difference between the user's key and the internal representation of a key. Internally, a treap stores a hashKey, which is the pair (key, hash). joinMid therefore serves as a conversion from the user's key to the internal key.)

Cost specification: let $n$ denote the size of the larger of the two trees and $m$ the size of the smaller, where the size $|t|$ of a tree is the number of keys in it. Then we have:

The Cost Specification for BSTs can be realized by several balanced BST data structures such as Treaps (in expectation), red-black trees (in the worst case), and splay trees (amortized).

// TODO: https://www.diderot.one/courses/136/books/579/chapter/8100#segment-646501

### Balanced Tree Interface

The balanced tree interface contains the following types and functions:

```sml
(* 15-210 Fall 2022 *)
(* Parametric implementation of binary search trees *)
(* Live-coded in Lecture 13, Wed Oct 12, 2022 *)
(* Starting from live code from Lecture 12 *)
(* Starting from live code from Lecture 11 *)
(* Frank Pfenning + students *)

signature KEY = sig
  type t
  val compare : t * t -> order
end

(* we want the user to know that type t = int *)
structure IntKey :> KEY where type t = int = struct
  type t = int
  val compare = Int.compare
end

signature ParmBST = sig
  structure K : KEY (* parameter *)
  type T (* abstract *)
  (* invariant: Node (L, k, R) then L < k < R *)
  datatype E = Leaf | Node of T * K.t * T
  val size : T -> int
  val expose : T -> E  (* exposes structure, not internal info *)
  val joinMid : E -> T (* whether and how it rebalances depends on the implementation *)
end

functor Simple (structure Key : KEY) :> ParmBST where type K.t = Key.t =
struct
  structure K = Key
  datatype T = TLeaf | TNode of T * K.t * int * T
  datatype E = Leaf | Node of T * K.t * T

  fun size TLeaf = 0
    | size (TNode (_, _, s, _)) = s

  fun expose T = case T of
      TLeaf => Leaf
    | TNode (L, k, _, R) => Node (L, k, R)

  (* no rebalancing here: this is just one simple implementation *)
  fun joinMid E = case E of
      Leaf => TLeaf
    | Node (L, k, R) => TNode (L, k, size L + size R + 1, R)
end

signature BST = sig
  structure K : KEY

  type T  (* abstract *)
  val empty : T
  val singleton : K.t -> T
  val joinM : T * K.t * T -> T
  val find : T -> K.t -> bool

  val insert : T -> K.t -> T
  val delete : T -> K.t -> T

  val split : T -> K.t -> T * bool * T

  val filter : (K.t -> bool) -> T -> T
  val reduce : (K.t * K.t -> K.t) -> K.t -> T -> K.t
  (* curried, to match the implementation below *)
  val union : T -> T -> T
  val intersection : T -> T -> T
  val difference : T -> T -> T
  val toList : T -> K.t list

  (* more... *)
end

functor Bst (structure P : ParmBST) :> BST where type K.t = P.K.t = struct
  structure K = P.K
  type T = P.T
  val empty = P.joinMid P.Leaf
  fun singleton k = P.joinMid (P.Node (P.Leaf, k, P.Leaf))
  fun joinM (L, k, R) = P.joinMid (P.Node (L, k, R))

  (* split and joinPair are defined first because find, insert, and delete use them *)
  fun split T k = case P.expose T of
      P.Leaf => (empty, false, empty)
    | P.Node (L, k', R) =>
        (case K.compare (k, k') of
           LESS =>
             let
               val (LL, b, LR) = split L k  (* LL < k < LR *)
             in
               (LL, b, P.joinMid (P.Node (LR, k', R)))
             end
         | EQUAL => (L, true, R)
         | GREATER =>
             let
               val (RL, b, RR) = split R k
             in
               (P.joinMid (P.Node (L, k', RL)), b, RR)
             end)

  (* val joinPair : T * T -> T might be an internal function *)
  (* joinPair (L, R) requires L < R *)
  fun joinPair (L, R) = case P.expose L of
      P.Leaf => R
    | P.Node (LL, kL, LR) =>          (* LL < kL < LR < R *)
        let
          val T = joinPair (LR, R)    (* LL < kL < T *)
        in
          P.joinMid (P.Node (LL, kL, T))
        end

  fun find T k =
    let
      val (_, b, _) = split T k
    in
      b
    end

  (*
  fun find T k = case P.expose T of
      P.Leaf => false
    | P.Node (L, k', R) => case K.compare (k, k') of
          LESS => find L k
        | EQUAL => true
        | GREATER => find R k
  *)

  fun insert T k =
    let
      (* split handles the BST invariant *)
      val (L, _, R) = split T k
    in
      P.joinMid (P.Node (L, k, R))
    end

  fun delete T k =
    let
      (* split handles the BST invariant *)
      val (L, _, R) = split T k
    in
      joinPair (L, R)
    end

  fun filter p T = case P.expose T of
      P.Leaf => empty
    | P.Node (L, k, R) =>
        if p k then
          P.joinMid (P.Node (filter p L, k, filter p R)) (* in parallel! *)
        else
          joinPair (filter p L, filter p R)              (* in parallel! *)

  fun reduce f I T = case P.expose T of
      P.Leaf => I
    | P.Node (L, k, R) => f (reduce f I L, f (k, reduce f I R)) (* in parallel! *)

  fun union S T = case (P.expose S, P.expose T) of
      (P.Leaf, _) => T
    | (_, P.Leaf) => S
    | (P.Node (SL, Sk, SR), _) =>       (* SL < Sk < SR *)
        let
          (* Note that the key Sk might exist in both trees
             but will only be placed in the result once,
             because the split operation will not include Sk.
             Therefore all duplicates are removed.
          *)
          val (TL, _, TR) = split T Sk   (* TL < Sk < TR *)
        in
          (* union SL TL < Sk < union SR TR *)
          P.joinMid (P.Node (union SL TL, Sk, union SR TR)) (* in parallel! *)
        end

  fun intersection S T = case (P.expose S, P.expose T) of
      (P.Leaf, _) => empty
    | (_, P.Leaf) => empty
    | (P.Node (SL, Sk, SR), _) =>
        let
          val (TL, b, TR) = split T Sk   (* TL < Sk < TR *)
        in
          if b then
            P.joinMid (P.Node (intersection SL TL, Sk, intersection SR TR))
          else
            joinPair (intersection SL TL, intersection SR TR)
        end (* in parallel! *)

  fun difference S T = case (P.expose S, P.expose T) of
      (P.Leaf, _) => empty
    | (_, P.Leaf) => S                   (* nothing to remove from S *)
    | (P.Node (SL, Sk, SR), _) =>
        let
          val (TL, b, TR) = split T Sk   (* TL < Sk < TR *)
        in
          if b then
            joinPair (difference SL TL, difference SR TR)
          else
            P.joinMid (P.Node (difference SL TL, Sk, difference SR TR))
        end (* in parallel! *)

  (* improve complexity? *)
  fun toList T = case P.expose T of
      P.Leaf => []
    | P.Node (L, k, R) => toList L @ [k] @ toList R

end

(* TreapCore and IntHashKey are defined in the Treaps section of these notes *)
structure Treap = TreapCore (structure HashKey = IntHashKey)
structure Bst = Bst (structure P = Treap)

(*
structure Simple = Simple (structure Key = IntKey)
structure Bst = Bst (structure P = Simple)
*)

structure Test = struct
  open Bst
  fun test () =
    let
      val T0 = empty
      val T1 = insert T0 5
      val T2 = insert T1 2
      val T3 = insert T2 17
      val T4 = delete T3 5
      val T5 = delete T4 13
      val T6 = insert T5 ~1
      val T7 = filter (fn x => x > 0) T6
      val T8 = union T3 T7
    in
      toList T8  (* the keys 2, 5, 17 in sorted order *)
    end
end
```
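
The `(* improve complexity? *)` note on `toList` can be addressed with an accumulator. The sketch below is one possible answer, written on a plain tree datatype so that it is self-contained; the names `toListSlow` and `go` are assumptions made for this illustration, and the lecture may have had a different improvement in mind.

```sml
(* Illustrative sketch on a plain tree, not on the abstract type P.T. *)
datatype 'a tree = Leaf | Node of 'a tree * 'a * 'a tree

(* Mirrors the @-based version: every append re-copies the left result. *)
fun toListSlow Leaf = []
  | toListSlow (Node (l, k, r)) = toListSlow l @ [k] @ toListSlow r

(* go t acc conses the keys of t, in order, onto the front of acc.
   Each key is consed exactly once, so the total work is linear. *)
fun toList t =
  let
    fun go Leaf acc = acc
      | go (Node (l, k, r)) acc = go l (k :: go r acc)
  in
    go t []
  end
```

The accumulator version does $O(n)$ work, but it is inherently sequential, whereas the two recursive calls of the `@`-based version could run in parallel.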


To keep the tree balanced (for treaps, to maintain the heap invariant), we use joinMid or joinM.

To maintain the BST invariant, we use split together with joinM or joinPair on every insert and delete.

To implement size efficiently, we store a size in every node: the number of keys in the subtree rooted at that node. This size information is updated by joinMid.

To implement tables and dictionaries, we store a value in every node. Implementing sets does not require values.

## Treaps

You can view a treap as an implementation of joinMid.

```sml
datatype T = TLeaf | TNode of T * K * Z * T
```


Treaps: a specific implementation of the parametric interface that realizes the binary search tree ADT through joinMid. It uses a randomized priority function $p : K \to \mathbb{Z}$.

BST Invariant: for every Node(L, k, R), we have $(\forall l \in L)(l < k) \land (\forall r \in R)(k < r)$ (Sorted Key From Left To Right)

Heap Invariant: for every Node(L, k, R), we have $(\forall x \in L \cup R)(p(k) > p(x))$. (Largest Priority Goes To Root)

We usually assume that the priorities are unique unless stated otherwise. (This is not necessary for the algorithm, but it simplifies the analysis.)
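
To make the two invariants concrete, here is a small checker. It is only a sketch under stated assumptions: keys and priorities are both `int`, priorities are unique, and the names `keys`, `prio`, `isBST`, `isHeap`, and `isTreap` are made up for this illustration.

```sml
(* Illustrative sketch; each node stores (key, priority). *)
datatype treap = TLeaf | TNode of treap * (int * int) * treap

(* All keys of a tree, from left to right. *)
fun keys TLeaf = []
  | keys (TNode (l, (k, _), r)) = keys l @ (k :: keys r)

(* Priority of the root, if the tree has one. *)
fun prio TLeaf = NONE
  | prio (TNode (_, (_, p), _)) = SOME p

(* BST invariant: every key on the left is smaller than k,
   every key on the right is larger, at every node. *)
fun isBST TLeaf = true
  | isBST (TNode (l, (k, _), r)) =
      List.all (fn x => x < k) (keys l) andalso
      List.all (fn x => k < x) (keys r) andalso
      isBST l andalso isBST r

(* Heap invariant: comparing each node with the roots of its children
   is enough, because the property then holds transitively. *)
fun isHeap TLeaf = true
  | isHeap (TNode (l, (_, p), r)) =
      let
        fun below t = case prio t of NONE => true | SOME q => p > q
      in
        below l andalso below r andalso isHeap l andalso isHeap r
      end

fun isTreap t = isBST t andalso isHeap t

(* good satisfies both invariants; bad breaks the heap invariant (60 > 50). *)
val good = TNode (TNode (TLeaf, (1, 30), TLeaf), (2, 50), TNode (TLeaf, (4, 40), TLeaf))
val bad  = TNode (TNode (TLeaf, (1, 60), TLeaf), (2, 50), TLeaf)
```

This checker favors clarity over efficiency: `isBST` recomputes the key lists at every node.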

Uniqueness: for any set of keys together with a unique assignment of priorities, there is exactly one tree structure that satisfies the treap properties.

Corollary: for $n$ keys there are $n!$ possible relative orderings of the priorities, and each ordering determines exactly one treap (different orderings may produce the same tree shape).
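
The uniqueness property can be illustrated by building a treap directly from its definition: the entry with the largest priority must be the root, and the remaining entries are partitioned by key. The sketch below assumes `int` keys with unique `int` priorities; `maxPrio` and `build` are hypothetical helper names, and this is not the lecture's joinMid-based implementation.

```sml
(* Illustrative sketch; each node stores (key, priority). *)
datatype treap = TLeaf | TNode of treap * (int * int) * treap

(* The entry with the largest priority among x and the list. *)
fun maxPrio (x, []) = x
  | maxPrio ((k, p), (k', p') :: rest) =
      if p' > p then maxPrio ((k', p'), rest) else maxPrio ((k, p), rest)

(* The largest priority goes to the root (heap invariant); smaller keys go
   left and larger keys go right (BST invariant). No other choice is possible. *)
fun build [] = TLeaf
  | build (x :: xs) =
      let
        val entries = x :: xs
        val (k, p) = maxPrio (x, xs)
        val left = List.filter (fn (k', _) => k' < k) entries
        val right = List.filter (fn (k', _) => k' > k) entries
      in
        TNode (build left, (k, p), build right)
      end

(* The same entries in two different orders give the same tree. *)
val a = build [(1, 30), (2, 50), (3, 10), (4, 40)]
val b = build [(4, 40), (3, 10), (2, 50), (1, 30)]
val same = (a = b)  (* true *)
```

In both cases the root is key 2 (priority 50), with key 1 on the left and key 4 above key 3 on the right.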

If priorities are selected randomly, then the tree is nearly balanced, with height $O(\log n)$, with high probability.

The quicksort algorithm generates treaps: the pivot picked at each step becomes the root of the corresponding subtree.

```sml
qsTree a =
  if |a| = 0 then TLeaf
  else let
    k = pick a random key in a
    p(k) = next largest priority
    L = {x in a | x < k}
    R = {x in a | x > k}
    (L', R') = (qsTree L) || (qsTree R)
  in
    TNode (L', k, p(k), R')
  end
```

Because the resulting treap is isomorphic to the recursion tree of randomized quicksort, we can prove that the height of the treap is $O(\log n)$ with high probability.

### Implementing Treap

To turn the above into an efficient tree, we simply use a treap to implement joinMid.

```sml
signature HASH_KEY =
sig
  include KEY
  val hash : t -> int         (* not for crypto! *)
  val hashMin : int
end

structure IntHashKey :> HASH_KEY where type t = int =
struct
  type t = int
  val compare = Int.compare

  (* Knuth multiplicative hashing for p = 31, word size 32 *)
  (* The "hash function" I used in lecture was *bad* *)
  fun hash k =
    let
      val k32 = Word32.fromInt k
      (* 0wx9E3779B9 = 2654435769, written as a word literal so that it
         does not depend on the size of int *)
      val knuth : Word32.word = 0wx9E3779B9
      val h32 = Word32.>> (Word32.* (k32, knuth), Word.fromInt 1)
    in
      Word32.toInt h32
    end

  val hashMin = 0 (* priority of a leaf *)
end

functor TreapCore (structure HashKey : HASH_KEY) :> ParmBST where type K.t = HashKey.t =
struct
  structure K = HashKey

  (* LEFT, (key, priority), size, RIGHT *)
  datatype T = TLeaf | TNode of T * (K.t * int) * int * T
  datatype E = Leaf | Node of T * K.t * T

  fun size TLeaf = 0
    | size (TNode (_, _, s, _)) = s

  fun prior TLeaf = K.hashMin
    | prior (TNode (_, (_, p), _, _)) = p

  fun makeNode (L, (k, p), R) = TNode (L, (k, p), size L + size R + 1, R)

  (* eventually, the result will have to satisfy the heap property *)
  fun joinM (L, (k, p), R) = (* requires L < k < R *)
    let
      val Lp = prior L
      val Rp = prior R
    in
      if p >= Lp andalso p >= Rp then
        makeNode (L, (k, p), R)
      else if Lp > Rp then
        let
          (* L is not a leaf here, since Lp > Rp >= hashMin *)
          val TNode (LL, (Lk, Lp), _, LR) = L
        in
          makeNode (LL, (Lk, Lp), joinM (LR, (k, p), R))
        end
      else (* R is not a leaf, Rp >= Lp *)
        let
          val TNode (RL, (Rk, Rp), _, RR) = R
        in
          makeNode (joinM (L, (k, p), RL), (Rk, Rp), RR)
        end
    end

  fun expose T = case T of
      TLeaf => Leaf
    | TNode (L, (k, _), _, R) => Node (L, k, R)

  (* requires E to be valid, i.e. to satisfy the BST invariant *)
  fun joinMid E = case E of
      Leaf => TLeaf
    | Node (L, k, R) => joinM (L, (k, K.hash k), R)
end
```


Costs:

• join: The cost is bounded by the heights of the two treaps. Therefore it is $O(\log |T_1| + \log |T_2|) = O(\log (|T_1| + |T_2|))$ with high probability.

• split: The cost of each recursive call in split is constant, so the overall cost is proportional to the height of the tree, which is $O(\log |T|)$ with high probability.

| Operation | Average | Worst case |
|---|---|---|
| Space | $O(n)$ | $O(n)$ |
| Search | $O(\log n)$ | $O(n)$ |
| Insert | $O(\log n)$ | $O(n)$ |
| Delete | $O(\log n)$ | $O(n)$ |

## Augmented Binary Search Tree

// TODO: https://www.diderot.one/courses/136/books/579/chapter/8094
