Besides insert, delete, and search, we want to add: union, intersection, difference, filter, map, and reduce.
Abstract Datatype (ADT): interface definition
datatype 'a tree = Leaf | Node of 'a tree * 'a * 'a tree
Here we only consider storing values on the internal nodes and assume leaves have no values associated with them.
In-order traversal: left tree, node, right tree (left to right along the x-axis when you draw the tree diagram)
Pre-order traversal: node, left tree, right tree
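The two traversal orders can be sketched directly on the tree datatype. This is a minimal illustration; the function names `inorder` and `preorder` and the example tree `t` are chosen here and are not part of the lecture code.

```sml
(* A binary tree with values only on internal nodes. *)
datatype 'a tree = Leaf | Node of 'a tree * 'a * 'a tree

(* In-order: left subtree, node, right subtree. *)
fun inorder Leaf = []
  | inorder (Node (l, x, r)) = inorder l @ (x :: inorder r)

(* Pre-order: node, left subtree, right subtree. *)
fun preorder Leaf = []
  | preorder (Node (l, x, r)) = x :: (preorder l @ preorder r)

(* Example BST holding the keys 1, 2, 3 with 2 at the root. *)
val t = Node (Node (Leaf, 1, Leaf), 2, Node (Leaf, 3, Leaf))
val sorted = inorder t      (* [1, 2, 3] *)
val rootFirst = preorder t  (* [2, 1, 3] *)
```

On a binary search tree the in-order traversal lists the keys in sorted order, which is why it matches reading the diagram from left to right.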
Some basic operations on a binary search tree T:
dom(T): all the keys
size(T): number of keys
height(T): the empty tree (a single Leaf) has height 0, and a tree holding one key has height 1. Equivalently, count the layers of internal nodes without counting the Leaf layer.
depth(N/L): depth of a node or leaf, counted from the root, with depth(root) = 0
A Leaf does not have a key.
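As a small sketch of these definitions, assuming the convention that height counts layers of internal nodes (so the empty tree has height 0), size and height can be computed recursively. The names and the example tree are illustrative only.

```sml
datatype 'a tree = Leaf | Node of 'a tree * 'a * 'a tree

(* size: number of keys; a Leaf holds no key. *)
fun size Leaf = 0
  | size (Node (l, _, r)) = size l + 1 + size r

(* height: layers of internal nodes, so the empty tree has height 0. *)
fun height Leaf = 0
  | height (Node (l, _, r)) = 1 + Int.max (height l, height r)

(* A two-key tree: 2 at the root with 1 as its left child. *)
val t = Node (Node (Leaf, 1, Leaf), 2, Leaf)
val n = size t    (* 2 *)
val h = height t  (* 2 *)
```

Computing size this way takes linear work, which is why the implementations later in these notes cache the size in every node.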
Perfect Balance: A binary tree is perfectly balanced if it has the minimum possible height $\lceil \log (n+1) \rceil$ for $n$ keys.
(Nearly) Balanced: the height is in $O(\log n)$ for $n$ keys.
It is impossible to maintain a perfectly balanced tree while keeping insertion at $O(\log n)$ work.
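The minimum height in the definition of perfect balance follows from counting how many keys fit into a tree of height $h$. The derivation below is a sketch that assumes logarithms are base 2 and that height counts layers of internal nodes:

```latex
\begin{align*}
n &\le \sum_{i=0}^{h-1} 2^{i} = 2^{h} - 1
  && \text{(layer $i$ holds at most $2^{i}$ keys)} \\
\Rightarrow\quad h &\ge \log_2 (n+1) \\
\Rightarrow\quad h_{\min} &= \lceil \log_2 (n+1) \rceil
  && \text{($h$ is an integer)}
\end{align*}
```

For example, $n = 7$ keys fit exactly into height 3, while $n = 8$ keys force height 4.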
There are many balancing schemes for BSTs. Most either try to maintain height balance (the children of a node are about the same height) or weight balance (the children of a node are about the same size):
AVL: two children of each node differ in height by at most one
Red-Black: all leaves have a depth that is within a factor of 2 of each other
Weight Balanced (BB[$\alpha$]): the left and right subtrees of a node of size $n$ each have size at least $\alpha n$, for $0 < \alpha \leq 1 - \frac{1}{\sqrt{2}}$.
Treaps: associate a random priority with every key and maintain the invariant that keys are stored in heap order with respect to their priorities
Splay Tree: an amortized data structure that does not guarantee near balance, but instead guarantees that in any sequence of $m$ insert, find, and delete operations, each does $O(\log n)$ amortized work.
Scapegoat Tree: ...
AA Trees: ...
Brother Trees: ...
B Trees: ...
Treaps work well in parallel settings. Some of the amortized trees are hard to parallelize.
The word "parametric" means that we can substitute different implementations of joinMid to get different cost behavior for the entire data structure while keeping the rest of the code the same.
Datatype of BST
empty : T
singleton : K -> T
size : T -> N
find : T -> K -> B
delete : (T * K) -> T
insert : (T * K) -> T
union : (T * T) -> T
intersection : (T * T) -> T
difference : (T * T) -> T
split : (T * K) -> (T * B * T)
joinPair : (T * T) -> T
joinM : (T * K * T) -> T
filter : (K -> bool) -> T -> T
reduce : (K * K -> K) -> K -> T -> K
size: There is a difference between an internal node and a user node. An internal node is never exposed to the user; when the user wants to look inside a node, we strip the extra information, such as the size stored in every node.
union: Parallel insert. union is identical to joinPair, except that it does not require all keys in $T_1$ to be smaller than the keys in $T_2$, and it removes duplicates between the two trees (each tree must still have unique keys within itself).
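intersection: Parallel find (keeps the keys of one tree that also occur in the other). The parallel counterpart of delete is difference.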
split: Splits the tree into the tree of keys to the left of the given key and the tree of keys to its right; the boolean indicates whether the key exists. The exact structure of the trees returned by split can differ from one implementation to another.
joinPair: Assuming all keys in $T_1$ are smaller than all keys in $T_2$, merge the two trees. This is useful for implementing delete. The exact structure of the tree returned by joinPair can differ from one implementation to another: the specification only requires that the resulting tree is a valid BST.
joinM: Assuming $T_1 < k < T_2$, merge the two trees and add the key $k$. As with joinPair, the exact structure of the tree returned can differ from one implementation to another.
joinMid: exactly the same as joinM, except that it takes a user's node instead of a tuple. (There is a difference between the user's key and the internal representation of a key. Internally, the Treap uses a hashKey, which is (key, hash), so joinMid also serves as the conversion from the user's key to the internal key.)
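The specifications of split, joinPair, and joinM can be checked against a simple reference model. The sketch below is not the tree implementation: it assumes a set is represented as a strictly increasing `int list`, which is a stand-in chosen here only to make the expected results visible.

```sml
(* Reference model: a set is a strictly increasing int list. *)

(* split returns the keys below k, whether k is present, and the keys above k. *)
fun split (xs : int list, k : int) =
  (List.filter (fn x => x < k) xs,
   List.exists (fn x => x = k) xs,
   List.filter (fn x => x > k) xs)

(* joinPair requires every key of l to be smaller than every key of r. *)
fun joinPair (l : int list, r : int list) = l @ r

(* joinM requires l < k < r. *)
fun joinM (l : int list, k : int, r : int list) = l @ (k :: r)

(* delete 5: split around the key, then join the two sides. *)
val (l, found, r) = split ([1, 3, 5, 7], 5)  (* ([1, 3], true, [7]) *)
val deleted = joinPair (l, r)                (* [1, 3, 7] *)

(* insert 4: split around the key, then join with the key in the middle. *)
val (l2, _, r2) = split ([1, 3, 7], 4)
val inserted = joinM (l2, 4, r2)             (* [1, 3, 4, 7] *)
```

Any tree implementation of the interface should agree with this model on which keys end up in each result, even though the shapes of the resulting trees may differ.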
Cost specification: let $n$ denote the larger and $m$ the smaller of the sizes of the two trees, where the size $|t|$ of a tree $t$ is the number of keys in it. Then we have:
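The bounds themselves are not reproduced in these notes. As an assumption based on the standard analysis of join-based set operations (not taken from the text above), they are usually stated in the following form:

```latex
\begin{array}{lll}
\text{Operation} & \text{Work} & \text{Span} \\
\hline
\texttt{find},\ \texttt{insert},\ \texttt{delete},\ \texttt{split}\ \text{on } t
  & O(\log |t|) & O(\log |t|) \\
\texttt{joinM},\ \texttt{joinPair}\ \text{on } t_1, t_2
  & O(\log (|t_1| + |t_2|)) & O(\log (|t_1| + |t_2|)) \\
\texttt{union},\ \texttt{intersection},\ \texttt{difference}
  & O\!\left(m \log\left(1 + \tfrac{n}{m}\right)\right) & O(\log n)
\end{array}
```

When $m = 1$ the bulk operations cost $O(\log n)$ work, matching a single insert or delete, and when $m = n$ they cost $O(n)$ work, matching a merge of two sorted sequences.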
The Cost Specification for BSTs can be realized by several balanced BST data structures such as Treaps (in expectation), red-black trees (in the worst case), and splay trees (amortized).
// TODO: https://www.diderot.one/courses/136/books/579/chapter/8100#segment-646501
The interface of a balanced tree contains the following types and functions:
(* 15-210 Fall 2022 *)
(* Parametric implementation of binary search trees *)
(* Live-coded in Lecture 13, Wed Oct 12, 2022 *)
(* Starting from live code from Lecture 12 *)
(* Starting from live code from Lecture 11 *)
(* Frank Pfenning + students *)
signature KEY = sig
type t
val compare : t * t -> order
end
(* we want user to know type t = int *)
structure IntKey :> KEY where type t = int = struct
type t = int
val compare = Int.compare
end
signature ParmBST = sig
structure K : KEY (* parameter *)
type T (* abstract *)
(* invariant: Node (L, k, R) then L < k < R *)
datatype E = Leaf | Node of T * K.t * T
val size : T -> int
val expose : T -> E (* exposes structure, not internal info *)
val joinMid : E -> T (* whether and how it rebalances depends on the implementation *)
end
functor Simple (structure Key : KEY) :> ParmBST where type K.t = Key.t =
struct
structure K = Key
datatype T = TLeaf | TNode of T * K.t * int * T
datatype E = Leaf | Node of T * K.t * T
fun size TLeaf = 0
| size (TNode (L, k, s, R)) = s
fun expose T = case T of
TLeaf => Leaf
| TNode (L, k, s, R) => Node (L, k, R)
(* no rebalancing here: this is just one simple implementation *)
fun joinMid E = case E of
Leaf => TLeaf
| Node (L, k, R) => TNode (L, k, size L + size R + 1, R)
end
signature BST = sig
structure K : KEY
type T (* abstract *)
val empty : T
val singleton : K.t -> T
val joinM : T * K.t * T -> T
val find : T -> K.t -> bool
val insert : T -> K.t -> T
val delete : T -> K.t -> T
val split : T -> K.t -> T * bool * T
val filter : (K.t -> bool) -> T -> T
val reduce : (K.t * K.t -> K.t) -> K.t -> T -> K.t
(* curried, to match the implementations and the test below *)
val union : T -> T -> T
val intersection : T -> T -> T
val difference : T -> T -> T
val toList : T -> K.t list
(* more... *)
end
functor Bst (structure P : ParmBST) :> BST where type K.t = P.K.t = struct
structure K = P.K
type T = P.T
val empty = P.joinMid (P.Leaf)
fun singleton k = P.joinMid (P.Node (P.Leaf, k, P.Leaf))
fun joinM (L, k, R) = P.joinMid (P.Node (L, k, R))
(* split and joinPair are defined first because find, insert, and delete use them *)
fun split T k = case P.expose T of
    P.Leaf => (empty, false, empty)
  | P.Node (L, k', R) => case K.compare (k, k') of
      LESS => let
          val (LL, b, LR) = split L k (* LL < k < LR *)
        in
          (LL, b, P.joinMid(P.Node(LR, k', R)))
        end
    | EQUAL => (L, true, R)
    | GREATER => let
          val (RL, b, RR) = split R k (* RL < k < RR *)
        in
          (P.joinMid(P.Node(L, k', RL)), b, RR)
        end
(* val joinPair : T * T -> T might be an internal function *)
(* joinPair (L, R) requires L < R *)
fun joinPair (L, R) = case P.expose L of
    P.Leaf => R
  | P.Node (LL, kL, LR) => (* LL < kL < LR < R *)
      let
        val T = joinPair (LR, R) (* LL < kL < T *)
      in
        P.joinMid(P.Node(LL, kL, T))
      end
fun find T k =
  let
    val (_, b, _) = split T k
  in
    b
  end
(*
fun find T k = case P.expose T of
    P.Leaf => false
  | P.Node(L, k', R) => case K.compare (k, k') of
      LESS => find L k
    | EQUAL => true
    | GREATER => find R k
*)
fun insert T k =
  let
    (* handles BST invariant *)
    val (L, _, R) = split T k
  in
    P.joinMid (P.Node(L, k, R))
  end
fun delete T k =
  let
    (* handles BST invariant *)
    val (L, _, R) = split T k
  in
    joinPair (L, R)
  end
fun filter p T = case P.expose T of
P.Leaf => empty
| P.Node(L, k, R) =>
if p k then
P.joinMid(P.Node(filter p L, k, filter p R)) (* in parallel! *)
else
joinPair(filter p L, filter p R) (* in parallel! *)
fun reduce f I T = case P.expose T of
P.Leaf => I
| P.Node(L, k, R) => f ((reduce f I L), f(k, reduce f I R)) (* in parallel! *)
fun union S T = case (P.expose S, P.expose T) of
(P.Leaf, _) => T
| (_, P.Leaf) => S
| (P.Node(SL, Sk, SR), _) => (* SL < Sk < SR *)
let
(* Note that the key Sk might exist in both trees
but will only be placed in the result once,
because the split operation does not include Sk.
Therefore all duplicates are removed.
*)
val (TL, b, TR) = split T Sk (* TL < Sk < TR *)
in
P.joinMid(P.Node(union SL TL, Sk, union SR TR))
end (* in parallel! *)
(* union SL TL < Sk < union SR TR *)
fun intersection S T = case (P.expose S, P.expose T) of
(P.Leaf, _) => empty
| (_, P.Leaf) => empty
| (P.Node(SL, Sk, SR), _) =>
let
val (TL, b, TR) = split T Sk (* TL < Sk < TR *)
in
if b then
P.joinMid(P.Node(intersection SL TL, Sk, intersection SR TR))
else
joinPair(intersection SL TL, intersection SR TR)
end (* in parallel! *)
fun difference S T = case (P.expose S, P.expose T) of
(P.Leaf, _) => empty
| (_, P.Leaf) => S (* nothing to remove: S minus the empty set is S *)
| (P.Node(SL, Sk, SR), _) =>
let
val (TL, b, TR) = split T Sk (* TL < Sk < TR *)
in
if b then
joinPair(difference SL TL, difference SR TR)
else
P.joinMid(P.Node(difference SL TL, Sk, difference SR TR))
end (* in parallel! *)
(* improve complexity? *)
fun toList T = case P.expose T of
P.Leaf => []
| P.Node(L, k, R) => toList L @ [k] @ toList R
end
(* TreapCore and IntHashKey are defined in the Treap section below and must be loaded first *)
structure Treap = TreapCore (structure HashKey = IntHashKey)
structure Bst = Bst (structure P = Treap)
(*
structure Simple = Simple (structure Key = IntKey)
structure Bst = Bst (structure P = Simple)
*)
structure Test = struct
open Bst
fun test () = let
val T0 = empty
val T1 = insert T0 5
val T2 = insert T1 2
val T3 = insert T2 17
val T4 = delete T3 5
val T5 = delete T4 13
val T6 = insert T5 ~1
val T7 = filter (fn x => x > 0) T6
val T8 = union T3 T7
in
toList T8
end
end
To implement a balanced tree (heap invariant), we use joinMid or joinM.
To implement the BST invariant, we use split on every insert and delete, followed by joinM (for insert) or joinPair (for delete).
To implement size efficiently, we store size in every node, denoting the size of the subtree rooted at that node. This size information is updated whenever joinMid builds a node.
To implement tables and dictionaries, we store a value in every node. Implementing sets does not require a value.
You can view a treap as an implementation of joinMid.
datatype T = TLeaf | TNode of T * K * Z * T
Treaps: a specific parametric implementation of the binary search tree ADT via joinMid. It uses a randomized priority function $p : K \to \mathbb{Z}$.
BST Invariant: for every Node(L, k, R), we have $(\forall l \in L)(l < k) \land (\forall r \in R)(k < r)$. (Keys are sorted from left to right.)
Heap Invariant: for every Node(L, k, R), we have $(\forall x \in L \cup R)(p(k) > p(x))$. (The largest priority goes to the root.)
We usually assume that the priorities are unique unless stated otherwise. (Not necessary for the algorithm, but simplify analysis.)
Uniqueness: for any set of keys together with a unique assignment of priorities, there is exactly one tree structure that satisfies the Treap properties.
Corollary: since there are $n!$ orderings of the priorities of $n$ keys and each ordering determines exactly one treap, the shape of a treap depends only on the relative order of the priorities (several orderings can produce the same shape).
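The uniqueness property can be illustrated by building a treap directly from its definition. The sketch below is not the lecture's implementation: `build` and the simplified `treap` datatype (no size field) are introduced here, and it assumes distinct keys and distinct priorities.

```sml
(* Simplified treap: left subtree, (key, priority), right subtree. *)
datatype treap = TLeaf | TNode of treap * (int * int) * treap

(* The pair with the largest priority is forced to be the root (heap
   invariant), and every other key is forced to its left or right
   (BST invariant), so the result does not depend on the input order. *)
fun build [] = TLeaf
  | build (first :: rest) =
      let
        val pairs = first :: rest
        (* find the pair with the largest priority *)
        val (k, p) =
          List.foldl (fn ((k', p'), (kBest, pBest)) =>
                        if p' > pBest then (k', p') else (kBest, pBest))
                     first rest
        val smaller = List.filter (fn (k', _) => k' < k) pairs
        val larger = List.filter (fn (k', _) => k' > k) pairs
      in
        TNode (build smaller, (k, p), build larger)
      end

val a = build [(1, 10), (2, 30), (3, 20)]
val b = build [(3, 20), (1, 10), (2, 30)]  (* same pairs, different order *)
val same = (a = b)                         (* true *)
```

This construction does quadratic work in the worst case and is meant only to show why the structure is determined by the keys and priorities; the joinM-based code is what keeps operations efficient.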
If priorities are selected randomly, then the tree is nearly balanced, with height $O(\log n)$, with high probability.
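The Quicksort algorithm generates Treaps: the recursion tree of quicksort with random pivots has the same shape as a treap in which earlier pivots receive larger priorities.
```sml
(* pseudocode: || means "run both calls in parallel" *)
qsTree a =
  if |a| = 0 then TLeaf
  else let
    k = pick a random key in a
    p(k) = next largest priority
    L = {x in a | x < k}
    R = {x in a | x > k}
    (L', R') = (qsTree L) || (qsTree R)
  in
    TNode (L', k, p(k), R')
  end
```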
Through this isomorphism with quicksort, we can prove that the height of the Treap is $O(\log n)$ with high probability.
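One way to make the isomorphism concrete is the expected depth of a key, which follows from the same counting argument used for quicksort. This is a sketch that assumes unique, uniformly random priorities; $x_i$ denotes the key of rank $i$ and $H_k$ the $k$-th harmonic number (notation introduced here). The high-probability bound on the height needs an additional tail argument that is not shown.

```latex
\begin{align*}
\Pr[x_j \text{ is an ancestor of } x_i] &= \frac{1}{|i-j|+1}
  && \text{($x_j$ must have the largest priority among the keys of rank between $i$ and $j$)} \\
\mathbb{E}[\mathrm{depth}(x_i)] &= \sum_{j \ne i} \frac{1}{|i-j|+1}
  = H_i + H_{n-i+1} - 2 \le 2 \ln n
\end{align*}
```

The same sum counts the expected number of pivots that a given key is compared against in randomized quicksort.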
To turn the above into an efficient tree, we simply use a Treap to implement joinMid:
signature HASH_KEY =
sig
include KEY
val hash : t -> int (* not for crypto! *)
val hashMin : int
end
structure IntHashKey :> HASH_KEY where type t = int =
struct
type t = int
val compare = Int.compare
(* Knuth multiplicative hashing for p = 31, word size 32 *)
(* The "hash function" I used in lecture was *bad* *)
fun hash k =
let val k32 = Word32.fromInt k
val knuth : Word32.word = 0wx9E3779B9 (* = 2654435769, written as a word literal so it does not overflow a 32-bit Int *)
val h32 = Word32.>>(Word32.*(k32, knuth), 0w1)
in Word32.toInt h32 end
val hashMin = 0 (* to implement leaf *)
end
functor TreapCore (structure HashKey : HASH_KEY) :> ParmBST where type K.t = HashKey.t =
struct
structure K = HashKey
(* LEFT, (key, priority), size, RIGHT *)
datatype T = TLeaf | TNode of T * (K.t * int) * int * T
datatype E = Leaf | Node of T * K.t * T
fun size TLeaf = 0
| size (TNode (L, (k,p), s, R)) = s
fun prior TLeaf = K.hashMin
| prior (TNode (L, (k,p), s, R)) = p
fun makeNode (L, (k,p), R) = TNode(L, (k,p), size L + size R + 1, R)
(* eventually, result will have to satisfy heap properties *)
fun joinM (L, (k,p), R) = (* require L < k < R, *)
let
val Lp = prior L
val Rp = prior R
in
if
p >= Lp andalso p >= Rp
then
makeNode(L, (k,p), R)
else
if
Lp > Rp
then
let
val TNode(LL, (Lk,Lp), _, LR) = L
in
makeNode(LL, (Lk,Lp), joinM (LR, (k,p), R))
end
else (* R not a leaf, Rp >= Lp *)
let
val TNode(RL, (Rk,Rp), _, RR) = R
in
makeNode(joinM (L, (k,p), RL), (Rk,Rp), RR)
end
end
fun expose T = case T of
TLeaf => Leaf
| TNode (L, (k,p), s, R) => Node (L, k, R)
(* require E to be valid, satisfy BST invariant *)
fun joinMid E = case E of
Leaf => TLeaf
| Node (L, k, R) => joinM (L, (k, K.hash k), R)
end
Costs:
join: The cost is bounded by the sum of the heights of the two treaps (at most twice their maximum), since each recursive call descends one level in one of them. Therefore the cost is $O(\log |T_1| + \log |T_2|) \subseteq O(\log (|T_1| + |T_2|))$ with high probability.
split: The cost of each recursive call in split is constant, and the overall cost is the height of the tree, which is $O(\log |T|)$ with high probability.
Algorithm | Average | Worst case |
---|---|---|
Space | O(n) | O(n) |
Search | O(log n) | O(n) |
Insert | O(log n) | O(n) |
Delete | O(log n) | O(n) |
// TODO: https://www.diderot.one/courses/136/books/579/chapter/8094