Lecture 009 - Treaps

Binary Search Tree

Besides insert, delete, search, we want to add:

union
intersection
difference
filter
map
reduce

Abstract Datatype (ADT): interface definition

datatype 'a tree = Leaf | Node of 'a tree * 'a * 'a tree

Here we only consider storing values on the internal nodes and assume leaves have no values associated with them.

In-order traversal: left subtree, node, right subtree (this visits the keys from left to right along the x-axis when you draw the tree diagram)

BST Invariant: $(\forall (L, k, R) \in T)(L < k < R)$

Pre-order traversal: node, left tree, right tree

Some basic operations on a binary search tree $T$:

dom(T): the set of all keys in T
size(T): the number of keys in T
height(T): the empty tree (a single leaf) has height 0, and a tree with exactly one key has height 1. Equivalently, count the layers of internal nodes and ignore the leaf layer.
depth(N/L): the depth of a node or leaf, counted from the root, with depth(root) = 0

Leaves do not have keys.
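
As a minimal sketch of these definitions, the functions below compute size, height, and the in-order key list on a plain tree with no balancing information. The datatype and the names size, height, inorder, and t are chosen for illustration only, and the sketch takes the convention that the empty tree has height 0.

```sml
(* Illustration only: a plain binary tree with keys on internal nodes. *)
datatype 'a tree = Leaf | Node of 'a tree * 'a * 'a tree

(* number of keys; leaves carry no key *)
fun size Leaf = 0
  | size (Node (l, _, r)) = size l + 1 + size r

(* the empty tree has height 0, a single-key tree has height 1 *)
fun height Leaf = 0
  | height (Node (l, _, r)) = 1 + Int.max (height l, height r)

(* in-order traversal: left subtree, key, right subtree *)
fun inorder Leaf = []
  | inorder (Node (l, k, r)) = inorder l @ (k :: inorder r)

(* a small BST holding the keys 2, 5, 17 *)
val t = Node (Node (Leaf, 2, Leaf), 5, Node (Leaf, 17, Leaf))
```

For a tree that satisfies the BST invariant, inorder returns the keys in sorted order, which gives a simple way to check the invariant on small examples.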

Perfect Balance: A binary tree is perfectly balanced if it has the minimum possible height $\lceil\log (n+1)\rceil$ for $n$ keys.
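
A short derivation of this bound, assuming the logarithm is base 2 and the empty tree has height 0: a binary tree of height $h$ holds at most $2^h - 1$ keys, so a tree with $n$ keys must satisfy

$$n \le 2^h - 1 \iff h \ge \log_2 (n+1),$$

and since $h$ is an integer, the minimum possible height is $\lceil \log_2 (n+1) \rceil$. For example, $n = 7$ keys fit in height $\lceil \log_2 8 \rceil = 3$, while $n = 8$ keys need height $\lceil \log_2 9 \rceil = 4$.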

(Nearly) Balanced: the height is in $O(\log n)$ for $n$ keys

It is impossible to keep a tree perfectly balanced while keeping the work of insertion in $O(\log n)$.

There are many balancing schemes for BSTs. Most either try to maintain height balance (the children of a node are about the same height) or weight balance (the children of a node are about the same size):

AVL: two children of each node differ in height by at most one
Red-Black: all leaves have a depth that is within a factor of 2 of each other
Weight Balanced (BB[$\alpha$]): the left and right subtrees of a node of size $n$ each have size at least $\alpha n$, for $0 < \alpha \leq 1 - \frac{1}{\sqrt{2}}$.
Treaps: associate a random priority with every key and maintain the invariant that keys are stored in heap order with respect to their priorities
Splay Tree: an amortized data structure that does not guarantee near balance, but instead guarantees that in any sequence of $m$ insert, find, and delete operations, each operation does $O(\log n)$ amortized work.
Scapegoat Tree: ...
AA Trees: ...
Brother Trees: ...
B Trees: ...

Treaps are well suited to parallel implementations. Some of the other schemes, in particular the amortized ones, are hard to parallelize.

Parametric Tree

The word "parametric" means that we can substitute different implementations of joinMid to get different cost behavior for the entire data structure while keeping the rest of the code the same.

BST Interface Functionality

Datatype of BST

empty : T
singleton : K -> T
size : T -> N
find : T -> K -> B
delete : (T * K) -> T
insert : (T * K) -> T
union : (T * T) -> T
intersection : (T * T) -> T
difference : (T * T) -> T
split : (T * K) -> (T * B * T)
joinPair : (T * T) -> T
joinM : (T * K * T) -> T
filter : (K -> bool) -> T -> T
reduce : (K * K -> K) -> K -> T -> K

size: There is a difference between an internal node and a user-visible node. Internal nodes are never exposed to the user: when the user inspects a node, we strip extra information such as the size stored in every node.

union: Parallel insert. union is like joinPair, except that it does not require all keys in $T_1$ to be smaller than all keys in $T_2$, and it removes duplicates between the two trees (each tree must still have unique keys within itself).

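intersection: Parallel find. It keeps exactly the keys that occur in both trees. (difference is the parallel delete: it removes from $T_1$ every key that occurs in $T_2$.)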

split: Splits the tree at a key into the tree of smaller keys and the tree of larger keys; the boolean indicates whether the key exists in the tree. The exact structure of the trees returned by split can differ from one implementation to another.

joinPair: Assuming all keys in $T_1$ are smaller than all keys in $T_2$, merges the two trees. This is useful for implementing delete. The exact structure of the tree returned by joinPair can differ from one implementation to another: the specification only requires that the resulting tree is a valid BST.

joinM: Assuming $T_1 < k < T_2$, merges the two trees and adds the key $k$. As with joinPair, the exact structure of the tree returned can differ from one implementation to another.

joinMid: Exactly the same as joinM, except that it takes a user-visible node instead of a tuple. There is a difference between the user's key and the internal representation of a key: internally, a treap uses a hashKey, which is the pair (key, hash). joinMid therefore also serves as the conversion from the user's key to the internal key.

Cost specification: for operations on two trees, $n$ denotes the larger and $m$ the smaller of the two tree sizes, where the size $|t|$ of a tree is the number of keys in the tree.

The Cost Specification for BSTs can be realized by several balanced BST data structures such as Treaps (in expectation), red-black trees (in the worst case), and splay trees (amortized).

// TODO: https://www.diderot.one/courses/136/books/579/chapter/8100#segment-646501

Balanced Tree Interface

The interface of a balanced tree contains the following types and functions:

(* 15-210 Fall 2022 *)
(* Parametric implementation of binary search trees *)
(* Live-coded in Lecture 13, Wed Oct 12, 2022 *)
(* Starting from live code from Lecture 12 *)
(* Starting from live code from Lecture 11 *)
(* Frank Pfenning + students *)

signature KEY = sig
  type t
  val compare : t * t -> order
end

(* we want user to know type t = int *)
structure IntKey :> KEY where type t = int = struct
  type t = int
  val compare = Int.compare
end

signature ParmBST = sig
  structure K : KEY (* parameter *)
  type T (* abstract *)
  (* invariant: Node (L, k, R) then L < k < R *)
  datatype E = Leaf | Node of T * K.t * T
  val size : T -> int
  val expose : T -> E  (* exposes structure, not internal info *)
  val joinMid : E -> T (* rebalance scheme or not depend on implementation *)
end

functor Simple (structure Key : KEY) :> ParmBST where type K.t = Key.t =
struct
  structure K = Key
  datatype T = TLeaf | TNode of T * K.t * int * T
  datatype E = Leaf | Node of T * K.t * T

  fun size TLeaf = 0
    | size (TNode (L, k, s, R)) = s
  fun expose T = case T of
      TLeaf => Leaf
    | TNode (L, k, s, R) => Node (L, k, R)
  (* no rebalancing here: this is just one simple implementation *)
  fun joinMid E = case E of
      Leaf => TLeaf
    | Node (L, k, R) => TNode (L, k, size L + size R + 1, R)
end

signature BST = sig
  structure K : KEY

  type T  (* abstract *)
  val empty : T
  val singleton : K.t -> T
  val joinM : T * K.t * T -> T
  val find : T -> K.t -> bool

  val insert : T -> K.t -> T
  val delete : T -> K.t -> T

  val split : T -> K.t -> T * bool * T

  val filter : (K.t -> bool) -> T -> T
  val reduce : (K.t * K.t -> K.t) -> K.t -> T -> K.t
  (* curried, to match the definitions in functor Bst and the calls in Test *)
  val union : T -> T -> T
  val intersection : T -> T -> T
  val difference : T -> T -> T
  val toList : T -> K.t list

  (* more... *)
end

functor Bst (structure P : ParmBST) :> BST where type K.t = P.K.t = struct
  structure K = P.K
  type T = P.T
  val empty = P.joinMid (P.Leaf)
  fun singleton k = P.joinMid (P.Node (P.Leaf, k, P.Leaf))
  fun joinM (L, k, R) = P.joinMid (P.Node (L, k, R))

  (* split and joinPair come first: SML requires a function
     to be defined before it is used *)
  fun split T k = case P.expose T of
    P.Leaf => (empty, false, empty)
  | P.Node (L, k', R) => case K.compare (k, k') of
      LESS => let
                val (LL, b, LR) = split L k  (* LL < k < LR *)
              in
                (LL, b, P.joinMid(P.Node(LR, k', R)))
              end
    | EQUAL => (L, true, R)
    | GREATER => let
                   val (RL, b, RR) = split R k  (* RL < k < RR *)
                 in
                   (P.joinMid(P.Node(L, k', RL)), b, RR)
                 end

  (* val joinPair : T * T -> T might be an internal function *)
  (* joinPair (L, R) requires L < R *)
  fun joinPair (L, R) = case P.expose L of
    P.Leaf => R
  | P.Node (LL, kL, LR) =>          (* LL < kL < LR < R *)
    let
      val T = joinPair (LR, R)      (* LL < kL < T *)
    in
      P.joinMid(P.Node(LL, kL, T))
    end

  fun find T k =
    let
      val (_, b, _) = split T k
    in
      b
    end

  (* alternative: direct search without rebuilding any nodes
  fun find T k = case P.expose T of
    P.Leaf => false
  | P.Node(L, k', R) => case K.compare (k, k') of
        LESS => find L k
      | EQUAL => true
      | GREATER => find R k
  *)

  fun insert T k =
    let
      (* split handles the BST invariant; joinMid handles balance *)
      val (L, _, R) = split T k
    in
      P.joinMid (P.Node(L, k, R))
    end

  fun delete T k =
    let
      (* split handles the BST invariant; k itself is dropped *)
      val (L, _, R) = split T k
    in
      joinPair (L, R)
    end

  fun filter p T = case P.expose T of
    P.Leaf => empty
  | P.Node(L, k, R) =>
    if p k then
      P.joinMid(P.Node(filter p L, k, filter p R)) (* in parallel! *)
    else
      joinPair(filter p L, filter p R)     (* in parallel! *)

  fun reduce f I T = case P.expose T of
    P.Leaf => I
  | P.Node(L, k, R) => f ((reduce f I L), f(k, reduce f I R)) (* in parallel! *)

  fun union S T = case (P.expose S, P.expose T) of
    (P.Leaf, _) => T
  | (_, P.Leaf) => S
  | (P.Node(SL, Sk, SR), _) =>       (* SL < Sk < SR *)
    let
      (* Note that the key Sk might exists in both trees
         but will only be placed in the result once,
         because the split operation will not include Sk.
         Therefore all duplicate is removed
      *)
      val (TL, b, TR) = split T Sk   (* TL < Sk < TR *)
    in
      P.joinMid(P.Node(union SL TL, Sk, union SR TR))
    end (* in parallel! *)
    (* union SL TL < Sk < union SR TR *)

  fun intersection S T = case (P.expose S, P.expose T) of
    (P.Leaf, _) => empty
  | (_, P.Leaf) => empty
  | (P.Node(SL, Sk, SR), _) =>
    let
      val (TL, b, TR) = split T Sk   (* TL < Sk < TR *)
    in
      if b then
        P.joinMid(P.Node(intersection SL TL, Sk, intersection SR TR))
      else
        joinPair(intersection SL TL, intersection SR TR)
    end (* in parallel! *)

  fun difference S T = case (P.expose S, P.expose T) of
    (P.Leaf, _) => empty
  | (_, P.Leaf) => S                 (* nothing to remove: S - {} = S *)
  | (P.Node(SL, Sk, SR), _) =>
    let
      val (TL, b, TR) = split T Sk   (* TL < Sk < TR *)
    in
      if b then
        joinPair(difference SL TL, difference SR TR)
      else
        P.joinMid(P.Node(difference SL TL, Sk, difference SR TR))
    end (* in parallel! *)

  (* improve complexity? *)
  fun toList T = case P.expose T of
    P.Leaf => []
  | P.Node(L, k, R) => toList L @ [k] @ toList R

end

structure Treap = TreapCore (structure HashKey = IntHashKey)
structure Bst = Bst (structure P = Treap)

(*
structure Simple = Simple (structure Key = IntKey)
structure Bst = Bst (structure P = Simple)
*)

structure Test = struct
  open Bst
  fun test () = let
    val T0 = empty
    val T1 = insert T0 5
    val T2 = insert T1 2
    val T3 = insert T2 17
    val T4 = delete T3 5
    val T5 = delete T4 13
    val T6 = insert T5 ~1
    val T7 = filter (fn x => x > 0) T6
    val T8 = union T3 T7
  in
    toList T8
  end
end
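
On the "improve complexity?" comment above toList: each use of @ copies its left argument, so the append-based version can do quadratic work on an unbalanced tree. The sketch below shows one common sequential fix, threading an accumulator through the traversal so that every key is consed exactly once. It is written against a plain tree datatype rather than P.expose so that it stands alone; the names tree, toListAppend, toListAcc, and t are for illustration only.

```sml
(* Illustration only: a plain tree instead of the abstract type T. *)
datatype tree = Leaf | Node of tree * int * tree

(* append-based version, as in the listing: @ copies its left argument *)
fun toListAppend Leaf = []
  | toListAppend (Node (l, k, r)) = toListAppend l @ [k] @ toListAppend r

(* accumulator version: linear work, one cons per key *)
fun toListAcc t =
  let
    (* go (t, acc) puts the keys of t, in order, in front of acc *)
    fun go (Leaf, acc) = acc
      | go (Node (l, k, r), acc) = go (l, k :: go (r, acc))
  in
    go (t, [])
  end

val t = Node (Node (Leaf, 2, Leaf), 5, Node (Leaf, 17, Leaf))
```

Both functions return the same list. The accumulator version is inherently sequential, so it trades the parallelism of the two recursive calls for linear work.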

To maintain balance (for treaps, the heap invariant), we use joinMid or joinM.

To maintain the BST invariant, we use split and joinM on every insert and delete.

To implement size efficiently, we store a size in every node: the number of keys in the subtree rooted at that node. This size is recomputed whenever joinMid builds a node.

To implement tables and dictionaries, we store a value alongside the key in every node. Implementing sets does not require values.

Treaps

You can view a treap as an implementation of joinMid.

datatype T = TLeaf | TNode of T * K * Z * T

Treaps: a specific parametric implementation that implements binary search tree ADT, using joinMid(). It uses a randomized priority function $p : K \to \mathbb{Z}$

BST Invariant: for every Node(L, k, R), we have $(\forall l \in L)(l < k) \land (\forall r \in R)(k < r)$ (Sorted Key From Left To Right)

Heap Invariant: for every Node(L, k, R), we have $(\forall x \in L \cup R)(p(k) > p(x))$. (Largest Priority Goes To Root)

We usually assume that the priorities are unique unless stated otherwise. (This is not necessary for the algorithm, but it simplifies the analysis.)

Uniqueness: for any set of keys together with a unique assignment of priorities, there is exactly one tree structure that satisfies the Treap properties.
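
To make the two invariants concrete, here is a minimal checker, assuming integer keys and integer priorities stored together at each node. The datatype and the names allKeys, isBst, isHeap, isTreap, good, and bad are for illustration only and are not part of the lecture code.

```sml
(* Illustration only: each node stores (key, priority). *)
datatype treap = Leaf | Node of treap * (int * int) * treap

(* does every key in the tree satisfy ok? *)
fun allKeys ok Leaf = true
  | allKeys ok (Node (l, (k, _), r)) =
      ok k andalso allKeys ok l andalso allKeys ok r

(* BST invariant: keys in L < k < keys in R, at every node *)
fun isBst Leaf = true
  | isBst (Node (l, (k, _), r)) =
      allKeys (fn x => x < k) l andalso allKeys (fn x => x > k) r
      andalso isBst l andalso isBst r

(* is the root priority of t below p? (vacuously true for a leaf) *)
fun below p Leaf = true
  | below p (Node (_, (_, q), _)) = q < p

(* heap invariant: checking each node against its children suffices *)
fun isHeap Leaf = true
  | isHeap (Node (l, (_, p), r)) =
      below p l andalso below p r andalso isHeap l andalso isHeap r

fun isTreap t = isBst t andalso isHeap t

(* keys 2, 5, 17 with priorities p(2) = 30, p(5) = 10, p(17) = 20 *)
val good = Node (Leaf, (2, 30), Node (Node (Leaf, (5, 10), Leaf), (17, 20), Leaf))
val bad  = Node (Node (Leaf, (2, 30), Leaf), (5, 10), Node (Leaf, (17, 20), Leaf))
```

Here good is the only shape that satisfies both invariants for these priorities: key 2 has the largest priority and must be the root, and among the remaining keys 17 has the larger priority. The tree bad is a valid BST on the same keys but breaks the heap invariant at its root.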

Corollary: each of the $n!$ possible priority orders on $n$ keys determines exactly one treap, so there are at most $n!$ distinct trees that satisfy both invariants (different priority orders can yield the same tree).

If priorities are selected at random, then the tree is nearly balanced, that is, it has height $O(\log n)$, with high probability.

The quicksort algorithm generates treaps:

qsTree a =
  if |a| = 0 then TLeaf
  else let
    k = pick a random key in a
    p(k) = next largest priority
    L = {x in a | x < k}
    R = {x in a | x > k}
    (L', R') = (qsTree L) || (qsTree R)
  in
    TNode (L', k, p(k), R')
  end

By this isomorphism between the treap and the recursion tree of quicksort, we can prove that the height of the treap is $O(\log n)$ with high probability.
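
The same construction can be sketched as runnable code by viewing it from the treap side: instead of picking a random pivot and assigning it the next largest priority, the sketch below takes the priorities as given and always picks the key with the largest priority as the pivot. It assumes integer keys with distinct keys and distinct priorities, runs sequentially, and uses the names treap, maxPrio, and build for illustration only.

```sml
(* Illustration only: each node stores (key, priority). *)
datatype treap = Leaf | Node of treap * (int * int) * treap

(* the (key, priority) pair with the largest priority *)
fun maxPrio [] = raise Empty
  | maxPrio (x :: xs) =
      List.foldl
        (fn ((k, p), (bk, bp)) => if p > bp then (k, p) else (bk, bp))
        x xs

(* quicksort-style construction: the pivot is the highest-priority key *)
fun build [] = Leaf
  | build kps =
      let
        val (k, p) = maxPrio kps
        val l = List.filter (fn (x, _) => x < k) kps
        val r = List.filter (fn (x, _) => x > k) kps
      in
        (* the two recursive calls are independent of each other *)
        Node (build l, (k, p), build r)
      end

val t1 = build [(5, 10), (2, 30), (17, 20)]
val t2 = build [(17, 20), (5, 10), (2, 30)]
```

Both calls produce the same tree, with key 2 at the root, which matches the uniqueness property: the result depends only on the keys and their priorities, not on the order in which they are listed.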

Implementing Treap

To turn the above into an efficient balanced tree, we simply use a treap to implement joinMid.

signature HASH_KEY =
sig
    include KEY
    val hash : t -> int         (* not for crypto! *)
    val hashMin : int
end

structure IntHashKey :> HASH_KEY where type t = int =
struct
  type t = int
  val compare = Int.compare

  (* Knuth multiplicative hashing for p = 31, word size 32 *)
  (* The "hash function" I used in lecture was *bad* *)
  fun hash k =
      let val k32 = Word32.fromInt k
          val knuth : Word32.word = 0w2654435769  (* word literal: avoids Int overflow where Int is 32 bits or fewer *)
          val h32 = Word32.>>(Word32.*(k32, knuth),Word.fromInt(1))
      in Word32.toInt h32 end

  val hashMin = 0 (* priority of a leaf: no hash value is smaller *)
end

functor TreapCore (structure HashKey : HASH_KEY) :> ParmBST where type K.t = HashKey.t =
struct
  structure K = HashKey

  (* LEFT, (key, priority), size, RIGHT *)
  datatype T = TLeaf | TNode of T * (K.t * int) * int * T
  datatype E = Leaf | Node of T * K.t * T

  fun size TLeaf = 0
    | size (TNode (L, (k,p), s, R)) = s

  fun prior TLeaf = K.hashMin
    | prior (TNode (L, (k,p), s, R)) = p

  fun makeNode (L, (k,p), R) = TNode(L, (k,p), size L + size R + 1, R)

  (* eventually, result will have to satisfy heap properties *)
  fun joinM (L, (k,p), R) = (* require L < k < R, *)
    let
      val Lp = prior L
      val Rp = prior R
    in
      if
        p >= Lp andalso p >= Rp
      then
        makeNode(L, (k,p), R)
      else
        if
          Lp > Rp
        then
          let
            val TNode(LL, (Lk,Lp), _, LR) = L
          in
            makeNode(LL, (Lk,Lp), joinM (LR, (k,p), R))
          end
        else (* R not a leaf, Rp >= Lp *)
          let
            val TNode(RL, (Rk,Rp), _, RR) = R
          in
            makeNode(joinM (L, (k,p), RL), (Rk,Rp), RR)
          end
    end

  fun expose T = case T of
    TLeaf => Leaf
  | TNode (L, (k,p), s, R) => Node (L, k, R)

  (* require E to be valid, satisfy BST invariant *)
  fun joinMid E = case E of
    Leaf => TLeaf
  | Node (L, k, R) => joinM (L, (k, K.hash k), R)
end

Costs:

join: The cost is bounded by the heights of the two treaps. Therefore it is $O(\log |T_1| + \log |T_2|) \subseteq O(\log (|T_1| + |T_2|))$ with high probability.
split: The cost of each recursive call in split is constant, and the overall cost is the height of the tree, which is $O(\log |T|)$ with high probability.
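
The bound for split can be written as a recurrence. Let $W(h)$ be the work of split on a treap of height $h$. Each call recurs into one subtree and then calls joinMid once to re-attach the old root $k'$; since $k'$ already has a larger priority than everything in both subtrees it is joined with, joinM takes its first branch and does constant work. Under that reading,

$$W(h) = W(h-1) + O(1), \qquad W(0) = O(1) \quad\Longrightarrow\quad W(h) \in O(h),$$

and with $h \in O(\log |T|)$ with high probability this gives the stated $O(\log |T|)$ bound.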

Algorithm	Average	Worst case
Space	O(n)	O(n)
Search	O(log n)	O(n)
Insert	O(log n)	O(n)
Delete	O(log n)	O(n)

Augmented Binary Search Tree

// TODO: https://www.diderot.one/courses/136/books/579/chapter/8094
