Lecture 014

Hashing Algorithms

Advantage: $O(1)$ average for search, insert, delete, and $O(m)$ space.

There are 3 types of Hashing Algorithms:

Bucket Hashing with Separate Chaining
Bucket Hashing with Linear Probing (or Open Addressing)
Cryptographic Signature Hashing

Note that there are many more advanced hashing schemes, including Bloom Filters, Cuckoo Hashing, Consistent Hashing, ...

Bucket Hashing with Separate Chaining

Definition: bucket hash function

$U$ : the space of key Universe
$K \subset U$ : the actual set of keys that we use. $m = |K| \ll |U|$ .
$B$ : the set of buckets, $n = |B|$ . We also write $B_i$ for the number of keys that land in bucket $i$ ; under the assumption below, $B_i \sim \text{Binomial}(m, \frac{1}{n})$ .
$k \in U$ : single key
$h : U \rightarrow B$ : hashing function
$\alpha = \frac{m}{n}$ : load factor, how many keys per bucket on average

Simple Uniform Hashing Assumption (SUHA): $h$ is SUHA if each key $k \in K$ has probability $\frac{1}{|B|}$ of mapping to any bucket $b \in B$ . ( $(\forall k_i \in K, b_i \in B)(Pr\{h(k_i) = b_i\} = \frac{1}{|B|})$ ). Moreover, the hash values of different keys are independent, $Pr\{h(k_1) = b_1 \cap h(k_2) = b_2 \cap ... \cap h(k_i) = b_i\} = \frac{1}{n^i}$ .

Note that for a fixed $h$ , $h(k)$ is deterministic, not probabilistic: $Pr\{h(k) = b\}$ is either 0 or 1, so $Pr\{h(k) = b\} = \frac{1}{n}$ does not hold literally. To resolve this issue, we use a universal family of hashing functions $H = \{h_1, h_2, ..., h_t\}$ and choose one function $h_i \in H$ at random for each instance of the hash table. The probability is then taken over the choice of $h_i$ .
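To make the idea of drawing a random function from a family concrete, here is a minimal sketch using the well-known family $h_{a,b}(k) = ((ak + b) \mod P) \mod n$ . The lecture does not name a specific family, so this choice, the prime `P`, and the helper name `make_hash` are assumptions for illustration; keys are assumed to be non-negative integers smaller than `P`.

```python
import random

# Assumption: keys are non-negative integers smaller than the prime P.
P = 2_147_483_647  # the Mersenne prime 2**31 - 1

def make_hash(n, rng=random):
    """Draw one function h(k) = ((a*k + b) mod P) mod n at random from the family."""
    a = rng.randrange(1, P)  # a is drawn from [1, P - 1]
    b = rng.randrange(0, P)  # b is drawn from [0, P - 1]

    def h(k):
        return ((a * k + b) % P) % n

    return h

rng = random.Random(0)   # fixed seed only so the sketch is reproducible
h = make_hash(8, rng)    # one table instance, one randomly chosen h
print([h(k) for k in range(10)])
```

Once `h` has been drawn it is an ordinary deterministic function; the randomness lives entirely in the choice of `a` and `b`.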

What is $E[B_i]$ ? Let $I_k$ be the indicator random variable of the event that key $k$ maps to bucket $i$ , so that $B_i = \sum_{k = 1}^m I_k$ . Then:

$E[B_i] = \sum_{k = 1}^m E[I_k] = \sum_{k = 1}^m \frac{1}{n} = \frac{m}{n} = \alpha$

You can also conclude this from $B_i \sim \text{Binomial}(m, \frac{1}{n})$ , whose mean is $m \cdot \frac{1}{n}$ .

When $m$ is large and $p = \frac{1}{n}$ is small, $B_i \sim \text{Binomial}(m, \frac{1}{n}) \simeq \text{Poisson}(mp) = \text{Poisson}(\alpha)$ with $E[B_i] \simeq \alpha, Var(B_i) \simeq \alpha$ .
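The quality of the Poisson approximation can be checked numerically. The sketch below compares the two probability mass functions term by term; the sizes `m = 10_000` and `n = 5_000` are arbitrary illustrative values, not taken from the lecture.

```python
from math import comb, exp, factorial

def binomial_pmf(j, m, p):
    """Pr{B = j} for B ~ Binomial(m, p)."""
    return comb(m, j) * p**j * (1 - p)**(m - j)

def poisson_pmf(j, lam):
    """Pr{B = j} for B ~ Poisson(lam)."""
    return exp(-lam) * lam**j / factorial(j)

m, n = 10_000, 5_000   # illustrative sizes, so alpha = 2
alpha = m / n
for j in range(6):
    # The two columns should agree to several decimal places.
    print(j, binomial_pmf(j, m, 1 / n), poisson_pmf(j, alpha))
```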

// TODO: exit exercise 4.4 to include binomial's relation with Poisson

When $\alpha$ is large, $B_i \sim \text{Binomial}(m, \frac{1}{n}) \simeq \text{Poisson}(\alpha) \simeq \text{Normal}(\alpha, \alpha)$

If the number of buckets $n$ equals the number of keys $m$ , then as we showed in the last section, with high probability, $\max_i B_i \in O(\frac{\ln n}{\ln \ln n})$ .

If the number of buckets $n$ is smaller than the number of keys $m$ , and if $m \geq 2n\ln n$ , then with high probability ( $Pr \geq 1 - \frac{1}{n}$ ), $(\forall i)(B_i < e \alpha)$ .

Proof: Here is what we want to show

$\begin{align*} &Pr\{\forall i (B_i < e\alpha)\} \geq 1 - \frac{1}{n}\\ \iff& Pr\{\exists i (B_i \geq e\alpha)\} \leq \frac{1}{n} \tag{negation}\\ \impliedby& nPr\{B_i \geq e\alpha\} \leq \frac{1}{n} \tag{Union Bound}\\ \iff& Pr\{B_i \geq e\alpha\} \leq \frac{1}{n^2}\\ \end{align*}$

Notice $\alpha = \frac{m}{n} \geq 2\ln n > 1$ (for $n \geq 2$ ), therefore $\alpha \in \Omega(\ln n)$ .

$\begin{align*} &Pr\{B_i \geq e\alpha\}\\ =& Pr\{B_i \geq (1 + (e - 1))\alpha\} \tag{Notice $\epsilon = e - 1 > 0, \mu = \alpha > 1$}\\ <& \left(\frac{e^\epsilon}{(1 + \epsilon)^{1 + \epsilon}}\right)^\mu \tag{by Ugly Chernoff Bound}\\ =& \left(\frac{e^{e - 1}}{e^e}\right)^\alpha\\ =& (e^{-1})^\alpha\\ \leq& (e^{-1})^{2\ln n} \tag{since $\alpha \geq 2\ln n$}\\ =& \frac{1}{n^2}\\ \end{align*}$
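As a sanity check on the chain of inequalities, the exact binomial tail can be computed for a small instance and compared with $e^{-\alpha}$ and $\frac{1}{n^2}$ . The value `n = 16` and the helper name `binomial_tail` are illustrative assumptions; `m` is taken as the smallest integer with $m \geq 2n\ln n$ .

```python
from math import ceil, comb, e, log

def binomial_tail(t, m, p):
    """Exact Pr{B >= t} for B ~ Binomial(m, p)."""
    return sum(comb(m, j) * p**j * (1 - p)**(m - j)
               for j in range(ceil(t), m + 1))

n = 16
m = ceil(2 * n * log(n))   # smallest m satisfying m >= 2 n ln n
alpha = m / n
tail = binomial_tail(e * alpha, m, 1 / n)

# Expect tail <= e^{-alpha} <= 1/n^2, matching the proof above.
print(tail, e**(-alpha), 1 / n**2)
```

For this instance the exact tail is considerably smaller than the bound, which is typical: the Chernoff bound is convenient rather than tight.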

Disadvantages of Separate Chaining:

It requires pointer storage overhead
Memory allocation is scattered all over the memory space (not cache friendly)
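The separate chaining scheme analysed in this section can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the class name `ChainedHashTable` is hypothetical, Python's built-in `hash()` stands in for $h$ , and each chain is a plain Python list rather than a linked list.

```python
class ChainedHashTable:
    """Hash table with separate chaining; each bucket holds (key, value) pairs."""

    def __init__(self, n=8):
        self.buckets = [[] for _ in range(n)]  # n independent chains

    def _bucket(self, key):
        # Built-in hash() plays the role of h; the modulo picks one of n buckets.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # key already present: overwrite
                return
        bucket.append((key, value))

    def search(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None  # key not found

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False

table = ChainedHashTable(n=4)
for key in range(10):
    table.insert(key, key * key)
print(table.search(7), [len(b) for b in table.buckets])
```

Every operation scans a single chain, so its cost is proportional to that chain's length, which is $\alpha$ in expectation.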

Bucket Hashing with Linear Probing

probing: searching through alternative locations in the array (the probe sequence) until either the target record is found, or an unused array slot is found.

In the case of linear probing, our probing sequence, for a particular key $k$ is:

$\langle{h(k) \mod n, (h(k) + 1) \mod n, (h(k) + 2) \mod n, ..., (h(k) + n - 1) \mod n}\rangle$

We need $n > m$ to use linear probing. Typically, we use $n > 2m$ in implementation.

Insert: Given a key $k$

Get the hash $a = h(k) \mod n$
If cell $a$ is occupied, move on to $(a + 1) \mod n$ and repeat; otherwise (the cell is empty or holds a tombstone) insert $k$ in cell $a$

Search / Delete: Given a key $k$

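Get the hash $a = h(k) \mod n$
If cell $a$ is occupied by a different key or holds a tombstone, move on to $(a + 1) \mod n$ and repeat until a match or an empty cell is found. An empty cell means $k$ is not in the table
On a successful delete, replace the key with a tombstone so that later searches do not stop early
Rebuild the table when the number of tombstones gets high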
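The insert, search, and delete steps above can be sketched as a small set-like table. The class name `LinearProbingSet` and the sentinel objects are hypothetical, Python's built-in `hash()` stands in for $h$ , and two simplifications are assumed: insert first searches for the key to avoid storing duplicates, and the rebuild step for tombstone-heavy tables is left out.

```python
EMPTY = object()      # sentinel: cell has never been used
TOMBSTONE = object()  # sentinel: cell held a key that was deleted

class LinearProbingSet:
    def __init__(self, n=16):
        self.slots = [EMPTY] * n

    def _probe(self, key):
        """Yield the probe sequence (h(k) + j) mod n for j = 0 .. n-1."""
        n = len(self.slots)
        start = hash(key) % n
        for j in range(n):
            yield (start + j) % n

    def search(self, key):
        """Return the index of key, or None if it is absent."""
        for a in self._probe(key):
            slot = self.slots[a]
            if slot is EMPTY:
                return None          # an empty cell ends the search
            if slot is not TOMBSTONE and slot == key:
                return a             # match found
        return None                  # scanned the whole table

    def insert(self, key):
        if self.search(key) is not None:
            return True              # already present, nothing to do
        for a in self._probe(key):
            if self.slots[a] is EMPTY or self.slots[a] is TOMBSTONE:
                self.slots[a] = key  # reuse the first free cell
                return True
        return False                 # table is full

    def delete(self, key):
        a = self.search(key)
        if a is None:
            return False
        self.slots[a] = TOMBSTONE    # keep later probe sequences intact
        return True
```

With `n = 8`, the integer keys 1, 9, and 17 all hash to cell 1 and end up in cells 1, 2, and 3. Deleting 9 leaves a tombstone in cell 2, so a later search for 17 still walks past it instead of stopping early.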

Disadvantage of Linear Probing:

Linear Probing causes clustering because we always try the next cell. Clustering slows down insert and search.

Bucket Hashing with Open Addressing

In the case of general open addressing, the probe sequence for a particular key $k$ is:

$\langle{h(k, 0), h(k, 1), h(k, 2), ..., h(k, n - 1)}\rangle$

In the ideal world, the probe sequence for each key is equally likely to be any one of the $n!$ permutations of the buckets $\langle{0, 1, 2, ..., n - 1}\rangle$ . This randomizes the probe sequence with respect to each $k$ .

Let $A_i$ denote the event "the $i$th cell that we look at is occupied". We are interested in the search cost $X$ , the number of cells we look at until we find an unoccupied one.

$\begin{align*} &Pr\{X > i\}\\ =& Pr\{A_1 \cap A_2 \cap ... \cap A_i\}\\ =& Pr\{A_1\} \cdot Pr\{A_2 | A_1\} \cdot Pr\{A_3 | A_1 \cap A_2\} \cdot ... \cdot Pr\{A_i | A_1 \cap A_2 \cap ... \cap A_{i - 1}\}\\ =& \frac{m}{n} \cdot \frac{m - 1}{n - 1} \cdot \frac{m - 2}{n - 2} \cdot ... \cdot \frac{m - i + 1}{n - i + 1}\\ \leq& \alpha^i\\ \end{align*}$

Now, we want to calculate $E[X]$ using $Pr\{X > i\}$ .

$\begin{align*} &E[X]\\ =& \sum_{i = 0}^{n - 1} Pr\{X > i\}\\ \leq& \sum_{i = 0}^{n - 1} \alpha^i\\ \leq& \sum_{i = 0}^{\infty} \alpha^i\\ =& \frac{1}{1 - \alpha} \tag{geometric series, $\alpha < 1$}\\ \end{align*}$
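To see how loose the bound is, the sum $\sum_i Pr\{X > i\}$ can be evaluated exactly from the product formula for $Pr\{X > i\}$ . The function name `expected_probes` is a hypothetical helper, and the sketch assumes the idealized uniform probing model with $m < n$ .

```python
def expected_probes(m, n):
    """Exact E[X] = sum over i >= 0 of Pr{X > i}, assuming uniform probing and m < n."""
    total = 0.0
    prob = 1.0  # prob holds Pr{X > i}; Pr{X > 0} = 1
    for i in range(m + 1):
        total += prob
        # Pr{X > i + 1} = Pr{X > i} * (m - i) / (n - i); it reaches 0 once i = m.
        prob *= (m - i) / (n - i)
    return total

for m, n in [(50, 100), (90, 100), (99, 100)]:
    alpha = m / n
    print(alpha, expected_probes(m, n), 1 / (1 - alpha))
```

The gap between the exact value and $\frac{1}{1 - \alpha}$ is small at moderate load and widens as $\alpha$ approaches 1.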

Cryptographic Signature Hashing

Definition: cryptographic signature

$U$ : the space of key Universe
$K \subset U$ : the actual set of keys that we use. $m = |K| \ll |U|$ .
$B$ : the space of all possible signatures. $n = |B|$ .
$k \in U$ : single key
$h : U \rightarrow B$ : hashing function

$|U| \gg |B| \gg |K|, \alpha = \frac{m}{n} = \frac{|K|}{|B|} \ll 1$

Hash Collision: when two different keys $k_i, k_j$ have the same hash value ( $h(k_i) = h(k_j)$ given $k_i \neq k_j$ )

Hash Collision is unavoidable by the Pigeon-hole Principle because $|U| \gg |B|$ .

Given that we hashed $|K| = m$ many keys, how large should $|B| = n$ be to achieve a low probability of collision? Equivalently, given that we hashed $|K| = m$ many keys, what is the probability that none of their hashes collide?

Let $A$ denote the event of "no collision". Let $A_i$ denote the event that key $i$ has a different signature than all of the first $i - 1$ keys.

$\begin{align*} A =& \bigcap_{i = 1}^m A_i\\ Pr\{A\} =& Pr\{A_1\} \cdot \prod_{i = 2}^m Pr\{A_i | \bigcap_{j = 1}^{i - 1} A_j\}\\ =& 1 \cdot \prod_{i = 2}^m \left(1 - \frac{i - 1}{n}\right)\\ =& \prod_{i = 1}^{m - 1} \left(1 - \frac{i}{n}\right)\\ \leq& \prod_{i = 1}^{m - 1} e^{- \frac{i}{n}} \tag{by $1 - \frac{x}{n} \leq e^{- \frac{x}{n}}$, which is close to equality when $n$ is large}\\ =& \exp \left(-\frac{1}{n} \sum_{i = 1}^{m - 1} i\right)\\ =& \exp \left(-\frac{m(m - 1)}{2n}\right)\\ \end{align*}$

Therefore, the probability that there is no collision (given $m, n$ ) is:

$p(m, n) = \prod_{i = 1}^{m - 1} \left(1 - \frac{i}{n}\right) \leq e^{-\frac{m(m - 1)}{2n}}$

Note that for $n \gg m$ , the upper bound is close to equality.

$p(m, n) \simeq e^{-\frac{m^2}{2n}}$
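The exact product and its exponential upper bound are easy to compare numerically. The sketch below uses the classic birthday setting $m = 23, n = 365$ as an illustrative instance; the function names are hypothetical helpers.

```python
from math import exp

def no_collision_probability(m, n):
    """Exact p(m, n) = prod_{i=1}^{m-1} (1 - i/n)."""
    p = 1.0
    for i in range(1, m):
        p *= 1 - i / n
    return p

def upper_bound(m, n):
    """The bound exp(-m(m-1) / (2n)) derived above."""
    return exp(-m * (m - 1) / (2 * n))

m, n = 23, 365
# Both values come out close to one half: 23 keys in 365 slots
# already collide with probability of roughly 50%.
print(no_collision_probability(m, n), upper_bound(m, n))
```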

The above approximation implies we need $m \in O(\sqrt{n})$ , with a small constant, to ensure the probability of no collision is high. In fact, in expectation, we can insert approximately $1 + \sqrt{\frac{\pi n}{2}}$ many keys before a collision.
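The expected count can be computed exactly and compared with the $\sqrt{\frac{\pi n}{2}}$ estimate. The sketch assumes one particular reading: $T$ is the index of the first key whose signature repeats an earlier one, so that $Pr\{T > m\} = p(m, n)$ and $E[T] = \sum_{m \geq 0} p(m, n)$ . The function name is a hypothetical helper and $n = 365$ is an illustrative value.

```python
from math import pi, sqrt

def expected_keys_until_collision(n):
    """Exact E[T], where T is the index of the first key that collides."""
    total = 0.0
    p = 1.0  # p holds Pr{T > m} = p(m, n), starting at m = 0
    for m in range(n + 1):
        total += p
        p *= 1 - m / n  # p(m + 1, n) = p(m, n) * (1 - m/n)
    return total

n = 365
# The exact value and the estimate differ by less than one key.
print(expected_keys_until_collision(n), 1 + sqrt(pi * n / 2))
```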

For example, with 256-bit signatures ( $n = 2^{256}$ ), if we want $e^{-\frac{m^2}{2n}} \simeq 1$ , then

$\begin{align*} &e^{-\frac{m^2}{2n}} \simeq 1\\ \impliedby& -\frac{m^2}{2n} \simeq 0\\ \impliedby& \frac{m^2}{2 \cdot 2^{256}} \simeq 0\\ \impliedby& m \ll 2^{128}\\ \end{align*}$

// TODO: include alternative way to calculate E[X]
