Lecture 014

Hashing Algorithms

Advantage: O(1) average time for search, insert, and delete, using O(m + n) space for m keys and n buckets.

There are 3 types of Hashing Algorithms:

  1. Bucket Hashing with Separate Chaining
  2. Bucket Hashing with Linear Probing (or Open Addressing)
  3. Cryptographic Signature Hashing

Note that there are many advanced hashing schemes, including: Bloom Filters, Cuckoo Hashing, Consistent Hashing, ...

Compare Chaining with Linear Probing

Bucket Hashing with Separate Chaining

Definition: bucket hash function

Simple Uniform Hashing Assumption (SUHA): h satisfies SUHA if each key k \in K has probability \frac{1}{|B|} of mapping to any bucket b \in B: (\forall k_i \in K, b_i \in B)(Pr\{h(k_i) = b_i\} = \frac{1}{|B|}). Moreover, the hash values of different keys are independent: writing n = |B|, Pr\{h(k_1) = b_1 \cap h(k_2) = b_2 \cap ... \cap h(k_i) = b_i\} = \frac{1}{n^i}.

Note that h(k) is deterministic, not probabilistic, so for a fixed h the statement Pr\{h(k) = b\} = \frac{1}{n} cannot literally hold. To resolve this issue, we use a universal family of hash functions H = \{h_1, h_2, ..., h_n\}; a random h_i \in H is chosen for each specific instance of the hash table.
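As a sketch of how such a family can be built, here is the classic Carter–Wegman multiply-add construction in Python (the prime p, the table size, and the function name are illustrative choices, not anything fixed by the notes):

```python
import random

def make_hash(n, p=2_305_843_009_213_693_951):  # p: a Mersenne prime larger than any key
    """Draw one function from the universal family
    h_{a,b}(k) = ((a*k + b) mod p) mod n."""
    a = random.randrange(1, p)   # a != 0
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % n

# One fixed member of H is drawn per table instance.
h = make_hash(n=16)
# h is deterministic: h(42) always returns the same bucket for this h,
# but the bucket is uniform over the random choice of h from the family.
```

Picking one random member per table instance is what lets the SUHA-style probability statements hold over the choice of h, even though each individual h is deterministic.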

What is E[B_i], the expected number of keys in bucket i? Let I_k be the indicator random variable that key k maps to bucket i; then:

E[B_i] = \sum_{k = 1}^m E[I_k] = \sum_{k = 1}^m \frac{1}{n} = \frac{m}{n} = \alpha

You can also conclude this by knowing that B_i \sim \text{Binomial}(m, \frac{1}{n})
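A quick simulation (a sketch; the sizes and seed are arbitrary) shows that the average load is exactly \alpha = \frac{m}{n}, while individual bucket loads fluctuate around it:

```python
import random

random.seed(0)
n, m = 100, 1000          # n buckets, m keys
alpha = m / n             # expected load per bucket

loads = [0] * n
for _ in range(m):
    loads[random.randrange(n)] += 1   # SUHA: each key uniform over buckets

avg = sum(loads) / n
assert avg == alpha       # the m keys split over n buckets, so avg is exactly m/n
```

The average is exactly \alpha by counting (the loads sum to m); it is the per-bucket distribution of B_i that the Binomial/Poisson analysis describes.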

When m is large and p = \frac{1}{n} is small, B_i \sim \text{Binomial}(m, \frac{1}{n}) \simeq \text{Poisson}(mp) = \text{Poisson}(\alpha), with E[B_i] \simeq \alpha and Var(B_i) \simeq \alpha

// TODO: exit exercise 4.4 to include binomial's relation with Poisson

When \alpha is also large, B_i \sim \text{Binomial}(m, \frac{1}{n}) \simeq \text{Poisson}(\alpha) \simeq \text{Normal}(\alpha, \alpha)

If the number of buckets n equals the number of keys m (so \alpha = 1), then as we showed in the last section, with high probability, \max_i B_i \in O(\frac{\ln n}{\ln \ln n}).

If the number of buckets n is smaller than the number of keys m, and if m \geq 2n\ln n, with high probability (Pr \geq 1 - \frac{1}{n}), (\forall i)(|B_i| < e \alpha).
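This bound can be spot-checked empirically. The sketch below uses arbitrary sizes and a fixed seed; note the guarantee is only probabilistic (it holds with probability at least 1 - \frac{1}{n}), so an unlucky run could in principle exceed the bound:

```python
import math
import random

random.seed(0)
n = 1000
m = int(2 * n * math.log(n)) + 1   # ensure m >= 2 n ln n
alpha = m / n

loads = [0] * n
for _ in range(m):
    loads[random.randrange(n)] += 1   # SUHA: each key uniform over buckets

# With probability >= 1 - 1/n, every bucket holds fewer than e * alpha keys.
print(max(loads), "<", math.e * alpha)
```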

Proof: Here is what we want to show

\begin{align*} Pr\{\forall i (B_i < e\alpha)\} \geq& 1 - \frac{1}{n}\\ Pr\{\exists i (B_i \geq e\alpha)\} \leq& \frac{1}{n} \tag{negation}\\ n \cdot Pr\{B_i \geq e\alpha\} \leq& \frac{1}{n} \tag{suffices, by Union Bound}\\ Pr\{B_i \geq e\alpha\} \leq& \frac{1}{n^2}\\ \end{align*}

Notice \alpha = \frac{m}{n} \geq 2\ln n > 1, therefore \alpha \in \Omega(\ln n) (but not necessarily \Theta(n))

\begin{align*} &Pr\{B_i \geq e\alpha\}\\ =& Pr\{B_i \geq (1 + (e - 1))\alpha\} \tag{Notice $\epsilon = e - 1 > 0, \mu = \alpha > 1$}\\ <& \left(\frac{e^\epsilon}{(1 + \epsilon)^{1 + \epsilon}}\right)^\mu \tag{by Ugly Chernoff Bound}\\ =& (e^{-1})^\alpha\\ <& (e^{-1})^{2\ln n}\\ =& \frac{1}{n^2}\\ \end{align*}

Disadvantage of Separate Chaining: extra space for chain pointers, and poor cache locality when traversing chains.

Bucket Hashing with Linear Probing

probing: searching through alternative locations in the array (the probe sequence) until either the target record is found, or an unused array slot is found.

In the case of linear probing, the probe sequence for a particular key k is:

\langle{h(k) \bmod n, (h(k) + 1) \bmod n, (h(k) + 2) \bmod n, ..., (h(k) + n - 1) \bmod n}\rangle

We need n > m to use linear probing. Typically, implementations use n > 2m.

Insert: Given a key k

  1. Compute a = h(k) \bmod n
  2. If cell a is occupied, probe (a + 1) \bmod n, and so on; otherwise (an empty cell or a tombstone) insert at a

Search / Delete: Given a key k

  1. Compute a = h(k) \bmod n
  2. If cell a is occupied by a non-matching key, or holds a tombstone, probe (a + 1) \bmod n, repeating until a match or an empty cell is found
  3. On a successful delete, replace the cell with a tombstone
  4. Rebuild the table when the number of tombstones gets high
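The insert/search/delete rules above can be sketched as a minimal open-addressed table in Python. The `EMPTY`/`TOMBSTONE` sentinels, the class name, and the fixed capacity are illustrative simplifications; a real implementation would also resize, rehash, and check for duplicates before reusing a tombstone:

```python
EMPTY, TOMBSTONE = object(), object()

class LinearProbingTable:
    def __init__(self, n):
        self.n = n
        self.cells = [EMPTY] * n

    def _probe(self, key):
        """Yield the probe sequence h(k), h(k)+1, ... (mod n)."""
        a = hash(key) % self.n
        for i in range(self.n):
            yield (a + i) % self.n

    def insert(self, key):
        for a in self._probe(key):
            cell = self.cells[a]
            if cell is EMPTY or cell is TOMBSTONE:
                self.cells[a] = key      # reuse empty cells and tombstones
                return
            if cell == key:              # key already present
                return
        raise RuntimeError("table full")

    def search(self, key):
        for a in self._probe(key):
            cell = self.cells[a]
            if cell is EMPTY:            # an empty cell ends the search
                return False
            if cell is not TOMBSTONE and cell == key:
                return True
        return False

    def delete(self, key):
        for a in self._probe(key):
            cell = self.cells[a]
            if cell is EMPTY:
                return
            if cell is not TOMBSTONE and cell == key:
                self.cells[a] = TOMBSTONE   # keep later keys reachable
                return
```

The tombstone is what keeps keys placed later in the same probe sequence reachable after a delete; without it, `search` would stop early at the freed cell.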

Disadvantage of Linear Probing: primary clustering, where long runs of occupied cells build up and slow down later probe sequences.

Bucket Hashing with Open Addressing

In the general case of open addressing, the probe sequence for a particular key k is:

\langle{h(k, 0), h(k, 1), h(k, 2), ..., h(k, n - 1)}\rangle

In the ideal world (uniform hashing), the probe sequence for each key is equally likely to be any one of the n! permutations of the buckets \langle{0, 1, 2, ..., n - 1}\rangle. This randomizes the probe sequence with respect to each key k.

Let A_i denote the event "the ith cell that we look at is occupied". We are interested in the search cost X, the number of occupied cells examined before finding an empty one.

\begin{align*} &Pr\{X > i\}\\ =& Pr\{A_1 \cap A_2 \cap ... \cap A_i\}\\ =& Pr\{A_1\} \cdot Pr\{A_2 | A_1\} \cdot Pr\{A_3 | A_1 \cap A_2\} \cdot ... \cdot Pr\{A_i | A_1 \cap A_2 \cap ... \cap A_{i - 1}\}\\ =& \frac{m}{n} \cdot \frac{m - 1}{n - 1} \cdot \frac{m - 2}{n - 2} \cdot ... \cdot \frac{m - i + 1}{n - i + 1}\\ \leq& \alpha^i\\ \end{align*}

Now, we want to calculate E[X] using Pr\{X > i\}.

\begin{align*} &E[X]\\ =& \sum_{i = 0}^{n - 1} Pr\{X > i\}\\ \leq& \sum_{i = 0}^{n - 1} \alpha^i\\ \leq& \sum_{i = 0}^{\infty} \alpha^i\\ =& \frac{1}{1 - \alpha}\\ \end{align*}
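The bound can be checked numerically by computing E[X] = \sum_i Pr\{X > i\} exactly from the product form above and comparing it with \frac{1}{1 - \alpha} (a sketch; the sizes are arbitrary):

```python
def expected_probes(m, n):
    """E[X] = sum_{i>=0} Pr{X > i}, where
    Pr{X > i} = prod_{j<i} (m - j)/(n - j)  (empty product = 1)."""
    total, p = 0.0, 1.0   # p holds Pr{X > i}, starting at i = 0
    for j in range(m + 1):
        total += p
        p *= (m - j) / (n - j)   # one more occupied cell in the prefix
    return total                  # Pr{X > i} = 0 once i > m

m, n = 500, 1000
alpha = m / n
print(expected_probes(m, n), "<=", 1 / (1 - alpha))
```

This is deterministic (no simulation needed), since the product form gives Pr\{X > i\} exactly under uniform hashing.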

Cryptographic Signature Hashing

Definition: cryptographic signature

|U| \gg |B| \gg |K|, \alpha = \frac{m}{n} = \frac{|K|}{|B|} \ll 1

Hash Collision: when two different keys k_i, k_j have the same hash value (h(k_i) = h(k_j) with k_i \neq k_j)

Given that we hash |K| = m keys, how large should |B| = n be to achieve a low probability of collision? Equivalently: given that we hash m keys, what is the probability that none of their hashes collide?

Let A denote the event "no collision", and let A_i denote the event that key i has a different signature from all of the first i - 1 keys.

\begin{align*} A =& \bigcap_{i = 1}^m A_i\\ Pr\{A\} =& Pr\{A_1\} \cdot \prod_{i = 2}^m Pr\{A_i | \bigcap_{j = 1}^{i - 1} A_j\}\\ =& 1 \cdot \prod_{i = 2}^m \left(1 - \frac{i - 1}{n}\right)\\ =& \prod_{i = 1}^{m - 1} \left(1 - \frac{i}{n}\right)\\ \leq& \prod_{i = 1}^{m - 1} e^{- \frac{i}{n}} \tag{by $1 - \frac{x}{n} \leq e^{- \frac{x}{n}}$, close to equality for large $n$}\\ =& \exp \left(-\frac{1}{n} \sum_{i = 1}^{m - 1} i\right)\\ =& \exp \left(-\frac{m(m - 1)}{2n}\right)\\ \end{align*}

Therefore, the probability that there is no collision (given m, n) is:

p(m, n) = \prod_{i = 1}^{m - 1} \left(1 - \frac{i}{n}\right) \leq e^{-\frac{m(m - 1)}{2n}}

Note that for n \gg m, the upper bound is close to equality.

p(m, n) \simeq e^{-\frac{m^2}{2n}}

The above approximation implies we need m \in O(\sqrt{n}) to keep the probability of no collision high. In fact, in expectation, we can insert about 1 + \sqrt{\frac{\pi n}{2}} keys before the first collision.
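The exact product and the exponential upper bound can be compared directly; the classic birthday-paradox instance (m = 23 keys, n = 365 buckets) makes a convenient sketch:

```python
import math

def p_no_collision(m, n):
    """Exact probability that m independently hashed keys
    are all distinct among n equally likely values."""
    p = 1.0
    for i in range(1, m):
        p *= 1 - i / n
    return p

m, n = 23, 365
exact = p_no_collision(m, n)
bound = math.exp(-m * (m - 1) / (2 * n))

assert exact <= bound          # product <= e^{-m(m-1)/(2n)}
assert exact < 0.5             # with 23 "keys", a collision is more likely than not
```

Note 23 is close to \sqrt{365} \approx 19.1, matching the m \in O(\sqrt{n}) rule of thumb.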

If we want e^{-\frac{m^2}{2n}} \simeq 1, then

\begin{align*} &e^{-\frac{m^2}{2n}} \simeq 1\\ \impliedby& \frac{m^2}{2n} \simeq 0\\ \impliedby& \frac{m^2}{2 \cdot 2^{256}} \simeq 0 \tag{e.g. a 256-bit signature, $n = 2^{256}$}\\ \end{align*}

// TODO: include alternative way to calculate E[X]
