Hash function

Birthday hashing

I discuss more applications of expected values through a technique at the heart of computer science called hashing.

The key principles relate closely to the birthday paradox, which I'll also cover.



A hash table is a commonly used data structure in computer science. It is one of the most useful data structures because of its speed in information retrieval. For example, suppose we want to store people's addresses. For each person we are tasked to store their name as the key and home address as the value. With proper notation, if $x$ is the name of a person then $h(x)$ is the index at which the home address can be found (if the hash table is implemented using an array). Nothing too special, just a function.

However, problems arise when two different names hash to the same index (so $x \neq y$ but $h(x)=h(y)$). This is called a collision: we attempt to store more than one "item" in the same "slot". The efficiency of all hashing algorithms depends on how often this happens. There are two common ways of resolving collisions, either linear probing or chaining using linked lists. With chaining, when we want to recall a person's address we hash their name (apply $h$ to $x$) to get the index in the hash table, and if multiple names hash to that index we have to check each of them until we find the right one, traversing the linked list from left to right in $\mathcal{O}(\ell)$ time where $\ell$ is the length of the linked list. Here is a really, really good video explaining the difference between linear probing and chaining: Hash Tables and Hash Functions
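The chaining scheme just described can be sketched in a few lines of Python (the class and method names below are my own; in practice you would simply use Python's built-in dict):

```python
# A minimal sketch of a hash table with chaining, assuming string keys.
class ChainedHashTable:
    def __init__(self, num_slots=8):
        self.slots = [[] for _ in range(num_slots)]  # one chain (list) per slot

    def _index(self, key):
        # h(x): deterministic within one run of the program
        # (Python salts string hashes per process, but not per call)
        return hash(key) % len(self.slots)

    def put(self, key, value):
        chain = self.slots[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))      # new key (possibly a collision)

    def get(self, key):
        # On a collision we scan the chain left to right: O(chain length)
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable(num_slots=4)
table.put("Alice", "12 Elm Street")
table.put("Bob", "34 Oak Avenue")
print(table.get("Alice"))  # -> 12 Elm Street
```

With only 4 slots, distinct names will often share a slot, which is exactly the collision case the expected-value analysis below quantifies.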

The hash function $h$ is deterministic: every time you look up the index of a name, you get the same index back.

To showcase the problem collisions pose for hashing, I'll start with a closely related question. I mentioned it in a previous post here: Expected values. The question is simple (assume no person is born on February 29th):

Birthday paradox

In a set of $23$ people the probability that two share the same birthday is greater than $0.5$ (or $50$%).

This seems extremely counterintuitive since there are $365$ possible birthdays and $23$ doesn't seem like enough people to be that sure. But think about the complementary event. Imagine adding people one by one in such a way that no two have the same birthday. Assume each day is equally likely to be a person's birthday so: \[ \mathbb{P}(\textnormal{person $i$ is born on day $j$}) = 1/365\] Trying to calculate the probability that in a set of $n$ birthdays no two are equal gives (for $n=3$): \[ \mathbb{P}(\textnormal{no two equal birthday}) = \frac{365}{365} \cdot \frac{364}{365} \cdot \frac{363}{365}\] This is just the chain rule in probability: each new person must avoid all the birthdays taken so far. In general for $k=365$ and $n=N$ we have:
\[ \mathbb{P}(\textnormal{no two equal birthday}) = \frac{k}{k} \cdot \frac{k-1}{k} \ldots \frac{k-N+1}{k} \]
\[ \mathbb{P}(\textnormal{no two equal birthday}) = \frac{k!}{k^N \cdot (k-N)!} \] Now our probability that two have the same birthday for $N$ people and $k$ days is:
\[ \mathbb{P}(\textnormal{two have equal birthday}) = 1 - \mathbb{P}(\textnormal{no two equal birthday}) \]
\[ \mathbb{P}(\textnormal{two have equal birthday}) = 1 - \frac{k!}{k^N \cdot (k-N)!}\]
Even though the probability of two people having the same birthday in a group of $100$ people is $99.99997$%, it's obviously still possible for nobody to share a birthday. Even at $364$ people the probability is $99.99 \ldots 9$% with over $150$ nines and it's still possible for nobody to have a shared birthday.
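A few lines of Python make these numbers concrete (the helper name `p_shared_birthday` is my own) by evaluating the product formula above directly:

```python
# Exact birthday-collision probability, assuming k equally likely birthdays.
def p_shared_birthday(n, k=365):
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (k - i) / k  # the (i+1)-th person avoids i taken days
    return 1 - p_all_distinct

print(p_shared_birthday(23))   # ~0.5073, the first group size above 50%
print(p_shared_birthday(100))  # ~0.9999997, still strictly below 1
```

Computing the product factor by factor avoids the huge factorials in the closed form $1 - \frac{k!}{k^N (k-N)!}$, which would overflow naive floating-point evaluation.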

Back to hashing

The basic mechanism in hashing is the same as in the assignment of birthdays. We have $n$ items and map each to one of $k$ slots. We assume the $n$ choices of slots are independent and uniform. A collision is the event that an item is mapped to a slot that already stores an item. One possible resolution of a collision adds the item at the end of a linked list that belongs to the slot. Now I'll work through the expected values arising in a few hashing scenarios.

Here's a very cool paper showcasing the methods used in digital forensics for mounting collision attacks on digital evidence: A review of collisions in cryptographic hash function used in digital forensic tools

E.V. of number of items in same slot

With $n$ items and $k$ slots available, the expected number of items per slot is $n/k$.

This is pretty intuitive, and the Pigeonhole principle gives us that there must exist a slot with at least $\lceil \frac{n}{k} \rceil$ items. The expected value could in fact serve as a proof of this fact by the Probabilistic method: since the expected value is $c$, there must exist an instance where the random variable takes a value of at least $c$.


Solution

We can assume without loss of generality that we are focusing on the first slot. Let $T_i$ be $1$ if the $i$-th item is mapped to the first slot, and $0$ otherwise. The expected value of the indicator $\textnormal{r.v.}$ $T_i$ is $\mathbb{E}[T_i] = \mathbb{P}(T_i = 1) = 1/k$, therefore by linearity of expectation the expected number of items mapped to the first slot is: \[ \mathbb{E}[T] = \sum \limits_{i=1}^{n} \mathbb{E}[T_i] = n/k \]
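As a sanity check, here is a small Monte Carlo simulation (the parameters $n=100$, $k=10$ are my own choices) that counts how many items land in one fixed slot:

```python
import random

# Monte Carlo check that one fixed slot receives n/k items on average
# (n = 100 items, k = 10 slots, so the expectation is 10).
random.seed(0)
n, k, trials = 100, 10, 20000
total = 0
for _ in range(trials):
    # count the items whose hash lands in slot 0
    total += sum(1 for _ in range(n) if random.randrange(k) == 0)
avg = total / trials
print(avg)  # close to n/k = 10
```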

E.V. of number of empty slots

With $n$ items and $k$ slots available, the expected number of empty slots is $k (1- \frac 1k)^n$.

Solution

Let $T_i$ be the indicator variable that is $1$ if slot $i$ is empty and $0$ otherwise. It is clear that $\mathbb{E}[T_i] = (1-\frac 1k)^n$, hence by summing over the $k$ slots the expected number of empty slots is: \[ \mathbb{E}[\textnormal{# of empty slots}] = k (1- \frac 1k)^n \] Especially interesting is that if $k=n$, then by randomly assigning items to slots the expected fraction of empty slots is about $36$%, since $(1-\frac{1}{n})^n \to 1/e$ as $n \to \infty$.
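A quick simulation confirms the formula and the $1/e$ limit (a sketch with my own choice of $k = n = 50$):

```python
import random

# Monte Carlo check of the empty-slot formula k*(1 - 1/k)^n,
# with k = n = 50 so the fraction of empty slots should be near 1/e.
random.seed(1)

def avg_empty_slots(n, k, trials=20000):
    total = 0
    for _ in range(trials):
        occupied = {random.randrange(k) for _ in range(n)}  # slots hit at least once
        total += k - len(occupied)                          # empty slots this trial
    return total / trials

n = k = 50
formula = k * (1 - 1 / k) ** n  # ~18.2 empty slots, about 36% of 50
sim = avg_empty_slots(n, k)
print(sim, formula)
```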

E.V. of collisions

With $n$ items and $k$ slots available, the expected number of collisions is $n-k+k (1- \frac 1k)^n$.

Solution

This can be inferred from the number of empty slots. If $A$ is the number of empty slots, then $k-A$ slots are occupied, and the first item to land in each occupied slot is hashed without a collision. Every one of the remaining $n-(k-A)$ items contributes exactly one collision, giving $n-k+A$ collisions in total. Taking the expected value and using $\mathbb{E}[A] = k(1-\frac 1k)^n$ we get: \[ \mathbb{E}[\textnormal{# of collisions}] = n-k+k (1- \frac 1k)^n \]
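Counting collisions directly in a simulation matches the formula (a sketch with my own parameters $n=100$, $k=50$):

```python
import random

# Monte Carlo check of the collision count against n - k + k*(1 - 1/k)^n.
random.seed(2)
n, k, trials = 100, 50, 10000
total = 0
for _ in range(trials):
    counts = [0] * k
    for _ in range(n):
        counts[random.randrange(k)] += 1
    # every item beyond the first in an occupied slot is one collision
    total += sum(c - 1 for c in counts if c > 0)
sim = total / trials
expected = n - k + k * (1 - 1 / k) ** n  # ~56.6 for these parameters
print(sim, expected)
```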

E.V. of the number of needed items to fill all slots

Given $k$ slots, the expected number of items needed to fill every slot is $k H_k$, where $H_k$ is the $k$-th harmonic number.

Solution

The $k$-th harmonic number is just: $H_k = 1+\frac 12 + \frac 13 + \ldots + \frac 1k$.
This is trickier than the last examples because it involves a different kind of indicator variable.
Let $T$ be the number of items needed on average to fill all the slots, and let $T_i$ be the number of items needed to go from $i-1$ filled slots to $i$ filled slots. Notice that this is not the same for different $i$: it is much easier to fill the second slot than to go from $k-1$ filled slots to $k$. This is exactly the intuition behind the solution (it is the classic coupon collector problem). When $i-1$ slots are already filled, the probability that a new item fills a new slot is: \[ \mathbb{P} = \frac{k-i+1}{k}\] hence $T_i$ has a Geometric distribution with expected value $\frac{k}{k-i+1}$. This is also the same logic behind the Mean Time to Failure from the post Unexpected expected values. You have an event with probability $p = \frac{k-i+1}{k}$ of occurring each time you "try", and the expected number of tries is $1/p = \frac{k}{k-i+1}$. With a computer crashing with probability $p$ each time you turn it on, you would expect to turn it on $1/p$ times before it crashes once. Another example is the expected number of die rolls before landing a $6$, namely $6$.
Now by the Linearity of Expectations we conclude:
\[ \mathbb{E}[T] = \mathbb{E}[T_1+T_2+ \ldots + T_k] = k \left ( \frac 11 + \frac 12 + \frac 13 + \ldots + \frac 1k \right ) = k H_k \]
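The coupon-collector argument is easy to verify by simulation (a sketch with my own choice of $k=20$; the helper name `items_to_fill` is mine):

```python
import random

# Simulate how many random items are needed until all k slots are filled,
# and compare with the coupon-collector answer k * H_k.
random.seed(3)

def items_to_fill(k):
    filled, draws = set(), 0
    while len(filled) < k:
        filled.add(random.randrange(k))  # hash one more item
        draws += 1
    return draws

k, trials = 20, 20000
avg = sum(items_to_fill(k) for _ in range(trials)) / trials
harmonic = sum(1 / i for i in range(1, k + 1))  # H_20 ~ 3.5977
print(avg, k * harmonic)  # both near 72
```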
These examples show the mathematics behind hashing and why we should care about collisions.

Interesting Hash Table Problems

Problem 1 (Oxford Problem Sheet Question)
Each bucket in a hash table represents a mapping from certain keys (all those that have a certain hash code) to values. In our implementation of hash tables, each bucket is represented as an unordered linked list, with a time for lookup of $O(N)$, where $N$ is the length of the list. Wouldn’t it be a good idea to replace these linked lists with some other data structure that represents a mapping with better asymptotic complexity?
Problem 2 (Oxford Problem Sheet Question)
Suppose a hash table with $N$ buckets is filled with $n$ items, and assume that the values of the hash function for the $n$ items are independent random variables uniformly distributed in $\text{[0..N)}$. What is the probability distribution for the number of items that fall in a particular bucket? What is the expected number of comparisons of keys that are performed when get is called and the search is (a) successful and (b) unsuccessful?
Problem 3 (Oxford Problem Sheet Question)
The hash tables we have been studying use chains of dynamically allocated records to deal with the collisions that occur when two keys have the same hash code. This scheme has the advantages that it is easy to implement, and has a performance that degrades only gradually when the table contains many records. An alternative that avoids dynamic allocation is to use a fixed array of records $\text{table[0..MAX)}$. If a key $k$ hashes to $h$, and slot $\text{table[h]}$ in the array is already occupied by another key, then slots $\text{table[h+1], table[h+2],...}$ are tried until we find either the key $k$, or an empty slot (meaning that $k$ is not present); the search wraps round at the end of the array. This idea is called “closed hashing” (or “probing” in the lectures). Write an implementation of the ‘bag-of-words’ module based on this idea, with operations to add a word, find the count of a word, and to delete a word. Think carefully about how to implement deletion. Note that the simplest way of dealing with a completely deleted word is to set its count to 0. This is an acceptable solution but, since it has the potential to fill the table with “empty” entries, you may like to try a more sophisticated approach such as closing up gaps. Give a suitable datatype invariant. Your implementation does not need to be able to store more than $\text{MAX}$ entries (so you can ignore the issue of re-sizing).

References and links for further reading