Hashing

Tags: CS161, Data Structures

Hash tables: the big idea

We want a data structure that can take in incomparable objects and store them such that the average insert, delete, and search operations are $O(1)$. Of course, there is no free lunch; the worst case is $O(n)$.

To be more precise, we want to avoid collisions. The more collisions there are, the slower these operations become.

To do this, we have hash families. A hash function is a deterministic function, but we can sample one from a hash family at random. We will come back to this later. In particular, we want to sample from a universal hash family.

Details on implementation

A hash table is just an array indexed by the output of a deterministic hash function applied to the key. If two keys collide, there are two common options:

  1. Chaining: keep a linked list per bin and append every colliding element to that bin’s list (see the sketch below)
  1. Open addressing: bump the colliding element to the next open array index
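Here is a minimal sketch of the chaining approach in Python (the class name and the use of Python’s built-in `hash`, reduced modulo the number of bins, are illustrative assumptions rather than anything prescribed by these notes):

```python
class ChainedHashTable:
    """Hash table with chaining: each bin holds a list of (key, value) pairs."""

    def __init__(self, num_bins=8):
        self.bins = [[] for _ in range(num_bins)]

    def _bin(self, key):
        # Python's built-in hash stands in for the hash function here.
        return hash(key) % len(self.bins)

    def insert(self, key, value):
        chain = self.bins[self._bin(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:              # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self.bins[self._bin(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self.bins[self._bin(key)]
        self.bins[self._bin(key)] = [(k, v) for k, v in chain if k != key]
```

Each operation only scans one chain, so everything is fast as long as chains stay short, which is exactly what the rest of these notes try to guarantee.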

Details on complexity

When we say that hashing gives $O(1)$ complexity, we mean that, over all the hash functions we can sample, the expected number of elements per bin is a constant.

Working towards hash tables

  1. We could just try using the element values themselves as the addresses. This would definitely yield $O(1)$ search, insertion, and deletion, but it takes a ridiculous amount of space
  1. We could try something easy, like always using a fixed digit of the number as the address. The problem here is an adversarial agent: by knowing the hash function, it is possible to game the system and force the worst-case scenario (see the sketch below)
    1. ⇒ The key upshot is that no single deterministic hash function can beat the adversary. When we say “deterministic” here, we mean that there is only one hash function and we always use it. All hash functions are deterministic, but we do have the option of choosing a random hash function from a pool of functions
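A tiny illustration of how a fixed, publicly known hash function gets gamed (the last-digit rule here is an assumed example, not from the notes): the adversary simply picks keys that all share a last digit, so every insert lands in one bin and lookups degrade to $O(n)$.

```python
def last_digit_hash(key):
    # A fixed, publicly known hash function: bin = last decimal digit.
    return key % 10

num_bins = 10
bins = [[] for _ in range(num_bins)]

# The adversary knows the hash function, so it picks keys ending in 7.
adversarial_keys = [7, 17, 27, 37, 47, 57, 67, 77, 87, 97]
for k in adversarial_keys:
    bins[last_digit_hash(k)].append(k)

# Every key lands in bin 7, so searches in that bin degrade to O(n).
print([len(b) for b in bins])  # [0, 0, 0, 0, 0, 0, 0, 10, 0, 0]
```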

Formalism

Let $U$ be a universe of size $M$. This $M$ is absolutely huge. At most $n$ keys will actually show up, and we would like to map each key to its own bin as it shows up.

The Hash Function: an intro

The hash function can be viewed as a mapping that turns the input distribution into something as close to a uniform distribution as possible. If we knew that the keys were uniformly random, we could just compress the keys down into the appropriate bins, no questions asked. The reason we can’t do that is that our keys are often not uniformly distributed, and what’s more, an adversary can intentionally change the distribution.

Analyzing hash functions

Define the indicator random variable $X_{kl} = 1\{h(k) = h(l)\}$. Note that the randomness here comes from $h$, because the inputs are not random! For universality, we want to show that $E[X_{kl}] \leq 1/m$, where $m$ is the number of buckets (see more details in the later section), for all $k \neq l$.

The number of keys that end up in the same bucket as a given key $k$ is just $\sum_{l} X_{kl}$ (counting $k$ itself).

So a good hash family will satisfy this constraint on the expected number of keys per bucket (see the expectation worked out below).
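Spelling out the expectation by linearity (a worked step, not explicit in the original notes): if $n$ keys are inserted into $m$ buckets and $E[X_{kl}] \leq 1/m$ for all $k \neq l$, then the expected number of keys sharing key $k$'s bucket is

$$\begin{align}
E\Big[\sum_{l} X_{kl}\Big] &= 1 + \sum_{l \neq k} E[X_{kl}] \\
&\leq 1 + \frac{n-1}{m},
\end{align}$$

which is a constant whenever the number of buckets $m$ grows proportionally to $n$.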

Zooming out again

So we are taking the expectation over the random choice of hash function. Again, each hash function is itself deterministic, so we are not taking the expectation with respect to a specific hash function. If the choice were not random, an adversary could find out which items collide and then continuously feed us those elements.

Random hash functions

Hashing can be seen as a sort of game in which the adversary chooses any $n$ items and then you choose the random hash function. The adversary knows all about the process but has no control over the random selection. This is a little different from our saga on randomized algorithms: there, we were allowed to control the randomness because we wanted to look at the worst-case scenario. Here, we care more about how the elements are added, not about chance itself.

Uniformly random hash function

Again, hash functions are deterministic. However, consider a function $f: U \to \{1, ..., n\}$, where $U$ is the universe and $\{1, ..., n\}$ are the hash buckets. We can randomly assign each element of $U$ to a bucket to generate a hash function.

In fact, this is our first hash function: for every possible input in $U$, just assign it to a bucket in $\{1, ..., n\}$ uniformly at random.
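A sketch of what sampling a uniformly random hash function would look like, assuming we store it as an explicit table with one entry per universe element; the storage cost is exactly the problem discussed below.

```python
import random

def sample_uniform_hash(universe, num_buckets):
    """Sample one uniformly random hash function: a full table mapping
    every element of the universe to an independent uniform bucket."""
    return {u: random.randrange(num_buckets) for u in universe}

# Only feasible for a tiny universe; a real universe (e.g. all 64-bit keys)
# would require storing one entry per possible key.
universe = range(1000)
h = sample_uniform_hash(universe, num_buckets=10)
print(h[42])  # the bucket assigned to key 42
```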

We can analyze this using the framework above. Let’s look at an arbitrary input $u_i$. In bucket $h(u_i)$, the input $u_i$ itself contributes one element (duh). When we hash all other values, each $u_j$ has a $1/n$ chance of hashing into the same bin. So we get

$$\begin{align}
E[M(h(u_i))] &= \sum_{j=1}^{n} P(h(u_j) = h(u_i)) \\
&= 1 + \sum_{j \neq i} P(h(u_j) = h(u_i)) \\
&= 1 + \sum_{j \neq i} \frac{1}{n} \\
&= 1 + \frac{n-1}{n} \leq 2
\end{align}$$

Because this is a constant that doesn’t depend on $n$, the uniformly random hash function is universal!
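A quick empirical sanity check of the bound above (purely illustrative): the average number of keys landing in the same bucket as a fixed key should come out just under 2.

```python
import random

def sample_uniform_hash(universe, num_buckets):
    return {u: random.randrange(num_buckets) for u in universe}

n = 100                      # n keys hashed into n buckets
keys = list(range(n))
trials = 2000
total = 0
for _ in range(trials):
    h = sample_uniform_hash(keys, n)
    # Count how many keys land in the same bucket as key 0 (including key 0).
    total += sum(1 for k in keys if h[k] == h[0])

print(total / trials)        # close to 1 + (n-1)/n, i.e. just under 2
```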

The problem

So the uniformly random hash function is nice, but it has one huge problem: to sample a hash function from this family, we need to write down the mapping for every element of $U$, which takes an intractable amount of space and time.

Hash families

There are a huge number of possible hash functions. In fact, there are $n^M$ different functions from the universe to the buckets. But can we pick a smaller family of hash functions such that the expectation bound still holds? Families that have this property are called universal hash families.

Again, we want to be careful here. Underpinning everything is the idea that the adversary picks the keys before you randomly choose the hash function. So the expectation is taken under adversarial choice of keys. As such, you can’t just pick some arbitrary subset of hash functions and hope it works.

Universal hash families: proving universality

We want to show that $P(h(i) = h(j)) \leq \frac{1}{n}$ for any $i \neq j$, where $n$ is the number of hash buckets. This is derived from the indicator variable $X_{ij}$. In essence, we want to show that, regardless of how the adversary picks the keys (hence the “any $i, j$”), the chance of a collision stays upper bounded by $1/n$.

Here are the key steps:

  1. Adversary knows everything
  1. Adversary picks the numbers
  1. We pick the random hash function and compute the chance of collision

In general, if there is a flaw in the hash family, you might be able to produce a counterexample pair $A, B$ that hashes to the same bin no matter what function parameters you end up picking. This $\{A, B\}$ can become the adversary’s new key set, which means that $P(h(A) = h(B)) = 1$, which is much higher than $1/n$.

Note: we insist on a strict $\frac{1}{n}$ bound not because it’s strictly necessary; in fact, any $k/n$ for a constant $k$ yields $O(1)$ lookup time. It’s just that a uniformly random hash function yields $1/n$, and for any other hash family, it would be nice to show that it achieves the same thing.
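One way to make this proof obligation concrete for tiny universes is a brute-force checker (`is_universal` is a hypothetical helper written for illustration): enumerate every function in the family and every key pair, and verify the empirical collision probability never exceeds $1/n$.

```python
from itertools import combinations

def is_universal(family, universe, num_buckets):
    """family: list of hash functions (callables) mapping universe -> buckets.
    Returns True if, for every pair of distinct keys, the fraction of
    functions on which they collide is at most 1/num_buckets."""
    for k, l in combinations(universe, 2):
        collisions = sum(1 for h in family if h(k) == h(l))
        if collisions / len(family) > 1 / num_buckets:
            return False
    return True

# Example: the "last digit" family with a single function is NOT universal,
# because keys like 0 and 10 always collide (probability 1 > 1/10).
single_function_family = [lambda k: k % 10]
print(is_universal(single_function_family, range(100), 10))  # False
```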

Tractable universal hash families

The key problem is as follows: how can we build a hash family that scrambles everything up so uniformly that the collision probabilities match those of a uniformly random function, without having to manually assign each mapping?

Prime hashing

Let $p$ be a prime number, and for two distinct keys $k \neq l$ let

$$r = (ak + b) \bmod p, \qquad s = (al + b) \bmod p$$

where $a \in \{1, ..., p - 1\}$, $b \in \{0, ..., p - 1\}$, and $p$ is larger than the size of the universe from which $k, l$ are chosen. Now, given some $a, b$, we know that $k, l$ map to distinct $r, s$. This is because

$$r - s \equiv a(k - l) \pmod{p}$$

Neither factor on the right-hand side is 0 in mod-$p$ space (since $a \neq 0$, $k \neq l$, and $p$ is prime), so the product must also not be zero. This is why we say that $r \neq s$. The key upshot is that we have created a bijective mapping in mod-$p$ space.

However, this also means that we are able to solve for $a, b$ as follows:

$$a \equiv (r - s)(k - l)^{-1} \pmod{p}, \qquad b \equiv r - ak \pmod{p}$$

This means that, given some $k, l$, each $(a, b)$ has a one-to-one correspondence with a pair $(r, s)$ of distinct values. As such, picking $(a, b)$ uniformly at random is the same as picking a uniformly random pair of distinct $(r, s)$, and we can compute the probability of a collision by counting pairs $(r, s)$.

Now, the final hash value takes another $\bmod\ n$, where $n$ is the size of the hash table: $h_{a,b}(k) = r \bmod n$. Intuitively, it’s like ‘coiling’ this new number mapping up into a circle. We want the probability that $r \equiv s \pmod n$. Given any choice of $r$, there are at most $(p-1)/n$ choices of $s \neq r$ that are equivalent to $r$ mod $n$ (just think about it for a second). And because we’ve established that $(r, s)$ is determined bijectively by $(a, b)$, we can calculate the chance of a collision in terms of $r$ and $s$, as worked out below.
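Filling in that count (a worked step consistent with the argument above): the pair $(r, s)$ is uniform over the $p(p-1)$ ordered pairs of distinct values in $\{0, ..., p-1\}$, and for each of the $p$ choices of $r$ there are at most $(p-1)/n$ valid choices of $s$, so

$$P(r \equiv s \pmod{n}) \leq \frac{p \cdot \frac{p-1}{n}}{p(p-1)} = \frac{1}{n}$$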

Because the collision probability is at most $1/n$, we conclude that this is a universal hash family, as desired.
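A minimal sketch of this family in Python (the specific prime $p = 101$, bucket count, and helper name `make_prime_hash` are illustrative choices, not from the notes), including a brute-force check that the collision probability stays at most $1/n$ on a small universe.

```python
import random
from itertools import combinations

def make_prime_hash(a, b, p, n):
    """Return h_{a,b}(k) = ((a*k + b) mod p) mod n."""
    return lambda k: ((a * k + b) % p) % n

p = 101          # a prime larger than the (toy) universe {0, ..., p-1}
n = 10           # number of hash buckets

# Sample one function from the family at random.
a = random.randrange(1, p)   # a in {1, ..., p-1}
b = random.randrange(0, p)   # b in {0, ..., p-1}
h = make_prime_hash(a, b, p, n)
print(h(42))                 # the bucket for key 42 under this sampled function

# Brute-force check of universality over the whole family (small p only).
family = [make_prime_hash(a, b, p, n) for a in range(1, p) for b in range(p)]
worst = max(
    sum(1 for h in family if h(k) == h(l)) / len(family)
    for k, l in combinations(range(20), 2)   # a subset of key pairs, for speed
)
print(worst, "<=", 1 / n)    # the worst observed collision probability is <= 1/n
```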

Challenge questions

The answer is $1/|H|$ because there is at least one hash function that does this, due to the pigeonhole principle.