Select Kth lowest

Tags: Algorithms, CS161

These sections allow us to look at algorithms and exercise our analytic powers in the process

The problem

Problem: find the $k$th smallest number in an array $A$. This is a really neat problem, but it is also a good exercise in recurrence relations.

Naive solutions

One idea

  1. Find the smallest element.
  2. Remove it from the array.
  3. Repeat $k$ times; the last element found is the answer (see the sketch below).

This has complexity $O(kn)$.
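A minimal sketch of this idea in Python (the function name `select_by_repeated_min` is ours, not the notes'):

```python
def select_by_repeated_min(A, k):
    """Find the k-th smallest element (1-indexed) by removing the minimum k - 1 times: O(kn)."""
    A = list(A)           # work on a copy so the caller's array is untouched
    for _ in range(k - 1):
        A.remove(min(A))  # each min + remove pass costs O(n)
    return min(A)
```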

Second Naive solution

As another naive solution, we can sort the array and take the $k$th element, which has $O(n \log n)$ complexity. Can we do better?
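In Python this is a one-liner (again a sketch, with a hypothetical name and 1-indexed $k$):

```python
def select_by_sorting(A, k):
    """Sort the array and index into it: O(n log n)."""
    return sorted(A)[k - 1]
```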

The lower bound

We know that the problem requires $\Omega(n)$ time, because we need to consider every element in the array at least once.

Moving towards the solution

What if we could discard half the array at a time? We can actually do this if we choose a good pivot and then split the array into elements below or equal to the pivot, and elements above the pivot. This partitioning step has $O(n)$ complexity, so our recurrence relation is

$$T(n) = T(n/2) + O(n)$$

which means that we can upper bound it by

$$T(n) \leq T(n/2) + c'n$$

and by the master theorem, we get that $T(n) = O(n)$. Nice!
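Even without the master theorem, unrolling this recurrence shows why the total work stays linear: the per-level costs form a geometric series.

$$T(n) \leq c'n + c'\frac{n}{2} + c'\frac{n}{4} + \dots + O(1) \leq c'n \sum_{i=0}^{\infty} \frac{1}{2^i} + O(1) = 2c'n + O(1) = O(n)$$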

The algorithm

The algorithm, Select(A, n, k), picks a pivot with ChoosePivot(), partitions the array around it, and then recurses into whichever side must contain the $k$-th smallest element. We subtract 1 from $k$ in the last case because we exclude the lower half and the pivot from our search.
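As a concrete reference, here is a minimal Python sketch of the selection routine. It is a sketch rather than the notes' exact pseudocode: the signature `select(A, k, choose_pivot)` and the strict less-than/greater-than split (which assumes distinct elements) are our choices.

```python
def select(A, k, choose_pivot):
    """Return the k-th smallest element of A (1-indexed); assumes distinct elements."""
    if len(A) == 1:
        return A[0]
    p = choose_pivot(A)
    below = [x for x in A if x < p]   # A_<: elements smaller than the pivot
    above = [x for x in A if x > p]   # A_>: elements larger than the pivot
    if k <= len(below):
        return select(below, k, choose_pivot)    # the answer lies below the pivot
    elif k == len(below) + 1:
        return p                                 # the pivot itself is the answer
    else:
        # exclude the lower half and the pivot, hence the "- 1"
        return select(above, k - len(below) - 1, choose_pivot)
```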

The pivot

The hardest part is actually finding a good pivot-selection routine, ChoosePivot().

The best case scenario

Claim: if the pivot is always chosen to be the median, then the algorithm runs in $O(n)$ time.

Proof: We remove half of the elements each time, so $T(n) \leq T(n/2) + cn$, which we know from the master theorem yields $O(n)$.

This is what we alluded to when we first set up the problem. However, finding the median is equivalent to calling Select(A, n, n / 2), which causes infinite recursion. So, as cool as this sounds, we can't do it directly.

The worst case scenario

Claim: if the pivot is always the minimum or maximum element, then the function runs in $\Theta(n^2)$ time.

Proof: We use a recurrence relation: $T(n) = T(n-1) + \Theta(n)$, because the number of elements left to process decreases by 1 each call. By rolling it out, we can say that

$$T(n) \leq c_1 n + c_1(n-1) + \dots + c_1 = c_1 \frac{n(n+1)}{2}$$

With a different constant $c_2$, we can show that the opposite inequality holds as well. So we conclude that $T(n) = \Theta(n^2)$.

Random selection

Why don't we choose the pivot randomly? It's actually a good idea! The expected runtime is $O(n)$, and in most pivot-based algorithms, like QuickSort, this is what we end up doing because it's the easiest option. But for intellectual stimulation, let's look at something a bit more complicated.
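For reference, a random ChoosePivot is a one-liner that plugs straight into the select sketch above:

```python
import random

def choose_pivot_random(A):
    """Pick a uniformly random element as the pivot (expected O(n) selection overall)."""
    return random.choice(A)

# usage: select(A, k, choose_pivot_random)
```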

Median of Medians

What about using something that is "close enough"? Let's introduce the Median of Medians algorithm, which is sketched below.

The idea is simple

  1. Make groups of 5.
  2. Sort each group individually. Sorting one group takes $O(1)$ because it has at most 5 elements, so sorting all of the groups takes $O(n)$, because the number of groups depends on $n$.
  3. Make a set of all the medians of the groups.
  4. Find the median of this set of medians recursively (see the sketch after this list).
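Here is a minimal Python sketch of that pivot rule, written against the `select` sketch from earlier (the helper name `choose_pivot_mom` is ours):

```python
def choose_pivot_mom(A):
    """Median-of-medians pivot: guarantees a roughly 30/70-or-better split."""
    if len(A) <= 5:
        return sorted(A)[len(A) // 2]                   # small arrays: brute-force median
    groups = [A[i:i + 5] for i in range(0, len(A), 5)]  # 1. make groups of 5
    medians = [sorted(g)[len(g) // 2] for g in groups]  # 2-3. sort each group, take its median
    # 4. find the median of the medians by recursively calling select itself
    return select(medians, (len(medians) + 1) // 2, choose_pivot_mom)

# usage: select(A, k, choose_pivot_mom)
```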

Bounds of Median of Medians

We can actually prove the following about this pivot: $|A_{<}| \leq 7n/10 + 5$ and $|A_{>}| \leq 7n/10 + 5$, where $A_{<}$ and $A_{>}$ are the sets of elements less than and greater than the pivot. Here's the proof sketch.

  1. Let $p$ be our pivot. It is greater than $\lceil g/2 \rceil - 1$ of the group medians, where $g$ is the number of groups. For each group whose median $p$ beat (except possibly the remainder group), $p$ implicitly also beats the two elements below that median, so $p$ beats at least 3 of the numbers in each such group. Therefore, $p$ is greater than at least $3(\lceil g / 2 \rceil - 2)$ elements.
  2. As such, the number of elements greater than $p$ is upper-bounded by $(n - 1) - 3(\lceil g / 2\rceil - 2)$, which when simplified becomes $7n / 10 + 5$ (the arithmetic is spelled out below). Neat!

By symmetry, we can argue the other way around and show that the same property holds for the number of elements less than $p$.
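Spelling out the arithmetic in step 2, with $g = \lceil n/5 \rceil$ groups:

$$|A_{>}| \leq (n - 1) - 3\left(\left\lceil \frac{g}{2} \right\rceil - 2\right) \leq (n - 1) - 3\left(\frac{n}{10} - 2\right) = \frac{7n}{10} + 5$$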

Analyzing the complexity

So at every step we spend roughly $T(n/5)$ to choose the pivot, plus $O(n)$ work to form and sort the groups and to partition, and we are then bounded by $T(7n/10 + 5)$ for the recursion into the surviving side. When $n \leq 5$, we can say that this runs in constant time (i.e., we can add a base case to the above algorithm so that, at small enough array sizes, we just brute-force sort and index to find the $k$-th smallest). This yields the following recurrence:

$$T(n) \leq T(n/5) + T(7n/10 + 5) + cn$$

Inductive substitution method

We want to prove that it runs in linear time, so let's postulate that $T(k) \leq dk$ for some constant $d$ and all $k > k_0$.

This becomes our inductive hypothesis, and we are ready to induct. Note how we use the strong inductive hypothesis, applied at both $n/5$ and $7n/10 + 5$, to drive the computation forward.
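A sketch of that computation, dropping the additive $+5$ (which can be absorbed into the base case once $n$ is above a fixed threshold):

$$T(n) \leq T\left(\frac{n}{5}\right) + T\left(\frac{7n}{10}\right) + cn \leq d\,\frac{n}{5} + d\,\frac{7n}{10} + cn = \left(\frac{9d}{10} + c\right)n \leq dn \quad \text{whenever } d \geq 10c$$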

At this point, you show that such a $d$ exists, which is sufficient. For the formal proof, just go back, let $d = 10c$, and verify that the inductive step goes through.