Bucket sort


Sorting paradigms

When we think of sorting, we must ask ourselves, “what is the input, the output, and the operations allowed?” Depending on these answers, we might have different limits on how fast we can go.

Bucket sorting

But it is true that we can get a better complexity if we change the assumptions up a bit. Now, what if we are dealing with “nice” integer numbers from $1, \dots, k$?

Counting sort

  1. Make an array of size $k$
  1. Go through the list and tally up the presence of each number in the appropriate array index
  1. At the end, read off the array of size $k$ and, for each element $k[i]$, output $i$ with a repetition of $k[i]$.
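The steps above can be sketched in Python (a minimal sketch, assuming integer inputs in the range $1, \dots, k$):

```python
def counting_sort(nums, k):
    """Counting sort for integers assumed to lie in 1..k."""
    counts = [0] * (k + 1)           # index 0 unused; one tally slot per value
    for x in nums:
        counts[x] += 1               # step 2: tally each number
    out = []
    for value in range(1, k + 1):    # step 3: read off the tallies in order
        out.extend([value] * counts[value])
    return out

print(counting_sort([3, 1, 4, 1, 5], k=5))  # → [1, 1, 3, 4, 5]
```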

This is a really micro brain sort, but it actually works in $O(n)$ time! However, there is a pretty big downside. If $k$ is very large, you will have problems. And it is typical that $k$ is larger than the list size $n$, lest you repeat values.

A small catch, though. If the values are bounded by $[1, f(n)]$, then we need to make $f(n)$ groups, so our complexity is $O(f(n))$ if $f(n) > n$.

Radix sort

This is a modification to counting sort that actually works decently.

  1. Run bucket sort using the least significant digit
  1. Recombine into a new list
  1. Repeat step 1 using the second least significant digit, third least significant, etc. When you get to the end, the recombined list is the sorted list
    1. If you run past the end of some number (i.e. you are sorting 100, 2, and 45), just pad with zeros
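Here's one way the loop above might look in Python (a sketch, assuming non-negative integers; the digit extraction `(x // place) % r` naturally supplies the zero-padding for short numbers):

```python
def radix_sort(nums, r=10):
    """LSD radix sort: repeated stable bucket passes, least significant digit first.
    Sketch assuming non-negative integers; r is the bucketing base."""
    if not nums:
        return []
    result = list(nums)
    place = 1                         # place value (r**digit) of the current pass
    max_val = max(result)
    while place <= max_val:
        buckets = [[] for _ in range(r)]
        for x in result:
            buckets[(x // place) % r].append(x)   # short numbers behave as zero-padded
        result = [x for b in buckets for x in b]  # stable recombination
        place *= r
    return result

print(radix_sort([100, 2, 45]))  # → [2, 45, 100]
```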

Radix sort proof that it works

Here’s a proof by induction on a list of numbers written digit-wise as $x = [x_{i-1}, \dots, x_1]$, where each $x_k$ is a digit. Our inductive hypothesis is that after step $k-1$, the elements are sorted hierarchically on the digits $x_1, \dots, x_{k-1}$. This means that if $x_{k-1} < y_{k-1}$, then $x$ comes before $y$. If these digits are equal, then we break the tie by looking at $x_{k-2}$, $x_{k-3}$, and so on. Intuitively, this ordering comes about from the stable concatenation operation and the least-significant-first sorting order.

Radix sort complexity analysis

So let’s start simple. If you have $d$-digit numbers, the complexity is $O(dn)$, which is $O(n)$ for constant $d$. Hey, this is cool. But wait...

If the list of size $n$ has numbers from $1 \to n$, then $d \approx \log n$. And suddenly our complexity becomes $O(n \log n)$. Oops...

But wait! We can change the base! We calculate $d$ by doing $d = \lfloor \log_r n \rfloor + 1$, where $r$ is the base that we bucket in. The larger the $r$, the more buckets but the fewer digits to sort. For some intuition about the definition of $d$, think of $r = 10$ and $n = 9$. The log is below 1, but we do have one digit. Therefore, we take the floor and add 1.
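A quick sanity check of the digit-count formula (with a hypothetical helper `digit_count`; the examples avoid exact powers of the base, where floating-point `log` can be off by an ulp):

```python
import math

def digit_count(n, r):
    """Number of base-r digits of n, i.e. floor(log_r n) + 1, for n >= 1."""
    return math.floor(math.log(n, r)) + 1

print(digit_count(9, 10))   # log10(9) < 1, floor gives 0, plus 1 → one digit
print(digit_count(45, 10))  # log10(45) ≈ 1.65, floor gives 1, plus 1 → two digits
```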

Let’s put this together. Let $M$ be the maximum number, and $r$ be the base. In each pass, you need to look across $n$ numbers and initialize $r$ bins. This yields

$O(d(n+r)) = O\big((\lfloor \log_r M \rfloor + 1)(n+r)\big)$

How do we choose $r$? Well, we can let $r = n$, and now we get

$O\big(n(\lfloor \log_n M \rfloor + 1)\big)$

We pick $r = n$ because we want as large an $r$ as possible, but if we make $r$ any larger than a scalar multiple of $n$, then it will contribute a polynomial factor to the sort, which is not good.

If $M \leq n^c$ for some finite constant $c$, then we have $O(n)$ complexity. However, if $M = 2^n$, then we have $O(n^2 / \log n)$, which is pretty bad.
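For the $M = 2^n$ case, the arithmetic goes:

```latex
O\big(n(\lfloor \log_n M \rfloor + 1)\big)
  = O\left(n\left(\frac{\log 2^n}{\log n} + 1\right)\right)
  = O\left(n \cdot \frac{n}{\log n}\right)
  = O\left(\frac{n^2}{\log n}\right)
```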

So at the end of the day, radix sort can beat comparison sort, but in the worst case it does get pretty bad.

(As a sidenote, we don’t drop the $+1$ because it still is related to $n$ when you distribute, and to $r$ in the original expression. It’s important to be careful of these things!)

Challenge questions involving lower bounds

If this algorithm existed, then you could do a divide-and-conquer sort such that

$T(n) = \frac{n}{\log n}\,T(\log n) + O(n)$

If we just roll this out, we can kinda see that we add $O(n)$ a total of $k$ times, where $k$ is the number of times you need to apply a logarithm until you reach $1$. This we know as $\log^* n$. However, it means that we can sort in $O(n \log^* n)$ time, which is impossible using comparison-based sorting.
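The iterated logarithm is easy to compute directly (a small Python illustration; `log_star` is a hypothetical helper name):

```python
import math

def log_star(n):
    """Iterated logarithm: how many times we apply log2 before the value drops to <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

print(log_star(65536))  # 65536 → 16 → 4 → 2 → 1, so 4 applications
```

Because $\log^* n$ grows this slowly, $O(n \log^* n)$ is essentially linear in practice.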

Why is this impossible? Well, if we can do all three things, we can just add an unsorted list in $O(n)$ time, and then pluck out the smallest objects in $O(n)$ time, yielding an $O(n)$-complexity sort that would violate our previously proven lower bound of $\Omega(n \log n)$.