Strongly Connected Components

TagsCS161Graphs

Strongly connected components

When we are dealing with undirected graphs, anything we can reach with DFS or BFS forms a connected component. In other words, in this found tree, you can reach anything from anywhere.

However, in a directed graph, you form a connected component with DFS / BFS, but it is NOT the case that you can go from anything to anywhere. This is because there are one-directional roads!

A directed graph is strongly connected if each pair of vertices has a path between them. A directed graph is weakly connected if when you remove the directionality, the graph is connected.

Technically speaking, the connected components must be maximal. So in the picture below, the four vertices are strongly connected as one, not as two components.

Applications

SCC’s (strongly connected components) are important because that’s a lot like how our internet works. We have a lot of websites that point at each other. Can we find a set of websites such that we can always reach all the other websites by clicking through websites in that set?

Finding SCCs the naive ways

  1. we could just decompose the graph in all the different ways and check if they are strongly connected. This is kinda stupid.
  1. We could take a connected graph from uu and run DFS a bunch from each vv in that subgraph. This is better, but it’s still O(n2)O(n^2). Can we do better?

Kosaraju’s algorithm

  1. Do DFS to make a DFS forest
  1. Reverse all the edges in the graph
  1. Do DFS again to make another DFS forest.
    1. Start with the vertices with the largest finish time
    1. The SCCs are the different trees in the second DFS forest.

The intuition, part 1

When you flip the edges in a subgraph that is strongly connected, it is still strongly connected. This is because between v,wv, w there exists two paths, p1,p2p_1, p_2. When we invert all the edges, the two paths essentially invert. The forward path now becomes the backwards path. So, essentially, we can find the SCCs on the inverted graph. But how does that help us?

Ok, well that’s interesting. But how does that help us? Well, let’s make an SCC graph.

The intuition, part 2

This SCC graph has nodes that are strongly connected components (i.e. an aggregate of nodes) and edges between them. This graph must be a DAG, or else we would be able to collapse the nodes into each other (any cycle means a path!).

Before you inverted the edges, the node with the latest finish time is the node that goes first in the topological sort, meaning that it has no arrows going into it. When you invert the edges, now there is no arrows going from it. Effectively, it “traps” the SCC such that your DFS is limited only to this scope (because again, flipping edges doesn’t change SCC).

And this keeps on going. If you invert the edge, you are essentially inverting the topological sort so the next node you do DFS search in will also be “trapped”.

This is a really neat trick!

The formalism

When we prove this algorithm, we want to show that by starting from the highest finishing time, we will be within a strongly connected component and be trapped in this strongly connected component once the edges flip.

To do this, we just want to show that in the original graph, if we start traversing based on the largest finish time, we would always encounter nodes that have all arrows pointing away. To help us with the proof, we are actually going the other direction. We show that all ABA→B has AA finishing after BB. We are still using our idea of SCC graphs, in which each node is a strongly connected component. Our A,BA, B are both SCCs.

First, we will show that if ABA → B in the original graph, then the finishing time of AA is after the finishing time of BB. To show this ,we consider two cases

  1. We reached AA before BB in our DFS. Let’s say that yy was discovered last in BB, meaning that B.finish=y.finishB.finish = y.finish. Let’s say that zz was discovered first in AA, meaning that A.finishz.finishA.finish \geq z.finish. Then, we know that yy is a descendant of zz in the DFS forest. This mean that zz finishes after yy, which means that A.finish>B.finishA.finish > B.finish as desired.
  1. We reached BB before AA in our first DFS. There are no paths from BB to AA, so we must have explored AA after we restart the DFS.

So what’s the upshot? If ABA → B, then the finish time of the SCC node AA will always be after the SCC node BB. Again, note that we are not dealing with individual nodes because that we have already proven.

Therefore, in the reversed graph, we have the child finishing after the parent (in the original scheme). Therefore, the largest finish time would have all arrows leading in (if there were an arrow leading out, then that arrow would lead to the larger finishing time). And this “traps” it.

Induction

We will finally prove that this method works by induction. Let’s say that the first tt strongly connected components have been extracted already.

Now, suppose that you get the next tree with root xx, which lives in some SCC AA. Because xx has the largest finish time, we know that all the arrows point to the SCC that xx is from.

Therefore, there are no edges leaving AA to the remaining SCC’s, so a DFS started at xx recovers exactly AA, which completes the induction to make the t+1t + 1st tree.

Example uses

Largest “reach”

A “reach” is defined as the number of vertices reachable from one vertex. The naive solution is to run BFS on each vertex, but you can do better if you find all the SCCs (because all vertices are reachable within an SCC). Then, you order the components in topological order, and then you see, in reverse topological order, how many SCCs each component is connected to. Return any vertex in the component that has the most connections.