Fast Graph Partitioning | Alan Heirich
A fast Fiedler vector method for spectral partitioning

Alan Heirich, May 7, 2021

In the Perception stage of the May autonomy pipeline we need to partition an affinity graph of LIDAR points. Points with high affinity should be grouped together into individual objects. Points with low affinity should be separated into different objects. We do this by recursively partitioning the affinity graph and computing a minimum cut at each step of the recursion. You can read about minimum graph cuts here:

Minimum cut

One method of computing minimum cuts is to compute the Fielder vector of the graph. You can read about Fiedler vectors here:
Algebraic connectivity

In brief, the Fielder vector of a graph G is a particular eigenvector of L(G) where L(G) is the Laplacian matrix of G. The Laplacian matrix is a simple transformation of the adjacency matrix. You can read about Laplacian matrices here:

Laplacian matrix

Specifically the Fiedler vector is the eigenvector that corresponds to the smallest (in magnitude) nonzero eigenvalue of L(G).
A standard method of computing eigenpairs is the Lanczos algorithm (pronounced Lants-sosh) which you can read about here:
Lanczos algorithm

We tried using the Lanczos algorithm implemented in the EVSL library ( On our matrices of size 5000x5000 partitioning one step using CUDA took many milliseconds. This is much too slow for real time applications. Profiling showed that most of the time was spent in one CUDA kernel running for 3 microseconds per instance. We concluded that the EVSL library is intended for much larger matrices where the execution time can effectively amortize the overheads of CUDA kernel invocation. We also believe that other implementations of the Lanczos algorithm would have similar performance issues so a different library would not be likely to give us a real-time method.

The solution we found was to implement a fast Fiedler vector method based mainly on Power Iteration. You can read about Power Iteration here:
Power iteration

Power Iteration is based on matrix-vector multiplication which executes very efficiently in CUDA. In addition to Power Iteration we need Shift-Invert and Matrix Deflation. You can read about Shift-Invert here: (
You can read about Matrix Deflation here: (

We start by constructing the Laplacian Matrix of the affinity graph L(G). L(G) is symmetric-positive-definite which means its eigenvalue spectrum ranges from 0 to the l2-norm of L(G). See ( Call this value L2.

We want to find the smallest (in magnitude) nonzero eigenpair of L(G). The Power Iteration finds the largest, not the smallest, eigenpairs. At first blush it seems that the Power Iteration cannot solve our problem. Our solution will be to construct a series of matrices L, L', and L'' which each has a different eigenvalue spectrum. We will apply the Power Iteration to L''(G) and correct the result to get the desired eigenpair of L(G).

To construct L'(G) we shift the spectrum of L(G) to the left by L2 so that the largest eigenvalue of L(G) becomes zero and the smallest (zero) eigenvalue of L(G) becomes -L2. This is accomplished by Shift-Invert, which amounts to subtracting L2 from the diagonal of L(G) to yield a new matrix L'(G). The eigenvalue spectrum of L'(G) is from -L2 to 0.

The Fiedler vector requires computing the smallest (in magnitude) nonzero eigenpair of L(G). The smallest eigenvalue of L(G) is zero whenever G is a connected graph. This has the corresponding eigenvector of 1^n. We need to remove the corresponding eigenpair from L'(G) in order to use a Power Iteration. In L'(G) this eigenpair has eigenvalue -L2. We use Matrix Deflation to eliminate this eigenpair from L'(G). Specifically we compute L''(G) = L'(G) - (lambda) qT q where lambda = -L2 and q is a vector of 1s. This yields a new matrix L''(G).

We can now use Power Iteration to converge to the extremal eigenpair of L''(G) which corresponds to the smallest nonzero eigenpair of L(G). Since eigenvalues of L''(G) are negative we have to perform two iterations to get a positive result. We have to add L2 to the resulting eigenvalue of L''(G) to get the corresponding eigenvalue of L(G).

We experimented with sparse matrix libraries for both the CPU (Eigen) and GPU (cuSparse). We found that matrix-vector multiplications with Eigen took 16 microseconds and with cuSparse took 3 microseconds. Since 3 microseconds is roughly the CUDA kernel invocation overhead we conclude that our problem is taking almost no compute time in cuSparse. The signs of the eigenvector components converged within two double-iterations. Since we only care about the signs of the eigenvector components for matrix partitioning, and not their magnitudes, we can stop the iteration at this point because the signs of the eigenvectors define a minimum graph cut. The algorithm for a single partition runs in under 20 microseconds in our tests.

With thanks to professor Alistair Spence, Math Department, University of Bath, England and Sajan Patel for helpful discussions.

Conclusion: we defined an efficient algorithm to compute a minimum cut of a connected graph based mainly on sparse matrix-vector multiplication. We found this algorithm ran orders of magnitude faster on a GPU than a comparable Lanczos iteration.