CUPLAS
Fast 3DGS Compression via the Parallel Linear Assignment Algorithm
[Figure: Runtime scaling comparison, CUPLAS vs PLAS]
The PLAS Algorithm
The PLAS algorithm (short for Parallel Linear AsSignment algorithm) was developed by the Vision & Imaging Technologies department at the Fraunhofer Heinrich Hertz Institute in Berlin as part of a method for compact 3D Gaussian Splatting (3DGS) scene representation called Self-Organizing Gaussian Grids.
The problem tackled by PLAS is to arrange a quadratic number $n = W^2$ of points $x_1, \dots, x_n \in \mathbb{R}^C$ with $C$ features each into a $W \times W$ grid which is locally as smooth as possible. (Think of $C$ on the order of a few to a couple dozen features for the case of 3DGS scenes.) We can formalize this desideratum as looking for a bijective map $\pi \colon \{1, \dots, W\}^2 \to \{1, \dots, n\}$ minimizing, for example, the average neighbor distance (AND)

$$\mathrm{AND}(\pi) = \frac{1}{|\mathcal{N}|} \sum_{(p, q) \in \mathcal{N}} \lVert x_{\pi(p)} - x_{\pi(q)} \rVert_2,$$

where $\mathcal{N}$ denotes the set of neighbor relationships, for example the edges shared by pixels. Of course, it is rather nonsensical to look for an efficient and exact algorithm, as (the corresponding decision problem of) this combinatorial optimization problem is NP-complete: Membership in NP is trivial, as a solution can easily be encoded in a certificate of size linear in the input, and it can be verified in linear time. NP-hardness can be seen by reducing from the Euclidean Hamiltonian Path problem. (See box below for a concrete construction.)
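To make the objective concrete, here is a minimal NumPy sketch (the function name is mine, not from the PLAS code) that evaluates the AND of a grid arrangement over the 4-neighborhood edges:

```python
import numpy as np

def average_neighbor_distance(grid: np.ndarray) -> float:
    """Mean Euclidean distance across all horizontally and
    vertically adjacent pixel pairs of a (W, W, C) feature grid."""
    # Horizontal edges: each pixel vs. its right neighbor.
    horiz = np.linalg.norm(grid[:, 1:] - grid[:, :-1], axis=-1)
    # Vertical edges: each pixel vs. the pixel below it.
    vert = np.linalg.norm(grid[1:, :] - grid[:-1, :], axis=-1)
    return (horiz.sum() + vert.sum()) / (horiz.size + vert.size)

# Example: a random 64x64 grid with 8 features per point.
rng = np.random.default_rng(0)
print(average_neighbor_distance(rng.standard_normal((64, 64, 8))))
```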
PLAS is a heuristic for this NP-complete problem designed to leverage parallelized computation in order to quickly obtain a best-effort solution. The basic idea is to start with a fully random grid assignment and then iteratively improve the solution by doing local swaps. The main loop works roughly like this:
- Blur the current grid to obtain a target grid. Initially choose a large blur radius covering almost the whole grid, and then decrease it geometrically every main loop iteration until hitting a minimum blur radius upon which the main loop stops.
- Partition the grid into square blocks where the width is the greatest even number not greater than twice the blur radius.
- For each square block, randomly partition in parallel all pixels of the block into groups of 4.
- For each of these groups, calculate in parallel for all $4! = 24$ permutations the sum of the (squared) pixel-wise distances to the target grid and then apply the permutation with the smallest such value. Repeat this step, each time with a different partition from step 3, for a fixed number of times or until the improvement in terms of the distance to the target grid is below a threshold (for at least a fixed number of iterations in a row).
One important detail on step 2 is that usually the grid cannot be tiled perfectly into blocks. Some parts will be left over and not touched in the subsequent swapping phase. To compensate for this, steps 2-4 are wrapped in another loop which samples for each iteration a random xy-offset in $\{0, \dots, b-1\}^2$, where $b$ is the current block size, and then applies it to all blocks. A nice side effect is that this opens up more opportunities for the pixels to cross block boundaries, if beneficial.
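For orientation, here is a compact, single-threaded Python sketch of this loop. The helper structure, parameter names, and the use of scipy.ndimage.gaussian_filter are my own choices for illustration; the real PLAS implementation is batched PyTorch and differs in many details:

```python
import numpy as np
from itertools import permutations
from scipy.ndimage import gaussian_filter

PERMS = list(permutations(range(4)))       # all 4! = 24 group orderings

def improve_blocks(grid, target, block, offset, rng):
    """One swap pass (steps 3-4): split each block's pixels into random
    groups of 4 and apply the best of the 24 permutations per group."""
    W = grid.shape[0]
    dy, dx = offset
    for y in range(dy, W - block + 1, block):
        for x in range(dx, W - block + 1, block):
            cells = [(y + i, x + j) for i in range(block) for j in range(block)]
            order = rng.permutation(len(cells))
            for g in range(len(cells) // 4):
                group = [cells[k] for k in order[4 * g : 4 * g + 4]]
                vals = np.stack([grid[c] for c in group])
                tgts = np.stack([target[c] for c in group])
                costs = [((vals[list(p)] - tgts) ** 2).sum() for p in PERMS]
                best = PERMS[int(np.argmin(costs))]
                for c, v in zip(group, vals[list(best)]):
                    grid[c] = v
    return grid

def plas_sketch(grid, min_radius=1.0, decay=0.9, n_offsets=2, seed=0):
    """High-level structure of the PLAS main loop (illustrative only)."""
    rng = np.random.default_rng(seed)
    radius = grid.shape[0] / 2.0           # initial blur covers most of the grid
    while radius > min_radius:
        # Step 1: a blurred copy of the current grid serves as the target.
        target = gaussian_filter(grid, sigma=(radius, radius, 0))
        # Step 2: block width = greatest even number <= 2 * blur radius.
        block = max(2, int(2 * radius) // 2 * 2)
        for _ in range(n_offsets):         # re-tile with random offsets
            offset = tuple(rng.integers(0, block, size=2))
            grid = improve_blocks(grid, target, block, offset, rng)
        radius *= decay                    # shrink the blur geometrically
    return grid
```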
From PLAS to CUPLAS
When profiling the PLAS algorithm with Nsight, several performance killers reveal themselves. In the following we look at the three main ones and develop CUPLAS's mitigations.
Parallelized Filters with Large Kernel Sizes
It may come as a surprise, but one of the main scalability bottlenecks of PLAS is the plain blurring. The kernel sizes of the blurring steps in earlier iterations are on the order of the grid side length. To the best of my knowledge, all major libraries providing GPU-accelerated Gaussian or box filters either don't support arbitrarily large kernel sizes, or scale terribly due to poor parallelization or even algorithmic complexity.
When thinking about performance for large inputs on a single accelerator, it is usually best to first optimize for work complexity, and then in a second step think about parallelization and GPU-friendliness such as coalesced memory access and cache utilization, since the number of processing units in a single accelerator is rather limited.
The naive way of applying a filter (a.k.a. a convolution) to a grid scales in terms of work to be done as $O(W^2 K^2)$, where $W$ and $K$ are the side lengths of the grid and kernel, respectively. PLAS uses kornia.filters.gaussian_blur2d which realizes separable filters as two composed batched 1D convolutions applied to every row in the first pass, and every column in the second pass. This leverages that a 2D convolution with a separable kernel $G = g_v g_h^\top$ can be written as

$$G * X = g_v * (g_h^\top * X).$$
This reduces the work complexity to at most $O(W^2 K)$. In kornia each pass is realized via torch.nn.functional.conv2d by passing a $1 \times K$ and then a $K \times 1$ sized kernel. On CUDA, torch then internally dispatches this ultimately to the cuDNN convolution API which calls different algorithms based on constants, dtype, tensor layout, batch size, channel count, padding, and even the GPU architecture. For large $K$ it might leverage that in frequency space convolution is nothing but pointwise multiplication: Transforming both the row and the kernel with the discrete Fourier transform (DFT), multiplying the results pointwise, and then applying the inverse DFT onto the result is equivalent to a convolution and scales as $O(W \log W)$ per row. In total this then gives a work complexity of $O(W^2 \log W)$. However, depending on the hardware, cuDNN might decide that the extra DFT constants are not worth it and instead express the convolution as an implicit GEMM, leaving us with a work complexity of $O(W^2 K)$.
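As a quick illustration of the separability trick (plain PyTorch, not CUPLAS code), the following compares a full 2D Gaussian convolution against the two-pass 1D version:

```python
import torch
import torch.nn.functional as F

def gaussian_1d(k: int, sigma: float) -> torch.Tensor:
    """Normalized 1D Gaussian kernel of odd length k."""
    x = torch.arange(k, dtype=torch.float32) - (k - 1) / 2
    g = torch.exp(-(x ** 2) / (2 * sigma ** 2))
    return g / g.sum()

x = torch.randn(1, 1, 256, 256)   # one single-channel image
k, sigma = 31, 5.0
g = gaussian_1d(k, sigma)

# Full 2D convolution with the outer-product kernel: O(W^2 K^2) work.
full = F.conv2d(x, (g[:, None] @ g[None, :]).view(1, 1, k, k), padding=k // 2)

# Separable version: a 1xK pass followed by a Kx1 pass, O(W^2 K) work.
sep = F.conv2d(x, g.view(1, 1, 1, k), padding=(0, k // 2))
sep = F.conv2d(sep, g.view(1, 1, k, 1), padding=(k // 2, 0))

print(torch.allclose(full, sep, atol=1e-4))  # the two agree numerically
```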
For our use case we don't need general convolutions though. We only care about convolutions with Gaussian kernels. Now recall that the density of the sum of two independent random variables with densities $f$ and $g$ can be expressed as the convolution $f * g$. Further recall that by the central limit theorem

$$\frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} X_i \xrightarrow{d} \mathcal{N}(0, 1)$$

when $X_1, X_2, \dots$ are i.i.d. with a vanishing expectation and a variance of $\sigma^2$. This suggests that for the $n$-fold convolution $f^{*n}$ of a density $f$ with itself we have convergence

$$\sigma\sqrt{n}\, f^{*n}\!\left(\sigma\sqrt{n}\, x\right) \longrightarrow \varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$$

of some kind. Local limit theorems give conditions implying certain types of convergence. For example, boundedness of $f^{*n}$ for any single $n$ is already sufficient (and necessary) for uniform convergence [Petrov, 1975]. That means we can approximate our Gaussian density by convolving a properly scaled uniform density with itself a few times, say 3 times. The great advantage of uniform/box kernels is that applying them can be sped up by precomputing a prefix sum array: Each entry of the result is then merely a subtraction of two values of the prefix sum array, which in turn can be computed in $O(W)$ work per row. Overall, this approach yields a work complexity of $O(W^2)$ for applying an approximated Gaussian filter to the grid, independent of the blur radius.
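A minimal 1D NumPy sketch of this idea (the prefix-sum box filter and its triple iteration; function names and constants are mine, border handling follows the replication scheme described below):

```python
import numpy as np

def box_filter_1d(row: np.ndarray, r: int) -> np.ndarray:
    """Box filter of window 2r+1 via a prefix sum: O(W), independent of r.
    Borders are handled by replicating the edge values."""
    W = row.shape[0]
    padded = np.concatenate([np.full(r, row[0]), row, np.full(r, row[-1])])
    prefix = np.concatenate([[0.0], np.cumsum(padded)])
    # Each window sum is a difference of two prefix-sum entries.
    return (prefix[2 * r + 1:] - prefix[:W]) / (2 * r + 1)

def approx_gaussian_1d(row: np.ndarray, r: int, passes: int = 3) -> np.ndarray:
    """Iterated box filter: by the central limit theorem, a few
    passes already approximate a Gaussian blur quite well."""
    for _ in range(passes):
        row = box_filter_1d(row, r)
    return row
```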
From a high level, the implementation in CUDA looks like this. The interleaved input of shape $H \times W \times C$ is first reshaped into a channel-major $C \times H \times W$ layout by a small gather_channels_kernel.
We then loop over channels and, per channel, run the row-wise iterative box filter, transpose the plane, run the row-wise iterative box filter again — at this point we are effectively filtering columns of the original grid — and transpose back.
At the very end a scatter_channels_kernel reverses the initial reshape. Each row-wise box filter is itself a small loop of passes (3 in our case), where each pass first computes the inclusive prefix sum of the row, and then for each pixel forms a windowed difference of the prefix sum, with the borders handled by simple value replication.
Doing the column pass via a transpose, instead of fusing it into the same CUDA kernel, sounds wasteful at first, but it is well worth it: it lets the prefix sum stride along the fast axis of memory and keeps both the global memory access and the shared memory layout coalesced.
The interesting part is the prefix sum. Scanning a row of width $W$ sequentially is $O(W)$ in work but also $O(W)$ in depth; we want to keep the work linear and at the same time bring the depth down to $O(\log W)$. The standard work-efficient solution is the Blelloch scan, which interprets the row as the leaves of a balanced binary tree and walks over it twice: an up-sweep that builds partial sums into the internal nodes, followed by a down-sweep that turns these partial sums into prefix sums.
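For reference, here is the Blelloch scan written out sequentially in Python; each `for i` body is what one GPU thread does at the corresponding tree level. The classic formulation computes an exclusive scan; the inclusive variant used in the kernel follows by adding the input elements back in:

```python
def blelloch_exclusive_scan(a: list) -> list:
    """Work-efficient exclusive prefix sum (Blelloch scan)."""
    n = len(a)                      # assumed to be a power of two
    a = list(a)
    d = 1
    while d < n:                    # up-sweep: build partial sums
        for i in range(0, n, 2 * d):
            a[i + 2 * d - 1] += a[i + d - 1]
        d *= 2
    a[n - 1] = 0                    # clear the root
    d = n // 2
    while d >= 1:                   # down-sweep: distribute prefixes
        for i in range(0, n, 2 * d):
            t = a[i + d - 1]
            a[i + d - 1] = a[i + 2 * d - 1]
            a[i + 2 * d - 1] += t
        d //= 2
    return a

print(blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```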
Mapping this onto CUDA is fairly direct. We launch $H$ blocks, one per row, with $W/2$ threads each, where every thread takes care of two elements of its row, indexed ai = threadIdx.x and bi = threadIdx.x + W/2. The whole row lives in shared memory throughout the entire iterated box filter — actually two copies of it: the row_wise_iterative_box_filter_kernel keeps two row-sized buffers in shared memory and ping-pongs between them across iterations, so that the scan reads from temp[turn] and the box-filter pass writes the result into temp[1-turn]. After the last iteration, the active buffer is written back to global memory in coalesced fashion.
A subtlety I had to learn the hard way (and only became really aware of after looking at the kernel in Nsight): bank conflicts in row_wise_inclusive_scan_in_place. Shared memory on NVIDIA GPUs is split into 32 banks of 4 bytes each, and two threads of the same warp accessing different addresses that map to the same bank get serialized. The Blelloch scan's access pattern is essentially its own worst enemy here: at level $\ell$ of the up-sweep, the active threads read pairs of cells with stride $2^\ell$, so once $2^\ell \geq 32$ (i.e. from the fifth level on) all active threads of a warp land in the same bank, and we get a 32-way conflict on the critical path. The classic mitigation is to insert one padding word every 32 entries, so that the value at logical index $i$ ends up at physical offset $i + \lfloor i/32 \rfloor$. That is precisely what the macro CONFLICT_FREE_OFFSET(i) = i >> 5 does, and in practice it is enough to scatter the strided accesses across all 32 banks. (One can go further with a second-order term (i >> 10) to also clean up the small 2- and 4-way conflicts at the bottom of the tree, but on the kernels I profiled the gain wasn't worth the extra shared memory.) The same offset is threaded through row_wise_box_filter_replicate_border and the buffer indexing in the parent kernel, so every read and write of the row buffer goes through it.
Local Generation of Permutations and Their Inverses
Another issue concerns the random partitioning of each block into groups of 4. This is done by enumerating all pixels in a block from $1$ to $n$, generating a random permutation $\pi$, i.e., uniformly drawing a sample of the symmetric group $S_n$, and then putting the pixels with indices $\pi(4k+1), \dots, \pi(4k+4)$ into group $k$. Here $n$ is the number of pixels in a block. In the earlier iterations of the algorithm, $n$ is therefore on the order of $W^2$. So these permutations can become quite large. PLAS generates these permutations by calling torch.randperm which on CUDA materializes the array [0, ..., n-1], then assigns each index a random key, and finally sorts the array according to the associated keys with a GPU implementation of radix sort, i.e., a work complexity of $O(n)$ and a depth of $O(\log n)$. When we neglect key collisions, this yields a uniformly distributed permutation. However, there is still quite a lot of data movement. For large $n$ the array easily exceeds shared GPU memory. So there is significant non-local global memory traffic.
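The key-sort construction is easy to reproduce in a couple of lines of PyTorch (this mirrors the idea behind torch.randperm's CUDA path, not its literal implementation):

```python
import torch

n = 10
device = "cuda" if torch.cuda.is_available() else "cpu"
keys = torch.rand(n, device=device)   # one random key per index
perm = torch.argsort(keys)            # sort indices [0..n-1] by their keys
# Neglecting key collisions, `perm` is distributed like torch.randperm(n).
print(perm)
```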
We can avoid this by trading in some entropy for entirely local computation. The idea is that we only draw one random key $k$ which parameterizes a bijective function $f_k$ on $\{0, \dots, n-1\}$. Then we launch $n$ threads and let thread $i$ compute $f_k(i)$. The crux lies in designing an $f$ which for any $k$ induces a bijective $f_k$ which is both quick to compute and yet hard to distinguish from a uniform sample by statistical tests. Mitchell et al. succeeded in this quest by designing a special Philox cipher also supporting an odd number of bits. Permutations of a size $n$ not equal to a power of two are obtained by generating a permutation with a size of the next greater power of two and then deleting all elements greater than $n$ (or greater than or equal to $n$ in case of 0-indexing). This compaction is realized by scanning the boolean mask of the elements we want to keep and using the result as the write indices in a gather operation. Although this is also non-local global memory access, it is still much less than in a GPU radix sort.
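To convey the flavor of such counter-based permutations, here is a toy Feistel-style bijection in Python together with the mask-based compaction. Note this is a stand-in of my own, not Mitchell et al.'s actual Philox construction (which, among other things, supports odd bit widths and has vetted statistical quality):

```python
import numpy as np

def feistel_permute(i: int, key: int, bits: int) -> int:
    """Bijection on [0, 2**bits) via a balanced Feistel network.
    XOR-ing a keyed hash of one half into the other is invertible
    no matter what the round function does."""
    half = bits // 2
    mask = (1 << half) - 1
    left, right = i >> half, i & mask
    for r in range(4):                              # a few mixing rounds
        f = (right * 0x9E3779B1 + key + r) & mask   # arbitrary round function
        left, right = right, left ^ f
    return (left << half) | right

def compacted_permutation(n: int, key: int) -> np.ndarray:
    """Permute the next power of two, then keep only values < n."""
    bits = max(2, (n - 1).bit_length())
    if bits % 2:                                    # toy version needs even bits
        bits += 1
    full = np.array([feistel_permute(i, key, bits) for i in range(1 << bits)])
    return full[full < n]                           # mask + compaction

perm = compacted_permutation(100, key=42)
assert sorted(perm) == list(range(100))             # it really is a permutation
```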
If we are willing to trade in more entropy (and accept statistically slightly worse sorting quality) for more performance, we can also use a much simpler (affinely) linear congruential generator (LCG). The idea is that for $a$ coprime to $n$ the mapping $i \mapsto a \cdot i \bmod n$ is a bijection on $\{0, \dots, n-1\}$.
Adding a bias $b$ to the mapping, i.e. $i \mapsto (a \cdot i + b) \bmod n$, doesn't change that property. Note that it doesn't make sense to choose values for $a$ and $b$ outside of $\{0, \dots, n-1\}$. And for $a$ there is the additional requirement that it must be coprime to $n$. Precomputing for every single $n$ all the $n$-specific coprime numbers less than $n$ is expensive and requires a lot of storage. A simple alternative is to pick a random prime from a precomputed list of primes up to the greatest $n$ we expect during algorithm runs. As $n$ is always divisible by 4, it is certainly going to be coprime. The number of different permutations we can draw from scales according to the prime number theorem as $n \cdot \pi(n) \sim n^2 / \ln n$, compared to the $n!$ permutations of $S_n$. So we really give up quite some entropy. On the flip side, no compaction is required. Memory access is entirely local. In fact, we don't have to do any memory access as the permutation values can be computed just in time by downstream kernels which need them.
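A minimal sketch of the LCG permuter (names are mine; sympy.randprime stands in for drawing from the precomputed prime list, and $n$ is a power of two here so any odd prime is coprime):

```python
import random
from sympy import randprime

def lcg_perm(i: int, a: int, b: int, n: int) -> int:
    """Value of the permutation at position i, computed just in time."""
    return (a * i + b) % n

n = 64 * 64                       # pixels in a block; a power of two here
a = randprime(3, n)               # random odd prime, hence coprime to n
b = random.randrange(n)           # arbitrary bias
perm = [lcg_perm(i, a, b, n) for i in range(n)]
assert sorted(perm) == list(range(n))   # bijective as promised
```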
For PLAS it is also desirable to quickly compute the accompanying inverse of any permutation we generate. Let me explain why. After step 2 the grid is in "blockified" form in memory in the sense that each block spans a contiguous global memory segment. Now there is a question whether the ordering induced by step 3 should be materialized in memory. That is, should we (a) apply the permutations on the blocks, then permute inside the groups with a highly local write-back, and then apply the inverse permutations on the blocks, or (b) just gather in step 4 into shared memory and then scatter back after the optimal 4-group permutation is found. Option (a) is superior as it allows us to avoid any scatters and only use gathers, which have a higher bandwidth. This is why generating the accompanying inverse permutation on the fly is desirable. We could of course also use the permutation itself to realize the inverse permutation. But that would be a scatter! If we have the inverse permutation, on the other hand, we can realize it with another gather.
For the LCG, computing the inverse permutation is rather straightforward. Basically, we just have to calculate the multiplicative inverse $a^{-1}$ of the generator $a$ in the ring $\mathbb{Z}_n$, which can be done by applying the Euclidean algorithm for finding the greatest common divisor (1 in our case), extended by a backtracking phase in which we multiply the constants of each step to arrive at the inverse.
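In Python, pow(a, -1, n) already does exactly this (since 3.8), but for illustration here is the extended Euclidean algorithm spelled out, together with the resulting inverse permuter (names are mine):

```python
def mod_inverse(a: int, n: int) -> int:
    """Multiplicative inverse of a modulo n via the extended Euclidean
    algorithm (iterative form; requires gcd(a, n) == 1)."""
    old_r, r = a, n
    old_s, s = 1, 0
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r     # gcd steps
        old_s, s = s, old_s - q * s     # track the coefficient of a
    assert old_r == 1, "a and n must be coprime"
    return old_s % n

def lcg_perm_inverse(j: int, a: int, b: int, n: int) -> int:
    """Position i with (a*i + b) % n == j, again computed just in time."""
    return (mod_inverse(a, n) * (j - b)) % n

n, a, b = 4096, 37, 11
i = 123
j = (a * i + b) % n
assert lcg_perm_inverse(j, a, b, n) == i
```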
For the Philox permutations the reversal is actually not fully discussed by Mitchell et al. They only show how to compute the uncompacted inverse, which is a straightforward reversal of their Philox cipher. Computing the compacted inverse is somewhat more subtle. The uncompacted inverse kind of contains the compacted inverse in the first $n$ positions. But these positions contain "uncompacted indices" which are too high. The offset of the value $v$ from the value we actually want in that same array position is equal to the number of elements greater than or equal to $n$ occurring before index $v$ in the uncompacted permutation. We can compute these offsets efficiently by an exclusive scan on the $\geq n$ binary mask and then look them up and subtract them from the uncompacted inverse permutation in a gather fashion. This gives us the compacted inverse in the first $n$ spots. An equivalent way is to fill the spots of the uncompacted permutation holding values less than $n$ with the numbers $0, 1, 2, \dots$ from left to right. We can do this efficiently by running an exclusive scan on the $< n$ binary mask. Then we apply (in a gather interpretation) the uncompacted inverse to the scan result. The first $n$ entries of the result are then the compacted inverse permutation. The figure below works through both procedures step by step on a toy example of uncompacted length six (six is not a power of two, but for the compaction step that does not matter); the matching pseudo-code follows right after.
In symbols, with the uncompacted permutation $\sigma$ of length $m$, its inverse $\sigma^{-1}$, and the target compacted size $n$, the two methods read:

$$\pi^{-1}[j] = \sigma^{-1}[j] - \operatorname{exscan}\!\big(\mathbb{1}[\sigma \geq n]\big)\big[\sigma^{-1}[j]\big] \quad \text{(method 1)}$$

$$\pi^{-1}[j] = \operatorname{exscan}\!\big(\mathbb{1}[\sigma < n]\big)\big[\sigma^{-1}[j]\big] \quad \text{(method 2)}$$

for $j = 0, \dots, n-1$, where $\operatorname{exscan}$ denotes the exclusive prefix sum and $\pi^{-1}$ the compacted inverse.
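The following NumPy sketch implements both methods and checks them against the compacted permutation (array names are mine):

```python
import numpy as np

def compacted_inverse(sigma: np.ndarray, n: int):
    """Inverse of the compaction of `sigma` to values < n,
    via the two scan-based methods described above."""
    m = len(sigma)
    sigma_inv = np.empty(m, dtype=int)
    sigma_inv[sigma] = np.arange(m)          # uncompacted inverse

    exscan = lambda mask: np.concatenate([[0], np.cumsum(mask)[:-1]])

    # Method 1: subtract, per position, the number of dropped (>= n)
    # elements occurring before it in sigma.
    drop_before = exscan(sigma >= n)
    inv1 = sigma_inv[:n] - drop_before[sigma_inv[:n]]

    # Method 2: rank the kept (< n) positions left to right, then
    # gather those ranks through the uncompacted inverse.
    keep_rank = exscan(sigma < n)
    inv2 = keep_rank[sigma_inv[:n]]

    return inv1, inv2

sigma = np.array([4, 1, 5, 3, 0, 2])         # toy uncompacted permutation
n = 4                                        # keep only values < 4
pi = sigma[sigma < n]                        # compacted permutation
inv1, inv2 = compacted_inverse(sigma, n)
assert np.array_equal(pi[inv1], np.arange(n))
assert np.array_equal(inv1, inv2)            # both methods agree
```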
Memory Access Patterns and Volume
PLAS expresses every step of the main loop as a chain of PyTorch ops, while CUPLAS replaces each step with a single fused kernel that pushes intermediate state into shared memory and registers.
The layout transforms are the cleanest example.
PLAS's params_to_blocky chains together torch.roll, unsqueeze, PixelUnshuffle, two permutes, two reshapes, and a flatten, with several of these links materializing a fresh tensor of the full $H \times W \times C$ grid size in global memory just to obtain the block-major layout — and the symmetric chain runs again at the end of every iteration to undo it.
CUPLAS's blockify_kernel does the same thing in a single coalesced pass: each output element resolves its source location with a few index-arithmetic ops and writes once.
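As a sketch of that index arithmetic (shapes and names are mine; the real blockify_kernel also handles the roll offset and the ping-pong buffers):

```python
import numpy as np

def blockify(grid: np.ndarray, b: int) -> np.ndarray:
    """Rearrange an (H, W, C) grid so that each b x b block occupies a
    contiguous segment: output shape (H//b * W//b, b*b, C)."""
    H, W, C = grid.shape
    out = grid.reshape(H // b, b, W // b, b, C)  # split both axes into blocks
    out = out.transpose(0, 2, 1, 3, 4)           # bring block indices forward
    return out.reshape(-1, b * b, C)

# In a fused kernel the same mapping is pure index arithmetic: for
# output element (block, pixel, c) the source row/column are
#   y = (block // (W // b)) * b + pixel // b
#   x = (block %  (W // b)) * b + pixel %  b
```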
The same pattern recurs at the channel boundary, where gather_channels_kernel and scatter_channels_kernel reinterleave the C×H×W layout exactly once per blur — not once per kornia call — and inside the row-wise box filter, where the row stays in shared memory through all iterations, leaving the global-memory volume of an entire blur at one read and one write per pixel per channel, regardless of kernel size.
The tightest fusion is in the per-block assignment kernel swap_within_group_kernel.
Each CUDA thread owns one 4-pixel group; it loads its grid values and target values into shared memory once, packs the four 8-bit indices of a permutation into a single int, and enumerates all 24 permutations with Heap's algorithm entirely in registers — computing the 24 candidate distances and the argmin without touching global memory in between.
Only the winning permutation is written back.
PLAS performs the same logical step as a torch.gather of the candidate tuples, an einsum against the 24×4×4 one-hot to produce a per-group distance tensor, and a separate argmin kernel: three full-batch global-memory passes where CUPLAS has none.
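Per group, the fused kernel's inner logic boils down to the following (shown in NumPy with itertools.permutations for clarity; the CUDA kernel instead enumerates the orderings with Heap's algorithm in registers):

```python
import numpy as np
from itertools import permutations

PERMS = np.array(list(permutations(range(4))))   # all 24 orderings

def best_group_permutation(vals: np.ndarray, tgts: np.ndarray) -> np.ndarray:
    """vals, tgts: (4, C) grid values and target values of one group.
    Returns the reordering of `vals` closest to `tgts` in squared L2."""
    # Distance of every candidate ordering to the target, all at once.
    costs = ((vals[PERMS] - tgts[None]) ** 2).sum(axis=(1, 2))
    return PERMS[np.argmin(costs)]               # only the winner is applied

rng = np.random.default_rng(0)
vals, tgts = rng.standard_normal((2, 4, 8))
print(best_group_permutation(vals, tgts))
```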
CUPLAS does pay one price for these fused kernels.
Permutations across CUDA blocks rule out true in-place updates, so the inner loop maintains two ping-pong copies of every blockified buffer (blockified_grid[2], blockified_target_grid[2], blockified_index[2]), roughly doubling the persistent global-memory footprint of the per-iteration working set.
In return — together with the bank-conflict-aware shared-memory layouts of the scan and the box filter, and the just-in-time LCG permuter that never materializes its permutation array at all — almost every byte CUPLAS moves through global memory is pixel data the algorithm actually needs to look at.