Paper Reading Notes (1): 3D Gaussian Splatting for Real-Time Radiance Field Rendering
Main Achievements
- Real-time (≥ 30 fps) novel view synthesis at 1080p resolution
Key Elements
- 3D Gaussians for scene representation
- Anisotropic covariance optimization for accurate scene representation
- Fast, visibility-aware splatting algorithm, using a tile-based and sorting renderer
Core Algorithm
Differentiable 3D Gaussian Splatting
3D Gaussians are defined as ellipsoids centered at $ \mathbf x $ with full 3D covariance matrix $ \Sigma $:
$$ G(\mathbf x) = e^{-\frac{1}{2}\mathbf x^T \Sigma^{-1} \mathbf x} $$ To keep $ \Sigma $ valid (positive semi-definite) during optimization, it is factored as an ellipsoid with scaling and rotation: $$ \Sigma = \mathbf R \mathbf S \mathbf S^T \mathbf R^T $$ so it can be compactly represented with a 7-element vector (3 for scaling and 4 for a rotation quaternion). Each 3D Gaussian also uses spherical harmonics (SH) to represent its view-dependent color, written as the SH coefficient vector $ \mathbf c $. To summarize, each 3D Gaussian is represented by the tuple $ (\mathbf x, \mathbf s, \mathbf q, \mathbf c, \alpha) $, where $ \mathbf s $ is the scaling vector, $ \mathbf q $ is the rotation quaternion, and $ \alpha $ is the opacity.
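The factorization $ \Sigma = \mathbf R \mathbf S \mathbf S^T \mathbf R^T $ can be sketched in NumPy (function names are mine; the actual paper optimizes these parameters with automatic differentiation in CUDA):

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance_from_params(s, q):
    """Sigma = R S S^T R^T: symmetric positive semi-definite by construction."""
    R = quat_to_rotmat(q)
    M = R @ np.diag(s)
    return M @ M.T

def gaussian_density(x, mean, s, q):
    """Unnormalized density G(x) = exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))."""
    Sigma = covariance_from_params(s, q)
    d = x - mean
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d)
```

Building $ \Sigma $ from $ (\mathbf s, \mathbf q) $ rather than optimizing its 6 free entries directly guarantees a valid covariance at every gradient step.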
Adaptive Density Control
- Initial 3D Gaussians are formed with SfM points;
- Adaptive control targets "under-reconstruction" regions (too few 3D Gaussians to capture detailed geometric features) and "over-reconstruction" regions (a single Gaussian covers too large an area);
- Densify Gaussians based on view-space positional gradients: clone small Gaussians in under-reconstructed regions and split large ones in over-reconstructed regions;
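The clone/split logic above can be sketched as follows. This is a simplified NumPy version with scalar per-Gaussian scales; the thresholds and the scale divisor are assumptions loosely modeled on the paper's settings (a view-space gradient threshold around 2e-4 and a split factor of 1.6), not its exact CUDA implementation:

```python
import numpy as np

def densify(means, scales, view_grads, grad_thresh=2e-4, scale_thresh=0.01):
    """Sketch of adaptive density control.

    means:      (N, 3) Gaussian centers
    scales:     (N,)   per-Gaussian extent (largest axis, world space)
    view_grads: (N,)   accumulated view-space positional gradient magnitudes
    Returns new (means, scales) after cloning and splitting.
    """
    hot = view_grads > grad_thresh
    clone = hot & (scales <= scale_thresh)  # under-reconstruction: small Gaussians
    split = hot & (scales > scale_thresh)   # over-reconstruction: large Gaussians

    out_means = [means[~split]]             # keep all except the split originals
    out_scales = [scales[~split]]
    # Clone: duplicate small high-gradient Gaussians (the copy is then
    # moved along the positional gradient in later optimization steps).
    out_means.append(means[clone])
    out_scales.append(scales[clone])
    # Split: replace each large Gaussian with two smaller ones sampled
    # near the original, with scale reduced (assumed factor: 1.6).
    for m, s in zip(means[split], scales[split]):
        out_means.append(m + np.random.normal(scale=s, size=(2, 3)))
        out_scales.append(np.full(2, s / 1.6))
    return np.concatenate(out_means), np.concatenate(out_scales)
```

Note that cloning grows the total count by one per Gaussian, while splitting keeps the count growth at one as well (two new replace one old), so both operations add capacity gradually.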
Fast Differentiable Rasterization
Tile-based rendering and pre-sorting all primitives for the entire image:
- Split the screen into 16x16 tiles
- Cull 3D Gaussians against view frustum and tiles
- Instantiate Gaussians and assign them to tiles, keying each instance by tile ID and depth
- Sort all instances in one pass with a fast GPU radix sort
- For each tile, create a list of Gaussians and rasterize each tile independently
- Warm up optimization at lower resolution, then upsample twice during training
- Gradual optimization for SH bands: start from zero-th order and then add more bands
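The tile-assignment and sorting steps above hinge on packing (tile ID, depth) into a single sort key, so one global sort yields per-tile, front-to-back lists. A minimal NumPy sketch (names and the 64-bit key layout are my reconstruction; the paper's renderer does this with a CUDA radix sort):

```python
import numpy as np

TILE = 16  # the paper splits the screen into 16x16-pixel tiles

def make_sort_keys(px, py, depth, width):
    """Pack (tile ID, depth) into one 64-bit key per Gaussian instance.

    Sorting these keys orders instances first by tile, then front-to-back
    by depth within each tile -- one global sort replaces per-tile sorting.
    px, py: (N,) integer pixel coordinates of projected Gaussian centers
    depth:  (N,) positive view-space depths
    """
    tiles_x = (width + TILE - 1) // TILE
    tile_id = (py // TILE).astype(np.uint64) * tiles_x + (px // TILE).astype(np.uint64)
    # Reinterpret the float32 depth bits as an integer: for positive floats,
    # the bit pattern sorts in the same order as the values themselves.
    depth_bits = depth.astype(np.float32).view(np.uint32).astype(np.uint64)
    return (tile_id << np.uint64(32)) | depth_bits
```

After `order = np.argsort(make_sort_keys(...))`, each tile's Gaussian list is a contiguous run in `order`, which is what lets every tile rasterize independently.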
Unsorted Thoughts
- The highly efficient scene representation is key to enabling real-time rendering at 1080p; at the same time, the representation may not be the most suitable for digital humans.
- How do we enable it for motion? One recent attempt is this work.
- The Gaussian placement/densification algorithm is still quite heuristic. Could we derive it from image features automatically? Maybe even make it trainable?
References
- Differentiable Point-Based Radiance Fields for Efficient View Synthesis
- Point-NeRF: Point-based Neural Radiance Fields
- VoGE: A Differentiable Volume Renderer using Neural Gaussian Ellipsoids for Analysis-by-Synthesis
- EWA Volume Splatting
- Pulsar: Efficient Sphere-based Neural Rendering
- Point-Based Neural Rendering with Per-View Optimization
- NeRFs: The Search for the best 3D Representation
- Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis