Gaussian Splatting has garnered significant attention since last year, particularly following this paper. This technique has piqued my interest due to its potential to become a standard approach for implementing 3D photography.
Traditional computer graphics predominantly represents 3D scenes using triangles, and GPU hardware is optimized for rendering large quantities of triangles efficiently. Reconstructing a realistic 3D scene from a set of photos taken from different angles has been a vibrant research area, but achieving photorealism remains a challenge. Triangles fall short when it comes to representing fuzzy, transparent, or ill-defined objects such as clouds or hair.
In recent years, there have been advances in implicit 3D representations such as neural radiance fields. These methods can generate photorealistic results and handle ill-defined objects well, albeit at the cost of intense computation and slow rendering. Researchers have therefore been exploring new solutions that bridge the gap between these approaches, and Gaussian splatting has emerged as a strong candidate. This remains a rapidly evolving field, with new methods appearing that attempt to address some of the limitations of Gaussian splatting. Nevertheless, understanding Gaussian splatting deeply is worthwhile, as it may form the foundation for many future methods.
In this post, we'll explore how to render a single Gaussian splat and make sense of its mathematics. This post alone won't be sufficient to render an entire Gaussian splatting scene, but it will dive into the core aspects of the technique. Similar to my other posts, rather than focusing on mathematical rigor, I'd like to emphasize intuition, as I believe it is crucial for understanding complex concepts. For those interested in formal descriptions of the method, I'll recommend additional resources. I've also received invaluable assistance from the GaussianSplats3D reference implementation in Three.js. I highly recommend checking out this project.
I plan to provide a full implementation of Gaussian splatting as an example for the WebGPU book I'm currently working on.
What is Gaussian Splatting?

As previously mentioned, while triangles are the most commonly used graphics primitive, there has long been a pursuit of alternatives to triangles as the basis for rendering. Points, for example, are another common choice. Points don't require explicit connectivity, making them handy for representing a 3D scene where connectivity is difficult to sort out, such as one captured with a 3D scanner or LiDAR. However, points have no size: when viewed closely, they might not be dense enough to cover a surface. One idea for alleviating this problem is to render small disks, called surface splats, instead of points. Because a disk has a size, tweaking that size can help cover the holes. The disks usually don't have a solid color; instead, their color fades toward the edge according to a Gaussian, which helps blend nearby disks into a smooth transition.
Another common choice is called volume rendering. A volume is a 3D grid of voxels, which can be viewed as 3D pixels that absorb light. We shoot out rays from the camera and test the intersection between the rays and the voxels, blending their color values along each ray. Voxel rendering excels in representing fuzzy surfaces, but the traditional way of storing a volume is very costly.
Gaussian splatting can be viewed as a hybrid of the two. Each rendering element is a 3D Gaussian function, or a blob. It has a position and a size, similar to a surface splat. But unlike a surface splat, which is a flat 2D disk, a Gaussian splat occupies a volume. Moreover, a Gaussian splat has a shape: the basic shape is a sphere, but we can stretch and rotate it into an arbitrary ellipsoid. Like a surface splat, the color of a Gaussian splat fades according to its defining Gaussian function. Gaussian splatting combines the benefits of both aforementioned approaches: it doesn't require connectivity, it's sparse, and it is capable of representing fuzzy surfaces.
Although Gaussian splatting gained popularity last year through this paper, the concept isn't new, as the rendering part was first mentioned in this paper. To summarize the process, given a scene with many Gaussian splats, we first perform a series of rotations and scalings on them, similar to what we do to vertices in the normal rendering process. Then, these splats are projected onto the screen as 2D Gaussians. Finally, they are sorted based on depth for applying front-to-back blending.
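To make this pipeline a bit more concrete, here is a minimal, hypothetical sketch of the per-frame loop in JavaScript; viewDepth, projectCovariance, and drawQuad are placeholders for the steps described above, not functions from the reference implementation.

// Hypothetical outline of the splatting pipeline; the helper functions are placeholders.
function renderSplats(splats, modelView, projection) {
    // Sort splats by distance from the camera so they can be blended front to back.
    const sorted = [...splats].sort((a, b) => viewDepth(a, modelView) - viewDepth(b, modelView));
    for (const splat of sorted) {
        // Rotate/scale the 3D covariance into camera space and project it to a 2D covariance.
        const cov2D = projectCovariance(splat.covariance, modelView, projection);
        // Rasterize a screen-aligned quad stretched along the 2D Gaussian's axes and blend it.
        drawQuad(splat, cov2D);
    }
}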
If you're finding it challenging to visualize this concept, I recommend watching the YouTube video linked above to gain a clearer understanding.
To understand how to render a Gaussian splat, we need to be familiar with the geometric interpretation of a 3D Gaussian function. For that, I recommend this document. I will skip the details and summarize what's useful for our implementation.
A 3D Gaussian is defined by two parameters: a centroid and a 3x3 covariance matrix. Intuitively, the centroid is the position of a Gaussian, and the covariance matrix defines the shape of the Gaussian (an ellipsoid).
\mathcal{N}(\mu, \sigma^2)

If we consider 1D Gaussians, they are defined by a mean \mu and a variance \sigma^2. In this context, the mean \mu dictates the center of the Gaussian on the x-axis, while \sigma^2 determines its spread or shape.

The same principle extends to 2D and 3D Gaussians, which are defined by a mean \mu (or centroid) and a covariance matrix \Sigma. In 2D, the covariance \Sigma is a 2x2 matrix, and in 3D, it is a 3x3 matrix. The covariance matrix holds geometric significance, defining the ellipsoidal shape of the Gaussian.

\mathcal{N}(\mu, \Sigma)
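For completeness, the value a Gaussian with centroid \mu and covariance \Sigma assigns to a point \mathbf{x} falls off exponentially with the distance from the centroid, measured through \Sigma^{-1}; in splatting, the normalization constant is usually folded into a per-splat opacity, so only this falloff matters:

G(\mathbf{x}) = e^{-\frac{1}{2} (\mathbf{x} - \mu)^\top \Sigma^{-1} (\mathbf{x} - \mu)}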
Below is an illustration showing various 2D Gaussian shapes and their corresponding covariances.
We can shift the centroid to adjust a Gaussian's position, and we can also rotate and scale the covariance matrix to modify its shape. This process is akin to applying a model-view matrix to vertices during rendering, although the mathematical operations involved are not identical.
The fundamental insight from the geometric interpretation outlined in the document is that the covariance matrix \Sigma specifies a rotation and scaling transformation that morphs a standard Gaussian, whose covariance is the identity matrix \mathbf{I}, into an ellipsoidal shape. Let this transformation be represented by a 3x3 matrix \textit{T}; it can be factored into the product of a 3x3 rotation matrix \textit{R} and a scaling matrix \textit{S}. Thus:
\begin{aligned}
\Sigma &= \textit{T} \textit{I} \textit{T}^\top \\
\Sigma &= \textit{R} \textit{S} (\textit{R} \textit{S})^\top \\
\Sigma &= \textit{R} \textit{S} \textit{S}^\top \textit{R}^\top \\
\end{aligned}
Additionally, if we wish to apply a further transformation \textit{V} to the covariance matrix, we can do so using the formula \Sigma^\prime = \textit{V} \Sigma \textit{V}^\top. (Intuitively, if points transform as x^\prime = \textit{V} x, then their covariance transforms as \textit{V} \Sigma \textit{V}^\top.) This is crucial for tasks such as projecting the 3D covariance into 2D screen space.
Before diving into rendering Gaussian splats, it's crucial to establish a ground truth. Debugging graphics programs can be challenging, so visualizing the rendering result against this ground truth is essential to ensure the implementation's correctness. There are two primary methods to create the ground truth:
Generate a point cloud from a Gaussian distribution, where the Gaussian value serves as the sample density.
Deform a sphere into an ellipsoid based on the same transformation defined by the Gaussian's covariance matrix (a small sketch of this idea follows below).
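For the second method, a minimal sketch could transform unit-sphere vertices with the same rotation and scaling matrices that are constructed later in this post; sphereVertices and centroid are hypothetical inputs here, not part of the reference implementation.

// Hypothetical sketch: deform a unit sphere into the Gaussian's ellipsoid.
// 'sphereVertices' is assumed to be an array of glMatrix.vec3 positions on a unit sphere.
const RS = glMatrix.mat3.multiply(glMatrix.mat3.create(), rotation, scale);
const ellipsoidVertices = sphereVertices.map((v) => {
    const out = glMatrix.vec3.transformMat3(glMatrix.vec3.create(), v, RS);
    return glMatrix.vec3.add(out, out, centroid); // move the ellipsoid onto the Gaussian's centroid
});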
Once the Gaussian splat has been rendered, we anticipate seeing it overlap with the ground truth, providing a visual confirmation of the accuracy of the implementation.
In this post, I'll use the first method. Python offers excellent support for point sampling, so I'll begin by implementing a helper program in Python to sample points from a Gaussian distribution and save the result to a file. Then, I'll render these points using WebGPU.
import numpy as np
import codecs, json

# Define the parameters of the Gaussian function (centroid and 3x3 covariance)
mu = [10, 10, 10]
cov = [[0.8084480166435242, -0.376901775598526, -0.947654664516449],
       [-0.376901775598526, 2.381552219390869, 1.5799134969711304],
       [-0.947654664516449, 1.5799134969711304, 1.899906873703003]]

# Generate the data: sample points from the multivariate normal distribution
X = np.random.multivariate_normal(mu, cov, 10000)

# Save the sampled positions to a JSON file
json.dump(X.tolist(), codecs.open('points.json', 'w', encoding='utf-8'),
          separators=(',', ':'), sort_keys=True, indent=4)
The code is straightforward. The key line is np.random.multivariate_normal(mu, cov, 10000), which samples 10,000 points from a multivariate normal distribution (the Gaussian). Then, we save their positions in a JSON file.
The covariance matrix used in the program above may appear arbitrary. I will provide an explanation of how these values were determined later in this post.
Now, let's define the single Gaussian splat we intend to render. Since the specific Gaussian splat doesn't matter for our purposes, we will simply select one at random. In this example, I've chosen a rotation represented by the quaternion (0.601, 0.576, 0.554, 0.01) and a scaling vector of (2.0, 0.3, 0.5). Quaternions and scaling vectors are used because they are the information typically stored in a Gaussian splat file.
The first step involves deriving the rotation matrix and the scaling matrix to reconstruct the covariance matrix. For this task, we will utilize the glMatrix library.
const rot = glMatrix.quat.fromValues(0.601, 0.576, 0.554, 0.01);
const scale = glMatrix.mat3.fromScaling(glMatrix.mat3.create(),
glMatrix.vec3.fromValues(2.0, 0.3, 0.5));
const rotation = glMatrix.mat3.fromQuat(glMatrix.mat3.create(), rot);
Then, according to equation 3, the covariance matrix can be derived as follows:
const T = glMatrix.mat3.multiply(glMatrix.mat3.create(), rotation, scale);
const T_transpose = glMatrix.mat3.transpose(glMatrix.mat3.create(), T);
this.covariance = glMatrix.mat3.multiply(glMatrix.mat3.create(), T, T_transpose);
The covariance matrix hardcoded in the Python sampling function is identical to the one calculated here.
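A quick way to confirm this (purely for debugging) is to print the computed matrix and compare it against the cov array hardcoded in the Python script:

// Sanity check: these nine values should match the covariance hardcoded in the Python script.
console.log(Array.from(this.covariance));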
During rendering, we update the model-view matrix based on the camera. Using this model-view matrix and the projection matrix, our goal is to accurately project the covariance matrix onto the screen plane, converting the 3D Gaussian into a 2D Gaussian.
The first step is to obtain the model-view matrix from the camera and convert it into a 3x3 matrix (the upper-left 3x3 corner). In vertex transformation with homogeneous coordinates, a 4x4 matrix is constructed for translation, with the offset values assigned to the last column of the matrix. However, when dealing with the 3x3 covariance matrix, we cannot perform translation, only rotation and scaling. Therefore, to obtain a 3x3 model-view matrix, we simply use the upper-left 3x3 corner. This is in contrast to billboarding, where the 3x3 upper-left corner is assigned the 3x3 identity matrix to remove rotation and scaling.
const W = glMatrix.mat3.transpose(glMatrix.mat3.create(),
glMatrix.mat3.fromMat4(glMatrix.mat3.create(),
arcballModelViewMatrix));
The second step involves performing perspective projection, which is akin to applying the projection matrix. In computer graphics, applying perspective projection transforms coordinates from either camera space or eye space into clip space. Prior to this transformation, the visible space is enclosed within a view frustum, which is a 3D trapezoidal volume. After the transformation, the visible space is represented as a cubic volume.
However, it's important to note that perspective transformation is not affine, which means that the original ellipsoidal shape will be warped once projected.
Since non-affine transformations cannot be represented as a 3x3 matrix and applied directly to the covariance matrix, our objective is to find an approximation that can transform the Gaussian in a manner that simulates the projection.
To approximate this transformation, we first consider how a point is projected onto the screen.
Given a point (x_e, y_e, z_e) in camera space to be projected onto the near plane at z=n, the projection can be calculated as follows:
\begin{aligned}
x_p &= \frac{-n x_e}{z_e}\\
y_p &= \frac{-n y_e}{z_e}\\
z_p &= - || (x_e, y_e, z_e)^\top ||\\
\end{aligned}
This mapping from (x_e, y_e, z_e) to (x_p, y_p, z_p) is not affine and cannot be represented by a 3x3 matrix. According to the paper referenced here, we can approximate this transformation using the first two terms of the Taylor expansion of the mapping function: v_p + J(v-v_c), where v_p is the projected centroid of the Gaussian, and J is the Jacobian, or the first-order derivative.
The concept of Taylor expansion, while its name might seem daunting, is actually quite straightforward. Let's illustrate it using a 1D unknown function y = f(x). Our aim is to estimate the function near a specific input point x_0. When we focus on a very small window around x_0, the function can be approximated by a straight line. This line can be obtained by calculating the derivative of the function f. Therefore, the value of f(x) where x is close to x_0 can be approximated by \bar{y} = f(x_0) + f^\prime(x_0)(x - x_0).
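As a small worked example, and because the projection above divides by z_e, consider f(z) = \frac{1}{z} near a point z_0:

f(z) \approx f(z_0) + f^\prime(z_0)(z - z_0) = \frac{1}{z_0} - \frac{z - z_0}{z_0^2}

Terms of exactly this form, a value divided by z_e or by z_e^2, are what appear in the Jacobian below.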
In our projection scenario, f(x_0) represents the projected position of the Gaussian's centroid, which we can easily calculate using the model-view matrix and the centroid position. f^\prime(x_0) corresponds to the Jacobian matrix, which we will use to transform the covariance matrix.
The Jacobian matrix (transposed) can be calculated as follows:
\begin{pmatrix}
\frac{d x_p}{d x_e} & \frac{d y_p}{d x_e} & \frac{d z_p}{d x_e}\\
\frac{d x_p}{d y_e} & \frac{d y_p}{d y_e} & \frac{d z_p}{d y_e}\\
\frac{d x_p}{d z_e} & \frac{d y_p}{d z_e} & \frac{d z_p}{d z_e}\\
\end{pmatrix}
Based on equation 4, we can calculate the Jacobian matrix as follows:
\begin{pmatrix}
-\frac{n}{z_e} & 0 & 0\\
0 & -\frac{n}{z_e} & 0\\
\frac{n x_e}{z_e^2} & \frac{n y_e}{z_e^2} & 0\\
\end{pmatrix}
The last column should not be all zeros, but in our reference code, the last column is indeed set to zeros. I believe the reason for this is that, for the projection of Gaussians, we can effectively remove the depth information by assuming z_p is a constant. This is because we do not need depth testing for Gaussians; instead, we will sort them from front to back and render them in that order.
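Sorting itself is outside the scope of this post, but a rough sketch of the idea (with a hypothetical splatPositions array of glMatrix.vec3 centroids) might look like this:

// Hypothetical sketch: order splats front to back by view-space depth.
const depths = splatPositions.map((p) =>
    glMatrix.vec3.transformMat4(glMatrix.vec3.create(), p, arcballModelViewMatrix)[2]);
const drawOrder = depths
    .map((z, i) => ({ z, i }))
    .sort((a, b) => b.z - a.z) // the camera looks down -z, so a larger z means closer
    .map((entry) => entry.i);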
Here is how the Jacobian matrix (in column-major order) is calculated in the code, based on the perspective projection matrix:
// Focal length in pixels, recovered from the projection matrix
const focal = {
    x: projectionMatrix[0] * devicePixelRatio * renderDimension.x * 0.5,
    y: projectionMatrix[5] * devicePixelRatio * renderDimension.y * 0.5
};
const s = 1.0 / (viewCenter[2] * viewCenter[2]);
// Column-major 3x3 Jacobian, evaluated at the splat's view-space center
const J = glMatrix.mat3.fromValues(
    -focal.x / viewCenter[2], 0., (focal.x * viewCenter[0]) * s,
    0., -focal.y / viewCenter[2], (focal.y * viewCenter[1]) * s,
    0., 0., 0.
);
The focal values are obtained by extracting the first and sixth elements (indices 0 and 5) from the projection matrix. For a standard perspective matrix, these elements are the cotangent of the half field of view, i.e. 1/tan(fov/2), with the first one additionally divided by the aspect ratio. Multiplying them by the half width and half height of the canvas, respectively, yields the focal length expressed in pixels, which plays the role of n in the Jacobian above. In theory, both focal.x and focal.y should be equal.
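As a sanity check, assuming the projection matrix was built with glMatrix.mat4.perspective(out, fovY, aspect, near, far), the same value can be derived directly from the vertical field of view; fovY, renderDimension, and devicePixelRatio are the same quantities used above.

// For a glMatrix perspective matrix, projectionMatrix[5] equals 1 / tan(fovY / 2),
// so focal.y can equivalently be computed straight from the field of view.
const focalFromFov = (renderDimension.y * devicePixelRatio * 0.5) / Math.tan(fovY * 0.5);
// focalFromFov should match focal.y (and, in theory, focal.x).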
With both the 3x3 model-view matrix W and the Jacobian matrix J, we can now update our covariance as follows:
const W = glMatrix.mat3.transpose(glMatrix.mat3.create(),
glMatrix.mat3.fromMat4(glMatrix.mat3.create(), arcballModelViewMatrix));
const T = glMatrix.mat3.multiply(glMatrix.mat3.create(), W, J);
let new_c = glMatrix.mat3.multiply(glMatrix.mat3.create(),
glMatrix.mat3.transpose(glMatrix.mat3.create(), T),
glMatrix.mat3.multiply(glMatrix.mat3.create(), this.covariance, T));
With the updated covariance, we extract its 2x2 upper-left corner, which represents the covariance of the projected 2D Gaussian. Next, we need to recover two basis vectors for the 2D Gaussian: these are the eigenvectors of the 2x2 covariance, corresponding to the major and minor axes of the projected ellipse.
This is calculated using the following formula:
// The 2x2 upper-left block of the projected covariance, (a, b; b, d), read from the column-major mat3
const cov2Dv = glMatrix.vec3.fromValues(new_c[0], new_c[1], new_c[4]);
let a = cov2Dv[0];
let b = cov2Dv[1];
let d = cov2Dv[2];
let D = a * d - b * b;
let trace = a + d;
let traceOver2 = 0.5 * trace;
let term2 = Math.sqrt(trace * trace / 4.0 - D);
let eigenValue1 = traceOver2 + term2;
let eigenValue2 = Math.max(traceOver2 - term2, 0.00); // prevent negative eigen value
const maxSplatSize = 1024.0;
let eigenVector1 = glMatrix.vec2.normalize(glMatrix.vec2.create(),
glMatrix.vec2.fromValues(b, eigenValue1 - a));
// since the eigen vectors are orthogonal, we derive the second one from the first
let eigenVector2 = glMatrix.vec2.fromValues(eigenVector1[1], -eigenVector1[0]);
// Scale each unit eigenvector by 4 standard deviations (sqrt of the eigenvalue),
// clamped to maxSplatSize, so the quad covers essentially the whole splat
let basisVector1 = glMatrix.vec2.scale(glMatrix.vec2.create(),
    eigenVector1, Math.min(Math.sqrt(eigenValue1) * 4, maxSplatSize));
let basisVector2 = glMatrix.vec2.scale(glMatrix.vec2.create(),
    eigenVector2, Math.min(Math.sqrt(eigenValue2) * 4, maxSplatSize));
In the vertex shader code, we utilize the basis vectors to construct a screen-aligned quad. The long edge of the quad should be parallel to the first basis vector, while the short edge should be parallel to the second basis vector.
// Project the splat's centroid and convert it to NDC
var clipCenter:vec4<f32> = projection * modelView * pos;
var ndcCenter:vec3<f32> = clipCenter.xyz / clipCenter.w;
// Offset the quad corner along the 2D basis vectors, converting from pixels to NDC
var ndcOffset:vec2<f32> = vec2(inPos.x * basisVector1 + inPos.y * basisVector2)
    * basisViewport;
out.clip_position = vec4<f32>(ndcCenter.xy + ndcOffset, ndcCenter.z, 1.0);
In the shader code, we manually calculate the position of the centroid in Normalized Device Coordinates (NDC). Since NDC has a range of [-1, 1] for both the x and y axes, we need to scale the ndcOffset. Because the Jacobian is calculated using pixels as the unit length, we need to scale it down by (\frac{w}{2}, \frac{h}{2}), where w and h are the width and height of the canvas, respectively.
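Put differently, basisViewport is just the reciprocal of the half canvas size in pixels, computed on the CPU and passed to the shader as a uniform; the uniform plumbing is omitted here, and renderDimension and devicePixelRatio are the same quantities used for the focal calculation.

// basisViewport converts pixel-space offsets into NDC units: (2 / width, 2 / height).
const basisViewport = glMatrix.vec2.fromValues(
    2.0 / (renderDimension.x * devicePixelRatio),
    2.0 / (renderDimension.y * devicePixelRatio));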
The rendering result is shown above. As we can see, the Gaussian is correctly rendered, overlapping perfectly with the ground truth point cloud.