How Does FLAME Work

I recently had an engaging experience with a real-time AI talking head demo and became curious about developing something similar myself. Many relevant papers mention using the FLAME model as a proxy model to seed Gaussian splats. Although I had heard of FLAME for years, I hadn't explored it in depth until now.

FLAME is a flexible parametric head model capable of capturing variations in pose, identity, and expression. With approximately 400 parameters, it can generate a complete head model. Most of these parameters are learned from 3D scan data using Principal Component Analysis (PCA), which makes them challenging to interpret directly. A few additional parameters control pose.

The process of generating a 3D model in FLAME combines blend shapes with skeleton-based morphing. While the FLAME paper is light on details, its predecessor, the SMPL paper (which describes a full-body parametric model), provides valuable insight into the underlying principles.

Unfortunately, FLAME's codebase is not well-maintained. The existing implementations often don't work out of the box due to deprecated Python versions or outdated dependencies. Although some community members have attempted to patch the code, crucial fixes remain unmerged due to the project's inactivity. Researchers interested in using FLAME will likely need to investigate pending pull requests to resolve code-related issues.

FLAME Overview[SOURCE]

The big picture of how FLAME works is summarized well by the picture above. There are four steps overall. First, we create a shape in the neutral pose that reflects the identity and expression. Second, we adjust the joint positions according to the identity, because different people have their joints in different places. Next, we further adjust the base shape by applying an additional layer of displacement. This displacement accounts for shape changes caused by posing, for example the creases and wrinkles that come with bending a joint; such features can't be modeled by skeleton-based local rigid transformations alone. Finally, we apply the skeleton-based morphing to create large movements, like opening the jaw or turning the neck. There are four joints in total (jaw, neck, and the right and left eyes), plus a global transformation at the root joint. Since the FLAME model is untextured, the effect of rotating the eyes is barely visible, so the joints that really matter are the root, jaw, and neck.

The first step is obtaining a base shape. This base shape should reflect the head's identity, such as age, race, and gender, as well as its expression. This step is purely based on blend shapes. A blend shape can be viewed as a weighted average of a set of shapes sharing the same topology, meaning only the vertex positions differ among these shapes, while the number of triangles and their connectivity remain the same.
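As a toy illustration (with random stand-in data, not FLAME's), blending three shapes that share one topology is just a weighted sum of their vertex arrays:

import torch

# Toy example: three shapes with the same topology (same V x 3 vertex layout), blended by weights.
V = 5000                                               # roughly FLAME's vertex count
shapes = torch.randn(3, V, 3)                          # stand-in shapes sharing one topology
weights = torch.tensor([0.5, 0.3, 0.2])                # blend weights
blended = torch.einsum('k,kvi->vi', weights, shapes)   # a new V x 3 shape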

In its implementation, there is a template model, called vTemplate, which can be viewed as the mean of all faces in a neutral pose and expression. This template model has approximately 5,000 vertices and 9,000 triangles.

There are 400 parameters that alter the model's identity and expression. The first 300 control the identity, and the last 100 control the expression. Since they are learned using Principal Component Analysis (PCA), parameters at the beginning tend to have a larger effect on the model than those at the end. Consequently, some FLAME programs only expose the first several parameters for simplicity.

With a set of 400 parameters, the goal is to calculate vertex displacements, or offsets. This is obtained through matrix multiplication. The shape displacement tensor is V \times 3 \times 400 in shape (V is the number of vertices). Multiplying it with the 400 parameters as a vector yields vertex offsets of shape V \times 3. These offsets are then added to the template model.
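A minimal PyTorch sketch of this step; shapedirs is my assumed name for the V \times 3 \times 400 displacement tensor described above:

import torch

V = 5000                                # approximate number of vertices
v_template = torch.zeros(V, 3)          # stand-in for the mean template vTemplate
shapedirs = torch.randn(V, 3, 400)      # assumed name for the V x 3 x 400 shape displacement tensor
params = torch.zeros(400)               # 300 identity + 100 expression parameters
params[0] = 2.0                         # push the first (strongest) identity component

offsets = torch.einsum('vij,j->vi', shapedirs, params)   # V x 3 vertex offsets
v_shaped = v_template + offsets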

The resulting model has expression, but its pose remains neutral—its neck doesn't turn, and its jaw doesn't open. The subsequent steps will handle the pose component.

As the second step, we need to calculate the positions of the joints. There are four joints: the neck, the jaw, and the right and left eyes. In addition, we also define a global transformation at the root joint. The joint positions are correlated with the shape of the face, hence we perform this step after obtaining the blended shape. They are also calculated using a matrix multiplication, J_regressor \times S, where J_regressor is a J \times V matrix, J is the number of joints (5), and the shape S is a V \times 3 matrix.
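In code, this regression might look like the following sketch (the regressor here is a random stand-in; the real one is learned and shipped with the model):

import torch

V, J = 5000, 5
J_regressor = torch.rand(J, V)          # stand-in; each row expresses a joint as a weighted average of vertices
J_regressor = J_regressor / J_regressor.sum(dim=1, keepdim=True)
v_shaped = torch.randn(V, 3)            # the blended shape from the previous step
joints = J_regressor @ v_shaped         # J x 3 joint positions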

After determining the joint positions, the next step is to generate the pose feature. Joint poses are provided as a set of vectors, where the direction of a vector specifies the rotation axis and its norm indicates the rotation angle. The pose feature is constructed by subtracting the identity matrix from each of the rotation matrices corresponding to the joints, excluding the root pose. Excluding the root pose makes sense because the pose feature is intended to capture vertex displacements caused by local pose changes. The root pose represents the global transformation of the head and should not influence vertex-level displacements. However, the reasoning behind subtracting the identity matrix from the rotation matrices is less intuitive.

In the Paper, They Remove the Zero Pose From the Feature Vector[SOURCE]

In the SMPL paper, the pose feature includes a -R(\theta^*) term, where \theta^* represents the zero pose, i.e., no rotation, corresponding to the identity matrix. This suggests that the pose feature is designed to capture only deviations from the identity matrix, isolating the effect of relative rotations rather than absolute pose configurations.

The rotation matrices are computed using the Rodrigues formula:

The Rodrigues Formula[SOURCE]

In this formula, K is the skew-symmetric cross-product matrix derived from the unit rotation axis, and \theta is the rotation angle, equal to the norm of the pose vector. The K matrix is defined by the components of the rotation axis vector:

K Is Defined by the Rotation Axis[SOURCE]
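Here is a minimal PyTorch sketch of that conversion, following the formula above (my own helper, not FLAME's exact code):

import torch

def rodrigues(pose_vec, eps=1e-8):
    # pose_vec: a (3,) axis-angle vector; returns the 3 x 3 rotation matrix
    theta = pose_vec.norm()                  # rotation angle = norm of the pose vector
    k = pose_vec / (theta + eps)             # unit rotation axis
    K = torch.zeros(3, 3)                    # cross-product (skew-symmetric) matrix of k
    K[0, 1], K[0, 2] = -k[2], k[1]
    K[1, 0], K[1, 2] = k[2], -k[0]
    K[2, 0], K[2, 1] = -k[1], k[0]
    return torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)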

After converting the four pose vectors (excluding the root pose) into their respective rotation matrices, the resulting matrices are flattened into a single vector containing 4 \times 9 elements. This vector forms the pose feature.
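Continuing the sketch above (reusing the rodrigues helper), the pose feature could be built like this; the joint ordering is illustrative:

pose_vectors = torch.zeros(4, 3)        # axis-angle vectors for the four non-root joints (illustrative order)
pose_vectors[1, 0] = 0.3                # e.g. rotate the second joint slightly
pose_feature = torch.cat([
    (rodrigues(p) - torch.eye(3)).flatten() for p in pose_vectors
])                                      # 4 x 9 = 36 elements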

The pose feature is then used to compute an additional vertex displacement, referred to as the pose displacement. Human heads are covered in soft tissues that deform under various poses, such as wrinkles forming around the mouth or eyes. These deformations cannot be effectively modeled using skeleton-based transformations alone. The pose displacement accounts for these subtler, pose-dependent changes.

The pose displacement is calculated by multiplying the pose feature, a P \times 1 vector, with the pose displacement matrix, which has dimensions P \times V \times 3. This operation produces vertex-level displacements that are added to the shape to model the effects of pose-specific deformations.
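A sketch of this contraction, with posedirs as my assumed name and layout for the pose displacement tensor:

import torch

V, P = 5000, 36
posedirs = torch.randn(P, V, 3)              # assumed name/layout for the P x V x 3 pose displacement tensor
pose_feature = torch.zeros(P)                # the 36-element pose feature from the previous step
pose_offsets = torch.einsum('p,pvi->vi', pose_feature, posedirs)   # V x 3 displacements
v_posed = torch.randn(V, 3) + pose_offsets   # added to the blended shape (stand-in here)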

The final step involves skeleton morphing, which is more challenging to explain conceptually. While the code for this process is straightforward, I have not spent much time reasoning through the details. As such, I will focus on describing the procedure rather than its underlying rationale.

The first task in this step is to compute the relative position of each joint with respect to its parent joint. The joint positions we obtained earlier are expressed in absolute coordinates. To convert these to relative coordinates, the position of the parent joint is subtracted from the position of the current joint for all joints except the root.

Next, a transformation matrix is constructed for each joint, incorporating both its rotation and position. This transformation matrix is derived by concatenating the joint's rotation matrix with its relative position vector. This step corresponds to the transformation matrix equation described in the SMPL paper:

A Transformation Matrix Is Created by Concatenating the Rotation Matrix and the Relative Position Vector of a Joint[SOURCE]

To compute global transformations from these local transformations, the effects of each joint’s transformation matrix are accumulated through matrix multiplication with all its parent matrices. This accumulation ensures that the transformation at any joint reflects the combined influence of its local transformation and all preceding transformations in the hierarchy.
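Putting the last few paragraphs together, here is a rough sketch of building the 4 \times 4 transforms and chaining them down the hierarchy; the parent indices and joint order are illustrative, not necessarily FLAME's exact layout:

import torch

def make_transform(R, t):
    # Build a 4 x 4 homogeneous matrix from a 3 x 3 rotation and a translation vector
    T = torch.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

J = 5
parents = [-1, 0, 1, 1, 1]                 # illustrative hierarchy: root -> neck -> jaw / eyes
joints = torch.randn(J, 3)                 # absolute joint positions from the regressor
rotations = torch.eye(3).expand(J, 3, 3)   # per-joint rotation matrices (identity here)

rel_joints = joints.clone()                # convert absolute positions to parent-relative ones
rel_joints[1:] -= joints[parents[1:]]

transforms = []
for j in range(J):
    local = make_transform(rotations[j], rel_joints[j])
    world = local if parents[j] < 0 else transforms[parents[j]] @ local
    transforms.append(world)               # accumulate along the chain of ancestors
transforms = torch.stack(transforms)       # J x 4 x 4 global transformation matrices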

One aspect that I find unclear, and which the paper does not address adequately, is the question of the rotation origin. As we know, a pose transformation is defined by a rotation axis and an angle, but determining how to apply the rotation also requires specifying the rotation origin. The most intuitive choice for the origin would be the joint itself. Using this assumption, the transformation would involve first translating the joint to the origin, applying the rotation, and then translating it back to its original position. This could be represented mathematically as:

T \times R \times T^\prime

where T^\prime is the translation matrix that moves the joint to the origin, and T moves the joint back to its original position. However, I do not understand the reasoning behind the approach described in the paper, where a transformation matrix is simply constructed by concatenating a joint’s relative position and its rotation. This method does not seem to account for the rotation origin explicitly, which leaves its intuition unclear to me.

After this step, we obtain a set of global transformation matrices. The final column of each matrix provides the positions of the pose-adjusted joints. These adjusted joint positions, however, serve no functional purpose beyond visualization.

And finally, the process concludes with applying the skeleton-based transformations. This step is the most difficult to understand. Using the global transformations computed earlier, the code subtracts T \times J from them, where T is a joint's global transformation matrix and J is the homogeneous vector of the joint's position. There are a few points to clarify. First, the joint positions used here are absolute, not relative. Second, while J is described as the homogeneous coordinates of the joint position, its last element is actually zero instead of one. This step is implemented in the code as follows:

# F.pad places transforms @ joints_homogen in the last column of a 4 x 4 matrix (zeros elsewhere)
rel_transforms = transforms - F.pad(
        torch.matmul(transforms, joints_homogen), [3, 0, 0, 0, 0, 0, 0, 0])

This implementation corresponds to the following equation in the SMPL paper:

The Paper's Explanation of Removing the Transformation Due to Rest Pose[SOURCE]

The subtraction term is described as "removing the transformation due to the rest pose." However, I do not fully understand what this means. It might refer to a concept commonly known in computer animation, an area in which I lack sufficient background.

In the paper, this process of removing the transformation due to the rest pose is represented as applying an inverse matrix, G^{-1}. However, in the code, it is implemented through the subtraction shown above.
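One way I can make sense of the equivalence, with my own algebra (not taken from the paper): write a global transformation as G = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} and let j be the joint's rest position. Then

G \begin{bmatrix} I & -j \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} R & t - Rj \\ 0 & 1 \end{bmatrix} = G - \begin{bmatrix} 0 & Rj \\ 0 & 0 \end{bmatrix}

The matrix \begin{bmatrix} I & -j \\ 0 & 1 \end{bmatrix} is the inverse of the rest-pose transformation at that joint (a pure translation by j, since the rest pose has no rotation), which matches the G^{-1} in the paper. The subtracted term is exactly what the F.pad(torch.matmul(transforms, joints_homogen), ...) expression builds: because the last element of joints_homogen is zero, the product picks out Rj without the translation t. As a sanity check, when no parent joint is rotated, t equals j, and the resulting matrix maps a vertex v to Rv + j - Rj = R(v - j) + j, which is the T \times R \times T^\prime rotate-about-the-joint form discussed earlier.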

Once the above transformations are obtained, using them to calculate the final shape is relatively straightforward.
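Concretely, this final step is linear blend skinning: each vertex is moved by a weighted combination of the joint transformations. A minimal sketch, where lbs_weights is my assumed name for the per-vertex skinning weights shipped with the model and rel_transforms stands in for the rest-pose-removed transforms computed above:

import torch

V, J = 5000, 5
v_posed = torch.randn(V, 3)                     # shape after the identity, expression, and pose blend shapes
rel_transforms = torch.eye(4).expand(J, 4, 4)   # rest-pose-removed joint transforms (identity stand-in)
lbs_weights = torch.rand(V, J)                  # assumed name for the per-vertex skinning weights
lbs_weights = lbs_weights / lbs_weights.sum(dim=1, keepdim=True)

T = torch.einsum('vj,jab->vab', lbs_weights, rel_transforms)   # one blended 4 x 4 matrix per vertex
v_hom = torch.cat([v_posed, torch.ones(V, 1)], dim=1)          # homogeneous coordinates
verts = torch.einsum('vab,vb->va', T, v_hom)[:, :3]            # final vertex positions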

The FLAME model is often applied to track head position, shape, and expression in 2D images or videos. Given a 2D image of a human head, the goal is to find the parameters that fit the FLAME model to the image. Direct comparison in pixel space can be too challenging, so the RingNet paper instead uses facial landmarks. To achieve this, we also need to extract facial landmarks from the FLAME model. Unfortunately, RingNet's implementation is part of another poorly maintained Python codebase, which I eventually gave up trying to run.

In the RingNet paper, facial landmarks are categorized into static and dynamic landmarks, a distinction that was initially confusing. Static landmarks are straightforward; features like the nose and most other facial landmarks are static. The dynamic landmark, however, refers specifically to the cheek line. Its "dynamic" nature arises because it depends on the viewing angle.

As shown in the image below, when viewing a face from the side, only a segment of the cheek line corresponds to the actual contour of the face. The cheek line on the far side of the face becomes occluded, and the dynamic landmark adapts to represent the visible silhouette instead.

When One Side of the Cheek Line Is Occluded, the Dynamic Landmark Becomes the Silhouette[SOURCE]

In the code, both static and dynamic landmarks are provided as IDs of the triangles containing a landmark point, along with the barycentric coordinates of those points.

Each Landmark Contains an Array of Positions[SOURCE]

For dynamic landmarks, each landmark point consists of an array of potential positions. To visualize these landmarks, one position must be selected from the array for each point. The selection process depends on the current viewing angle.
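A sketch of how a (triangle ID, barycentric coordinates) pair could be turned into a 3D landmark position; the array names and index values are illustrative:

import torch

verts = torch.randn(5000, 3)                  # posed FLAME vertices (stand-in)
faces = torch.randint(0, 5000, (9000, 3))     # triangle vertex indices (stand-in)
lmk_face_idx = torch.tensor([123, 456])       # triangles containing two landmarks (illustrative IDs)
lmk_bary = torch.tensor([[0.2, 0.3, 0.5],     # barycentric coordinates inside each triangle
                         [0.6, 0.1, 0.3]])

tri_verts = verts[faces[lmk_face_idx]]        # 2 x 3 x 3: the three corners of each landmark's triangle
landmarks = torch.einsum('lk,lki->li', lmk_bary, tri_verts)    # 2 x 3 landmark positions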

Mask Areas of the FLAME Model[SOURCE]

Finally, FLAME includes a mask file that specifies various facial regions, represented as sets of face IDs corresponding to those regions.
