As an extreme visual learner, grasping concepts without a mental picture or a concrete example poses a significant challenge for me. Throughout my education, the majority of math learning materials overlooked the importance of providing visual interpretations, concrete examples, or relatable metaphors. This absence made understanding these concepts an uphill battle.
One common argument against emphasizing intuition in learning is the concern that it might compromise the rigor of the true mathematical form. A poorly crafted visualization or example could potentially lead to misunderstandings. However, I believe it's advantageous to commence with Newton's laws, even though general relativity offers a closer approximation to reality.
While the lack of visual explanations isn't solely responsible for my frustration, it's evident that as a visual learner growing up, I found myself in the minority. In a classroom where most students seemed to effortlessly grasp concepts, I often found myself as the sole voice expressing difficulty and seeking visual support.
Today, I dived deep into diffusion models, and once again, I encountered the frustration of a lack of visual and down-to-earth explanations. This experience drives me to document my current understanding, with the goal of creating an example that showcases my ideal explanation—one that incorporates visuals and practical examples, illustrating how I believe this topic should be presented.
Let's consider the scenario where we aim to find a function that accurately defines a volume resembling a cow. Our aim is to visually render this cow by sampling points inside its volume and illustrating them. With a sufficient number of sampled points, we can effectively draw the intricate details of a cow.
However, as we're aware, discovering the precise mathematical formula for that cow function poses a considerable challenge. Yet, even without this exact formula, can we still sample points from it, especially given the thousands of points that have already been sampled from the cow's volume?
Let's envision the cow as a balloon. If we heat the balloon, the air within it expands; the air molecules become more agitated, colliding with each other and the balloon's boundaries, resulting in increased inner pressure. If we persist, the balloon eventually assumes the form of a perfect sphere. Thankfully, we possess a precise mathematical formula for a sphere, simplifying the process of sampling points in its volume. Now, if we can mimic the inverse process—cooling down the cow-shaped balloon—we can reposition points from inside the sphere to inside the cow volume. Even without discovering the mathematical formula defining the cow's volume, this approach enables us to achieve the task of drawing the cow.
As we heat the balloon and its enclosed air molecules, their movement within a set time frame increases. Another perspective to explain this phenomenon is that we disrupt the initial positions of the air molecules by introducing a random offset.
Considering the air molecules as our sampled points, the inflation process essentially introduces randomness to their positions, while the deflating process works to diminish this randomness, aiming to restore their initial positions.
This encapsulates the essence of diffusion models. In this context, all images, whether artificial or natural, exist within a high-dimensional distribution. This distribution isn't limited to cow-like shapes; its complexity goes far beyond that.
Directly modeling this distribution is intractable. However, we possess an abundance of samples belonging to it: our training images. Our objective is to simulate the inflation process by introducing noise to these sample images, causing the initial distribution to transform into a sphere, i.e. a Gaussian distribution. Subsequently, we task a neural network with learning the reverse step: transforming a sample from the Gaussian distribution back into one that belongs to the original image distribution.
The reverse process resembles training an artist to paint a clear portrait of a model seen only through frosted glass. The artist hones this skill by practicing on millions of models, observed both behind and in front of the frosted glass, learning to capture details despite the obscured view.
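To make this picture concrete before any math, here is a minimal NumPy sketch of the "inflation" half of the story. It is purely my own toy illustration (the crescent-shaped point cloud and all the constants are made up, and it is not code from any diffusion library): a 2D point cloud stands in for the image distribution, and repeatedly adding Gaussian offsets erases its shape until only an isotropic Gaussian ball remains. The learned "deflation" step is only hinted at in the final comment.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# A toy stand-in for the image distribution: a crescent-shaped 2D point cloud.
angles = rng.uniform(0.0, np.pi, size=5000)
cloud = np.stack([np.cos(angles), np.sin(angles)], axis=1)
cloud += 0.05 * rng.normal(size=cloud.shape)

# "Heat the balloon": keep adding small Gaussian offsets to every point.
noisy = cloud.copy()
for _ in range(200):
    noisy += 0.3 * rng.normal(size=noisy.shape)

# The crescent's elongated shape is gone: the covariance of the noised cloud
# is close to a (large) multiple of the identity, i.e. an isotropic Gaussian ball.
print("original covariance:\n", np.cov(cloud.T).round(2))
print("noised covariance:\n", np.cov(noisy.T).round(2))

# A diffusion model is trained to run this process in reverse: start from
# samples of the Gaussian ball and step them back onto the original cloud.
```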
When presented with a data-point x_0 sampled from the actual data distribution q(x) (x_0 \sim q(x)), a forward diffusion process can be defined. This process involves adding Gaussian noise with variance \beta_{t} to x_{t-1}, thereby generating a new sample x_{t} governed by the distribution q(x_{t} \mid x_{t-1}). This formulation of the diffusion process can be articulated as follows:

q(x_{t} \mid x_{t-1}) = \mathcal{N}(x_{t};\ \mu_{t} = \sqrt{1-\beta_{t}}\,x_{t-1},\ \Sigma_{t} = \beta_{t} I)
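As a sanity check of how I read this formula, a single forward step can be written in a few lines of NumPy. This is a sketch under my own naming (forward_diffusion_step and the fake image x0 are assumptions for illustration, not part of any reference implementation): sampling from \mathcal{N}(\sqrt{1-\beta_{t}}\,x_{t-1}, \beta_{t} I) is the same as scaling x_{t-1} by \sqrt{1-\beta_{t}} and adding \sqrt{\beta_{t}} times standard Gaussian noise.

```python
import numpy as np

def forward_diffusion_step(x_prev: np.ndarray, beta_t: float,
                           rng: np.random.Generator) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.normal(size=x_prev.shape)              # epsilon ~ N(0, I)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(42)
x0 = rng.normal(size=(32, 32, 3))     # a made-up array standing in for an image x_0
x1 = forward_diffusion_step(x0, beta_t=0.02, rng=rng)
print(x1.shape, round(float(x1.std()), 3))
```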
To elucidate the symbols employed in the equation, let's look at their meanings. The notation q(x) represents the real image distribution, and x_0 denotes an image sampled from this distribution, expressed as x_0 \sim q(x). Commencing from x_0, a sequence of sampling steps is executed. At each step, starting with an existing sample x_{t-1}, a new sample x_{t} is generated from the distribution q(x_{t} \mid x_{t-1}). This new distribution, q(x_{t} \mid x_{t-1}), is constructed by centering a normal distribution at the point \sqrt{1-\beta_{t}}\,x_{t-1}, with a variance \Sigma_{t} = \beta_{t} I. If we envision a normal distribution, or Gaussian, as a sphere, it is defined by two key parameters: a mean, akin to the Gaussian's centroid, and a variance, akin to the square of the radius.
Grasping the radius part is relatively straightforward: I symbolizes the identity matrix, effectively yielding a unit length of 1, and it is scaled by the coefficient \beta_{t} to govern the radius's size (\sqrt{\beta_{t}}). The rationale behind multiplying the mean, or centroid, by \sqrt{1-\beta_{t}}, however, may not be immediately apparent.

For instance, consider a sample x_{t-1}; introducing noise to it implies creating a new sample within its Gaussian neighborhood, centered at x_{t-1}. But why is it necessary to scale this centroid?
To illustrate my intuition, let's turn to a 2D representation. Envision the initial distribution as a 2D cloud; introducing noise to it resembles inflating a balloon, expanding the cloud. If we keep doing so, the distribution keeps growing, which is undesirable: it leads to a large value range and introduces numerical instability. While we want to benefit from the ever-simplifying shape of the distribution, we don't want it to expand limitlessly. Therefore, by pre-multiplying the current distribution by a coefficient, we shrink it before introducing noise. This allows us to control the growth in size, ultimately converging to a standard Gaussian with a variance equal to I after numerous iterations.
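The following small experiment (again my own toy illustration, with an arbitrary \beta and step count) makes that intuition concrete: starting from unit-variance samples, it applies many noising steps either with or without the \sqrt{1-\beta_{t}} shrinkage and compares the resulting variance. Without the shrinkage the variance grows linearly with the number of steps; with it, the variance stays pinned near 1.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 0.02
num_steps = 1000

x_plain = rng.normal(size=100_000)   # starting samples with variance ~ 1
x_scaled = x_plain.copy()

for _ in range(num_steps):
    # Without shrinkage: just add noise of variance beta.
    x_plain = x_plain + np.sqrt(beta) * rng.normal(size=x_plain.shape)
    # With shrinkage: scale by sqrt(1 - beta) first, then add the same kind of noise.
    x_scaled = np.sqrt(1.0 - beta) * x_scaled + np.sqrt(beta) * rng.normal(size=x_scaled.shape)

print("variance without scaling:", round(x_plain.var(), 2))   # ~ 1 + num_steps * beta = 21
print("variance with scaling:   ", round(x_scaled.var(), 2))  # ~ 1
```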
A more formal explanation involves considering x_{t-1} as sampled from a distribution whose variance is already I. Our aim is to ensure that the variance of the new sample x_{t} remains I. This objective is achieved by scaling x_{t-1} before adding the noise \epsilon_{t} \sim \mathcal{N}(0, \beta_{t} I); let's denote this scaling factor as B. Its value can be calculated as follows:

\begin{aligned}
\mathrm{Var}(B\,x_{t-1} + \epsilon_{t}) &= 1 \\
B^2\,\mathrm{Var}(x_{t-1}) + \mathrm{Var}(\epsilon_{t}) &= 1 \\
B^2 + \beta_{t} &= 1 \\
B &= \sqrt{1 - \beta_{t}}
\end{aligned}
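A quick numerical check of this result, with a made-up \beta, in case you don't trust the algebra: scale unit-variance samples by B = \sqrt{1-\beta}, add noise of variance \beta, and the variance of the result comes out close to 1.

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 0.1

x_prev = rng.normal(size=1_000_000)   # samples with variance ~ 1, as assumed above
B = np.sqrt(1.0 - beta)
x_next = B * x_prev + np.sqrt(beta) * rng.normal(size=x_prev.shape)

print(round(x_next.var(), 3))         # ~ 1.0, matching B^2 + beta = 1
```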
To comprehend the derivation above, let's recall the formula \mathrm{Var}(a X + b Y) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y), where X and Y represent two random variables. Given that, in our scenario, x_{t-1} and the added noise \epsilon_{t} are independent, the covariance term becomes zero. A detailed proof (for a = b = 1) is given below:
\begin{split}
\mathrm{Var}(X+Y) &= \mathrm{E}\left[ ((X+Y)-\mathrm{E}(X+Y))^2 \right] \\
&= \mathrm{E}\left[ ([X-\mathrm{E}(X)] + [Y-\mathrm{E}(Y)])^2 \right] \\
&= \mathrm{E}\left[ (X-\mathrm{E}(X))^2 + (Y-\mathrm{E}(Y))^2 + 2 \, (X-\mathrm{E}(X)) (Y-\mathrm{E}(Y)) \right] \\
&= \mathrm{E}\left[ (X-\mathrm{E}(X))^2 \right] + \mathrm{E}\left[ (Y-\mathrm{E}(Y))^2 \right] + \mathrm{E}\left[ 2 \, (X-\mathrm{E}(X)) (Y-\mathrm{E}(Y)) \right] \\
&= \mathrm{Var}(X) + \mathrm{Var}(Y) + 2 \, \mathrm{Cov}(X,Y) \\
\end{split}
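The identity, covariance term included, is easy to sanity-check numerically. This is a toy verification of my own; the correlated example is deliberately constructed so that the covariance term matters.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

x = rng.normal(size=n)
y_indep = rng.normal(size=n)            # independent of x
y_corr = 0.5 * x + rng.normal(size=n)   # deliberately correlated with x

for label, y in [("independent", y_indep), ("correlated", y_corr)]:
    lhs = np.var(x + y)
    rhs = np.var(x) + np.var(y) + 2.0 * np.cov(x, y)[0, 1]
    print(f"{label:11s}  Var(X+Y) = {lhs:.3f}   Var(X) + Var(Y) + 2Cov(X,Y) = {rhs:.3f}")
```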
By fixing the variance, the sole flexible component becomes the mean or the centroid. If the Gaussians exclusively alter their centroids, without changing their sizes, they essentially navigate through the space. Consequently, our primary focus shifts to the trajectory. The entire process has evolved into tracking and characterizing this trajectory.