
The Mathematics of Latent Space in Generative Models

From Gaussian priors to disentangled representations — a visual guide

February 5, 2026 · 20 min read

Abstract

Latent space is the compressed, continuous manifold learned by generative models. This article derives the Evidence Lower Bound (ELBO) from first principles, unpacks the role of KL divergence as a regulariser, and explores how the geometry of the latent manifold directly governs sample quality, interpolation smoothness, and disentanglement.

#VAE · #Latent Space · #Information Theory · #Generative Models · #KL Divergence

1. Why Latent Representations?

High-dimensional data — images, text, audio — lies on or near a much lower-dimensional manifold. A 256×256 image has 65,536 pixel dimensions per colour channel, but the space of natural images occupies an astronomically smaller subspace. Generative models learn to map this subspace to a tractable latent distribution, enabling generation, interpolation, and controlled editing.
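
To make interpolation concrete, here is a minimal sketch in Python/PyTorch that linearly blends two latent codes and decodes each intermediate point. The decoder architecture and the 32-dimensional latent size are illustrative assumptions standing in for any trained decoder, not details from this article.

import torch

# Hypothetical decoder mapping a 32-dim latent code to a 256x256 image;
# any trained p_theta(x|z) network could stand in here.
latent_dim = 32
decoder = torch.nn.Sequential(
    torch.nn.Linear(latent_dim, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 256 * 256),
    torch.nn.Sigmoid(),
)

# Two latent codes, e.g. obtained by encoding two real images.
z_a = torch.randn(latent_dim)
z_b = torch.randn(latent_dim)

# Walk along the straight line between the two codes and decode each point.
# Smooth changes in the decoded images indicate a well-behaved latent manifold.
for alpha in torch.linspace(0.0, 1.0, steps=8):
    z = (1 - alpha) * z_a + alpha * z_b
    x = decoder(z).reshape(256, 256)  # decoded image at this interpolation step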

2. The Variational Autoencoder Framework

A VAE learns two distributions: an encoder q_phi(z|x) that approximates the true posterior p(z|x), and a decoder p_theta(x|z) that reconstructs data from latent codes. Because the true posterior is intractable, we maximise a lower bound on the log-likelihood called the ELBO.

Evidence Lower Bound (ELBO)

L(theta, phi; x) = E_{q_phi(z|x)}[log p_theta(x|z)] - KL(q_phi(z|x) || p(z))
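
The following is a minimal sketch of how this bound is estimated in practice, assuming PyTorch, fully-connected networks, and a Bernoulli likelihood over pixels; the layer sizes and names (enc, dec, to_mu, to_logvar) are illustrative assumptions, not taken from the referenced papers.

import torch
import torch.nn.functional as F

latent_dim, data_dim, hidden = 16, 784, 256

# Encoder q_phi(z|x): outputs the mean and log-variance of a diagonal Gaussian.
enc = torch.nn.Sequential(torch.nn.Linear(data_dim, hidden), torch.nn.ReLU())
to_mu = torch.nn.Linear(hidden, latent_dim)
to_logvar = torch.nn.Linear(hidden, latent_dim)

# Decoder p_theta(x|z): maps a latent code back to pixel probabilities.
dec = torch.nn.Sequential(
    torch.nn.Linear(latent_dim, hidden), torch.nn.ReLU(),
    torch.nn.Linear(hidden, data_dim), torch.nn.Sigmoid(),
)

def elbo(x):
    # x: batch of flattened images with pixel values in [0, 1].
    h = enc(x)
    mu, logvar = to_mu(h), to_logvar(h)

    # Reparameterisation trick: z = mu + sigma * eps keeps sampling differentiable.
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps

    # E_q[log p_theta(x|z)], here a Bernoulli log-likelihood over pixels.
    recon = -F.binary_cross_entropy(dec(z), x, reduction="sum")

    # KL(q_phi(z|x) || N(0, I)) in closed form (see Section 3).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon - kl  # maximise this, or minimise its negative

Because the sampling step is reparameterised, both terms of the bound can be optimised jointly with ordinary gradient descent.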

3. The Role of KL Divergence

The KL divergence term acts as a regulariser, pushing the approximate posterior towards the prior p(z) = N(0, I). When both are Gaussian, this has a closed-form solution. Without this term, the model collapses to a standard autoencoder with a discontinuous, non-generative latent space.

KL Divergence (Gaussian closed form)

KL(N(mu, sigma^2) || N(0, I)) = -0.5 * sum_j (1 + log(sigma_j^2) - mu_j^2 - sigma_j^2)

where j indexes the latent dimensions.
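
As a quick sanity check on the closed form, the snippet below evaluates it for arbitrary example parameters and compares the result against the analytic KL from torch.distributions; the numbers themselves carry no meaning.

import torch
from torch.distributions import Normal, kl_divergence

# Illustrative per-dimension posterior parameters for a 4-dim latent space.
mu = torch.tensor([0.5, -1.0, 0.0, 2.0])
logvar = torch.tensor([0.0, -0.5, 1.0, 0.2])
sigma = torch.exp(0.5 * logvar)

# Closed-form expression from above, summed over latent dimensions.
kl_closed = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

# Cross-check against the library's analytic KL for factorised Gaussians.
kl_lib = kl_divergence(Normal(mu, sigma), Normal(0.0, 1.0)).sum()

print(kl_closed.item(), kl_lib.item())  # the two values agree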

4. Geometry and Disentanglement

Disentanglement means that individual latent dimensions correspond to independent generative factors (e.g., rotation, scale, colour). beta-VAE scales the KL term by a factor beta > 1 to encourage axis-aligned representations, trading some reconstruction fidelity for this structure. The geometric intuition: a higher beta forces a more spherical, isotropic posterior, reducing entanglement between dimensions.

beta-VAE Objective

L_beta = E[log p_theta(x|z)] - beta * KL(q_phi(z|x) || p(z))
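
Below is a minimal sketch of the corresponding training loss, assuming PyTorch and the same Bernoulli decoder convention as the Section 2 sketch; the function name and the beta = 4.0 default are illustrative assumptions.

import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    # Reconstruction term: -E_q[log p_theta(x|z)] for a Bernoulli decoder,
    # where x_recon holds the decoder's pixel probabilities.
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")

    # KL(q_phi(z|x) || N(0, I)), same closed form as Section 3.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    # beta > 1 up-weights the KL penalty, trading reconstruction
    # fidelity for a more factorised, axis-aligned posterior.
    return recon + beta * kl

Setting beta = 1 recovers the standard ELBO of Section 2.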

References

[1] Kingma & Welling (2013). Auto-Encoding Variational Bayes.

[2] Higgins et al. (2017). beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework.