October 19, 2024

The Evidence Lower Bound (ELBO) is a fundamental concept in variational inference, a technique for approximating complex probability distributions in Bayesian statistics. It revolves around the idea of optimizing a simpler distribution so that it approximates a more complex one, particularly in contexts where exact inference is computationally infeasible. It is useful whenever you need a loss-like quantity between distributions, for example in variational autoencoders (VAEs) or Gaussian processes.

What we want is a model of the data distribution, expressed through a latent variable. If Z follows a simple distribution such as a Gaussian, the model can be regarded as a (continuous) mixture model, built by integrating (or accumulating) many simple conditional models:

p(\mathbf{X}) = \int p(\mathbf{X} | \mathbf{Z}) p(\mathbf{Z}) d\mathbf{Z}
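As a concrete sketch (not part of the original derivation), when Z takes one of K discrete values the integral becomes a sum and p(X) is exactly a mixture of Gaussians. The snippet below computes p(x) this way with NumPy and SciPy; the component weights, means, and standard deviations are made-up numbers chosen purely for illustration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical latent prior p(z) over K = 3 discrete components
prior = np.array([0.5, 0.3, 0.2])   # p(z = k)
means = np.array([-2.0, 0.0, 3.0])  # parameters of p(x | z = k)
stds  = np.array([0.5, 1.0, 0.8])

def marginal_likelihood(x):
    # p(x) = sum_k p(x | z = k) p(z = k): the discrete analogue of the integral above
    return np.sum(norm.pdf(x, loc=means, scale=stds) * prior)

print(marginal_likelihood(0.7))
```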

The problem is that we cannot observe Z itself, so we have to estimate its distribution. And there is another problem: whenever additional data arrive, the shape of that distribution changes. This is why we introduce an approximating function.

The Evidence Lower Bound (ELBO) and Variational Inference

The goal is to approximate the posterior distribution p(Z | X) with an approximating function q(Z), where Z is the latent variable and X is the data. Let's start from Bayes' rule:

p(\mathbf{Z} | \mathbf{X}) = \frac{p(\mathbf{X} | \mathbf{Z}) p(\mathbf{Z})}{p(\mathbf{X})}

The difficulty lies in computing p(X), because it requires integrating over all values of Z.

Variational inference handles this by choosing a simpler, parameterized distribution q(Z) and optimizing its parameters so that it closely approximates p(Z|X).

p(z|x) \approx q(z|x) \approx q_i(z)
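One common choice, presumably what the subscript i refers to here, is the mean-field family, which factorizes q over the latent coordinates, with each factor often taken to be a Gaussian:

\begin{align}
q(z) = \prod_{i} q_i(z_i), \qquad q_i(z_i) = \mathcal{N}(z_i \,;\, \mu_i, \sigma_i^2)
\end{align}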

The variational inference avoids direct computation of p(X) by maximizing the ELBO, which is derived from the log-evidence.

\begin{align}
\log p(x) = \log \int_{z} p(x, z)\,dz = \log \int_{z} p(x, z)\frac{q(z)}{q(z)}\,dz \\
= \log\mathbb{E}_{q(z)}\!\left[\frac{p(x, z)}{q(z)}\right]
\end{align}

Using Jensen’s inequality,

\begin{align}
\log\mathbb{E}_{q(z)}\!\left[\frac{p(x, z)}{q(z)}\right]
\geq \mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z)}{q(z)}\right] \\
= \mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)] \\
= \mathbb{E}_{q(z)}[\log p(x, z)] + H(q(z)) = \text{ELBO}
\end{align}

Here H(·) denotes the Shannon (differential) entropy, H(q) = -E_q[log q(z)].
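The inequality can be checked numerically on a toy conjugate model where log p(x) is available in closed form. The sketch below is purely illustrative (the model z ~ N(0, 1), x | z ~ N(z, sigma_x^2) and all numbers are assumptions, not part of the original post); it estimates the ELBO by Monte Carlo and compares it with the exact log evidence.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model (illustrative): z ~ N(0, 1), x | z ~ N(z, sigma_x^2)
sigma_x = 0.5
x_obs = 1.3

# A deliberately imperfect variational distribution q(z) = N(m, s^2)
m, s = 0.8, 0.6
z = rng.normal(m, s, size=100_000)  # samples from q(z)

# ELBO = E_q[log p(x, z) - log q(z)], estimated by Monte Carlo
log_joint = norm.logpdf(x_obs, loc=z, scale=sigma_x) + norm.logpdf(z, 0.0, 1.0)
log_q = norm.logpdf(z, m, s)
elbo = np.mean(log_joint - log_q)

# Exact evidence for this conjugate model: p(x) = N(x; 0, 1 + sigma_x^2)
log_px = norm.logpdf(x_obs, 0.0, np.sqrt(1.0 + sigma_x**2))

print(f"log p(x) = {log_px:.4f}, ELBO = {elbo:.4f}")  # ELBO <= log p(x)
```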

Kullback-Leibler Divergence

The Kullback-Leibler divergence (KL divergence), which measures how far one probability distribution is from another (it is not a true distance, since it is not symmetric), is essential for understanding the ELBO. When we want to approximate p(z) with q(z), this discrepancy is defined as

\text{KL}(q(z) \parallel p(z)) = \int q(z) \log \frac{q(z)}{p(z)} dz
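For example, when both distributions are univariate Gaussians, q(z) = \mathcal{N}(\mu_1, \sigma_1^2) and p(z) = \mathcal{N}(\mu_2, \sigma_2^2), the integral has a well-known closed form:

\begin{align}
\text{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2) \parallel \mathcal{N}(\mu_2, \sigma_2^2)\big)
= \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}
\end{align}

This is the expression that reappears later as the regularization term of a VAE when p(z) is a standard normal.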

Now consider the KL divergence between q(z|x) and the posterior p(z|x), and expand p(z|x) using Bayes' rule:

\begin{align}
   \text{KL}(q_\phi(z|x) \| p(z|x)) = \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x) - \log p(z|x)]\\
   \log p(z|x) = \log p(x|z) + \log p(z) - \log p(x)\\
   \text{KL}(q_\phi(z|x) \| p(z|x)) = \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x) - \log p(x|z) - \log p(z) + \log p(x)]
\end{align}

Note that the term log p(x) will cancel against part of the ELBO. To see this, revisit the ELBO:

\begin{align}
   p(x, z) = p(x|z) p(z)\\
   \text{ELBO} = \mathbb{E}_{q_\phi(z|x)}[\log p(x, z) - \log q_\phi(z|x)]\\
   = \mathbb{E}_{q_\phi(z|x)}[\log p(x|z) + \log p(z) - \log q_\phi(z|x)]
\end{align}
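Substituting this expanded form into the KL expression above makes the cancellation explicit: log p(x) does not depend on z, so it comes out of the expectation, and the remaining terms are exactly minus the ELBO,

\begin{align}
\text{KL}(q_\phi(z|x) \| p(z|x)) = \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x) - \log p(x|z) - \log p(z)] + \log p(x) \\
= \log p(x) - \text{ELBO}
\end{align}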

Rearranging, we obtain a formulation of log p(x) expressed in terms of the ELBO:

\begin{align}
  \log p(x) = \text{ELBO} + D_{\text{KL}}(q(z|x) \parallel p(z|x))\\
 = \text{ELBO} + \mathbb{E}_{q(z|x)}[\log q(z|x) - \log p(z | x)]
\end{align}

Here D_{KL} denotes the Kullback-Leibler divergence, written out in the second line.

Rearranging the second term, the expected log posterior under q can be written as

\begin{align}
\mathbb{E}_{q(z|x)}[\log p(z | x)] = \mathbb{E}_{q(z|x)}[\log q(z|x)] - D_{\text{KL}}(q(z|x) \parallel p(z | x))
\end{align}

The KL divergence is always non-negative:

\text{KL}(q(z|x) \parallel p(z | x))\geq 0
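This non-negativity follows from Jensen's inequality applied to the (concave) logarithm:

\begin{align}
-\text{KL}(q(z|x) \parallel p(z|x)) = \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(z|x)}{q(z|x)}\right]
\leq \log \mathbb{E}_{q(z|x)}\!\left[\frac{p(z|x)}{q(z|x)}\right] = \log \int p(z|x)\, dz = \log 1 = 0
\end{align}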

Since the KL term cannot be negative, the ELBO can never exceed log p(x); it is a lower bound:

\begin{align}
\log p(x) = \text{ELBO} + \text{KL}(q(z|x) \parallel p(z | x))\\
\log p(x) \geq \text{ELBO} 
=\mathbb{E}_{q(z|x)}[\log p(x | z)] - \text{KL}(q(z|x) \parallel p(z))
\end{align}

What does this formulation mean?

By maximizing the ELBO, we effectively minimize the KL divergence between q(z|x) and the true posterior p(z|x). The first term in the last expression is the reconstruction term: it measures how well the data x is reconstructed from the latent variable z. The second term measures how far the approximating distribution q(z|x) is from the prior p(z), and acts as a regularizer.
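In a VAE this decomposition maps directly onto the training loss. The sketch below is a minimal illustration (it assumes a Bernoulli decoder and a diagonal Gaussian encoder producing mu and logvar; the function name and arguments are hypothetical): it implements the negative ELBO as reconstruction error plus KL regularizer, using the closed-form Gaussian KL from above with p(z) = N(0, I).

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon, mu, logvar):
    """Negative ELBO for a VAE with a Bernoulli decoder and N(0, I) prior."""
    # Reconstruction term: -E_q[log p(x|z)], approximated with a single sample of z
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Regularizer: KL(q(z|x) || p(z)) in closed form for a diagonal Gaussian q
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

Minimizing this loss over the encoder and decoder parameters is the same as maximizing the ELBO, i.e. tightening the lower bound on log p(x).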

Reference

https://yunfanj.com/blog/2021/01/11/ELBO.html