January 18, 2025

Bottleneck of computing Exact Gaussian Process

Suppose we have n-dimensional Gaussian Process below which consist of n number of data.

\begin{align}
f_n 
\sim \mathcal{N} \left(
\mu_n, 
\Sigma_{nn}\right)
\end{align}

To represent this distribution, n-dimensional covariance matrix is used.

\Sigma_{nn} = \begin{pmatrix}
k(\mathbf{x}_1, \mathbf{x}_1) & \cdots & k(\mathbf{x}_1, \mathbf{x}_N) \\
\vdots & \ddots & \vdots \\
k(\mathbf{x}_N, \mathbf{x}_1) & \cdots & k(\mathbf{x}_N, \mathbf{x}_N)
\end{pmatrix}

And to infer posterior distribution, the inverted covariance matrix should be computed.

\begin{align}
\mu_{*|Y} = \mu_* + \Sigma_{*n} \Sigma_{nn}^{-1} (f_n - \mu_n)\\
\Sigma_{*|Y} = \Sigma_{**} - \Sigma_{*n} \Sigma_{nn}^{-1} \Sigma_{n*}
\end{align}

Since the Exact GP uses n-number of data as it is for representing distribution, this computation can be bottleneck in big data era. For example, this algorithm has a computational complexity of \mathcal{O}(n^3), \mathcal{O}(n^2) of memory is needed. This things is what inducing points is going to solve.

Inducing Points and its purpose

What the inducing point is intended is to do approximation. What if we can represent \Sigma_nn using small number of parameters or data, m? The computational costs becomes \mathcal{O}(m^3), \mathcal{O}(m^2) ( m << n)

The general idea is to approximation like below

\Sigma_{nn}^{-1} \approx \Sigma_{nm}\Sigma_{mm}^{-1}\Sigma_{mn}

It is so called Subsets of Regressors(SoR) approximation. ( it seems like Nyström approximation or low-rank approximation.)

Algorithms that utilize Inducing Points

KISS-GP is a method that uses a fixed grid structure of inducing points to efficiently approximate the kernel matrix. SKI extends KISS-GP by introducing more flexible interpolation techniques. SKIP further builds upon SKI to make it applicable to higher-dimensional data.

However, these methods are extensions of Exact Gaussian Processes (ExactGP) and do not fully eliminate the computational bottleneck. To address this, we introduce variational algorithms that, at first glance, resemble the modeling approach of inducing point methods.

Modeling

In order to see how this approximation is conducted it in detail, I will write down detailed graphical model of a Gaussian process as follows.

\begin{align}
p(y | f) = \mathcal{N}\left( y \mid f, \beta^{-1} \mathbf{I} \right)\\
p(f | u) = \mathcal{N}\left( f \mid \mathbf{K}_{nm} \mathbf{K}_{mm}^{-1} u, \mathbf{K}_{e} \right)\\
p(u) = \mathcal{N}(u \mid 0, \mathbf{K}_{mm})\\
\end{align}

However, the important point is that this matrix K_{nn} in K_e is included in the model. It turns out that this computational complexity is high.

\mathbf{K}_e = \mathbf{K}_{nn} - \mathbf{K}_{nm} \mathbf{K}_{mm}^{-1} \mathbf{K}_{mn}

In this paper[1], the correlation between n sampled points are removed so that K_{nn} is diagonal. This approximation may appear quite bold, but it allows us to significantly simplify the problem. And this modeling might be good enough if we carefully choose appropriate inducing points.

Optimization

So many methodologies of optimization are suggested, actually. One of the most important and famous method is Variational Inference, it is sometimes called as Sparse Variational Gaussian Process[1]. Some approaches are suggested, for example

\log p(y | u) = \log \mathbb{E}_{p(f | u)} \left[ p(y | f) \right] 
\geq \mathbb{E}_{p(f | u)} \left[ \log p(y | f) \right] = \mathcal{L}_1

It says that the more rough approximation you used, the faster computational complexity you get. But we should take care of fitting capability, which is the most important for Bayesian Optimization, to get quick convergence.

Reference

[1] James Hensman, Nicolo Fusi, Neil D. Lawrence “Gaussian Process for Big Data” https://arxiv.org/abs/1309.6835