October 19, 2024

I am reading the paper “Scalable Variational Gaussian Process Classification” by James Hensman, Alex Matthews and Zoubin Ghahramani (https://arxiv.org/abs/1411.2005). A GP is basically a tool for fitting real-valued functions, but some methods incorporate classification into the GP framework. The trouble is that GP inference has a painful computational cost. This paper shows not only how to scale the model within a variational inducing-point framework, but also how to handle the classification task with that method.

Basic Formulation of Classification on GP

As we know well, the data generation process of a Gaussian process is described by a normal distribution. Let's say we have binary outputs from the GP; then the joint distribution of the labels and the latent function values is written as

p(y, f) = \prod_{i=1}^{n} \mathcal{B}(y_i \mid \phi(f_i)) \, \mathcal{N}(f \mid 0, K_{nn})
\mathcal{B}(y_i \mid \phi(f_i)) = \phi(f_i)^{y_i} (1 - \phi(f_i))^{1 - y_i}

As in GP regression, the computational cost is dominated by inverting the covariance matrix, which is of order N^3.
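To make the generative model above concrete, here is a minimal sketch in NumPy/SciPy, assuming an RBF kernel and the probit link for phi; the helper name rbf_kernel and the settings are illustrative choices, not taken from the paper.

import numpy as np
from scipy.stats import norm

def rbf_kernel(X, Z, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel k(x, z).
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-3, 3, size=(n, 1))

K_nn = rbf_kernel(X, X) + 1e-6 * np.eye(n)       # jitter for numerical stability
f = rng.multivariate_normal(np.zeros(n), K_nn)   # f ~ N(0, K_nn)
p = norm.cdf(f)                                  # phi(f_i), probit link
y = rng.binomial(1, p)                           # y_i ~ Bernoulli(phi(f_i))

# Exact inference would need to factorise or invert K_nn, which is O(n^3):
L = np.linalg.cholesky(K_nn)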

Training

Inducing Points

The method augments the latent variables with additional input-output pairs Z, u, known as inducing inputs and inducing variables.
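As a rough sketch of that augmentation (reusing rbf_kernel, X and rng from the snippet above; choosing Z as a random subset of X is only an illustrative initialisation, whereas the paper treats Z as free parameters):

m = 20
Z = X[rng.choice(n, size=m, replace=False)]     # inducing inputs Z, with m << n

K_mm = rbf_kernel(Z, Z) + 1e-6 * np.eye(m)      # covariance among the inducing variables u
K_nm = rbf_kernel(X, Z)                         # cross-covariance between f and u

u = rng.multivariate_normal(np.zeros(m), K_mm)  # u ~ N(0, K_mm)
# The expensive linear algebra now only involves the m x m matrix K_mm.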

Fully Independent Training Conditional (FITC)

Instead of directly inverting the full covariance matrix, some methods eliminate dependencies between the variables.

Normally, p(y | u) reflects the dependencies among these components. However, the formula below assumes that the y_i are conditionally independent given u, thus ignoring these dependencies and simplifying the computation. Originally, the joint distribution had correlations between y_i and y_j, but these are removed in the approximation.

p(y | u) \approx  \prod_{i=1}^{n} p(y_i | u)

This is the same kind of approximation used in the regression setting, e.g. in sparse Gaussian process regression.
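A small sketch of what this factorisation looks like numerically, reusing K_nn, K_mm, K_nm and u from above: under p(f | u) the conditional covariance is K_nn - K_nm K_mm^{-1} K_mn, and the FITC-style approximation keeps only its diagonal, which is exactly what lets p(y | u) factorise over data points.

K_mm_inv_K_mn = np.linalg.solve(K_mm, K_nm.T)           # K_mm^{-1} K_mn, costs O(m^2 n)
Q_nn_diag = np.einsum('nm,mn->n', K_nm, K_mm_inv_K_mn)  # diag of K_nm K_mm^{-1} K_mn

cond_mean = K_nm @ np.linalg.solve(K_mm, u)             # E[f_i | u]
cond_var = np.diag(K_nn) - Q_nn_diag                    # Var[f_i | u], diagonal only

# Each y_i now depends on u only through its own (cond_mean[i], cond_var[i]),
# so p(y | u) ≈ prod_i p(y_i | u).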

Posterior Inference

In inference, however, things are quite different. Unlike the Gaussian posterior of GP regression, the posterior here cannot be obtained analytically, since the Bernoulli likelihood makes it non-Gaussian.

When using FITC, Gaussian process classification approximates the posterior distribution using methods such as the Laplace approximation or Expectation Propagation (EP).
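For intuition, here is a minimal sketch of the Laplace approximation for the (non-sparse) probit GP classifier: find the mode of log p(y | f) + log p(f) with Newton's method and place a Gaussian there. This follows the standard textbook recipe (Rasmussen and Williams, Algorithm 3.1) and is only an illustration of the idea, not the variational scheme proposed in the paper; it reuses K_nn and y from the first snippet.

def laplace_mode(K, y, n_iters=20):
    # Newton iterations for the posterior mode f_hat, labels y in {0, 1}.
    t = 2.0 * y - 1.0                          # map labels to {-1, +1}
    f = np.zeros(len(y))
    for _ in range(n_iters):
        z = t * f
        ratio = norm.pdf(z) / np.clip(norm.cdf(z), 1e-12, None)
        grad = t * ratio                       # d log p(y|f) / df
        W = ratio * (ratio + z)                # -d^2 log p(y|f) / df^2 (diagonal)
        B = np.eye(len(y)) + np.sqrt(W)[:, None] * K * np.sqrt(W)[None, :]
        b = W * f + grad
        a = b - np.sqrt(W) * np.linalg.solve(B, np.sqrt(W) * (K @ b))
        f = K @ a
    return f

f_hat = laplace_mode(K_nn, y)                  # the Gaussian approximation is centred at f_hat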

I haven’t organized my writing yet, but I’ll continue when I have more time.