October 19, 2024

About

Kernel PCA and the GPLVM both use kernel functions, and their purpose is almost the same: unsupervised learning of a low-dimensional representation. So what is the difference, and how should we choose between them for our use cases?

Kernel PCA

Kernel PCA is a method that extends linear PCA to nonlinear data. It nonlinearly maps the data into a high-dimensional feature space and then applies ordinary PCA in that space, which makes it possible to extract principal components from data with a nonlinear structure.

First of all, the Kernel PCA computation flow is to compute the kernel matrix, the so-called Gram matrix (1). Then, the kernel matrix needs to be centered in the feature space (2).

\begin{align}
K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\\
K' = (I - \frac{1}{n}\mathbf{1}\mathbf{1}^T)K(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T)
\end{align}

In the last step, solve the eigenvalue problem of the centered kernel matrix.

\begin{align}
 K'v = \lambda v
\end{align}

Choose the eigenvectors with the largest eigenvalues and use them to form a new representation of the data (the principal components). What’s important here is that the kernel and its parameters are never optimized; they must be chosen and fixed in advance.
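To make the flow concrete, here is a minimal NumPy sketch of the fitting procedure above. The RBF kernel and the hand-chosen gamma are assumptions for illustration; any valid kernel could be substituted, and the helper names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise squared Euclidean distances, then the RBF kernel.
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq_dists)

def kernel_pca_fit(X, n_components=2, gamma=1.0):
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)                      # (1) Gram matrix
    J = np.eye(n) - np.ones((n, n)) / n
    K_centered = J @ K @ J                           # (2) centering in feature space
    eigvals, eigvecs = np.linalg.eigh(K_centered)    # eigenvalue problem K'v = lambda v
    idx = np.argsort(eigvals)[::-1][:n_components]   # keep the largest eigenvalues
    eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]
    alphas = eigvecs / np.sqrt(eigvals)              # normalize eigenvectors in feature space
    components = K_centered @ alphas                 # principal components of the training data
    return K, alphas, components
```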

You can also compute the principal component values of new points using these eigenvectors and the training mean values, as shown in the sketch after the following steps.

  1. Compute a kernel vector by evaluating the kernel function between the new input point and each training data point.
  2. Center this kernel vector using the training means (the same feature-space centering as above).
  3. Project the centered kernel vector from step 2 onto the eigenvectors to obtain the principal components.
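Continuing the sketch above, the projection of a new point might look like this (kernel_pca_fit and rbf_kernel are the hypothetical helpers from the previous snippet):

```python
def kernel_pca_transform(x_new, X_train, K_train, alphas, gamma=1.0):
    # 1. Kernel vector between the new point and every training point.
    k_new = rbf_kernel(x_new[None, :], X_train, gamma).ravel()
    # 2. Center it using the training-set means (feature-space centering).
    k_centered = k_new - k_new.mean() - K_train.mean(axis=0) + K_train.mean()
    # 3. Project onto the eigenvectors to obtain the principal components.
    return k_centered @ alphas
```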

Gaussian Process Latent Variable Model (GPLVM)

We assume that the latent variables X follow a multivariate normal distribution and consider a model in which the data Y are generated from these latent variables through a Gaussian process. This is the model known as the Gaussian Process Latent Variable Model (GPLVM).

The joint probability in this model can be written as follows:

\begin{align}
p(\mathbf{X}) = \prod_{i=1}^{N} \mathcal{N}(\mathbf{x}_i | \mathbf{0}, \mathbf{I})\\
p(\mathbf{Y}, \mathbf{X}) = p(\mathbf{Y}|\mathbf{X})\,p(\mathbf{X}) = \prod_{d=1}^{D} \mathcal{N}(\mathbf{y}^{(d)} | \mathbf{0}, K_X + \sigma^2 \mathbf{I}) \prod_{i=1}^{N} \mathcal{N}(\mathbf{x}_i | \mathbf{0}, \mathbf{I})
\end{align}

Here, K_X is the kernel matrix computed from the latent variables X, representing the relationships among the latent points, and σ² is the observation noise variance.
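For example, a common (but not mandatory) choice is an RBF kernel over the latent points, with a signal variance and a lengthscale as its parameters:

\begin{align}
(K_X)_{ij} = \sigma_f^2 \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\ell^2}\right)
\end{align}

These kernel parameters, together with the noise variance σ², are the quantities that will be optimized below.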

Since the density of a multivariate normal distribution is written as follows,

\begin{align}
p(\mathbf{x} | \boldsymbol{\mu}, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)
\end{align}

the likelihood of Y given X can be written as

\begin{align}
p(\mathbf{Y} | \mathbf{X}) &= \prod_{d=1}^{D} \mathcal{N}(\mathbf{y}^{(d)} | \mathbf{0}, K_X + \sigma^2 \mathbf{I})\\
&= \frac{1}{\sqrt{(2\pi)^{ND} |K_X+\sigma^2 I|^{D}}} \exp\left(-\frac{1}{2} \sum_{d=1}^{D} \mathbf{y}^{(d)T} (K_X+\sigma^2 I)^{-1} \mathbf{y}^{(d)}\right)\\
&= \frac{1}{\sqrt{(2\pi)^{ND} |K_X+\sigma^2 I|^{D}}} \exp\left(-\frac{1}{2} \operatorname{tr}\!\left(\mathbf{Y}^T (K_X+\sigma^2 I)^{-1} \mathbf{Y}\right)\right)
\end{align}

where Y is the N × D data matrix and y^(d) denotes its d-th column.

By taking the logarithm of this likelihood,

\begin{align}
\log p(\mathbf{Y} | \mathbf{X}) = -\frac{ND}{2} \log (2\pi) - \frac{D}{2} \log |K_X+\sigma^2 I| - \frac{1}{2} \operatorname{tr}\!\left(\mathbf{Y}^T (K_X+\sigma^2 I)^{-1} \mathbf{Y}\right)
\end{align}

and maximizing it with respect to the latent variables X and the kernel parameters of K_X (for example, by gradient-based optimization), both the latent coordinates and the kernel parameters are optimized.
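As a rough sketch of what this optimization can look like in practice, the following NumPy/SciPy code minimizes the negative log-likelihood with respect to the latent coordinates and the (log) kernel parameters. The RBF kernel, the PCA initialization, and the use of L-BFGS-B with numerical gradients are illustrative assumptions, not the only possible choices.

```python
import numpy as np
from scipy.optimize import minimize

def latent_rbf_kernel(X, variance, lengthscale):
    # K_X[i, j] = variance * exp(-||x_i - x_j||^2 / (2 * lengthscale^2))
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def negative_log_likelihood(params, Y, latent_dim):
    N, D = Y.shape
    X = params[:N * latent_dim].reshape(N, latent_dim)
    log_var, log_len, log_noise = params[N * latent_dim:]
    K = latent_rbf_kernel(X, np.exp(log_var), np.exp(log_len)) + np.exp(log_noise) * np.eye(N)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))   # (K_X + sigma^2 I)^{-1} Y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    # -log p(Y|X) = ND/2 log(2 pi) + D/2 log|K| + 1/2 tr(Y^T K^{-1} Y)
    return 0.5 * (N * D * np.log(2 * np.pi) + D * log_det + np.sum(Y * alpha))

def fit_gplvm(Y, latent_dim=2):
    N = Y.shape[0]
    Yc = Y - Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)     # initialize X with linear PCA
    X0 = Yc @ Vt[:latent_dim].T
    params0 = np.concatenate([X0.ravel(), np.zeros(3)])   # log variance / lengthscale / noise
    res = minimize(negative_log_likelihood, params0, args=(Y, latent_dim), method="L-BFGS-B")
    return res.x[:N * latent_dim].reshape(N, latent_dim)  # optimized latent coordinates
```

Because this objective is non-convex, different initializations can lead to different local optima, which is the point made in the comparison below.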

So… What is the difference?

  • Kernel PCA
    • The kernel and its parameters are fixed in advance; there is no flexibility to optimize them
    • The computation is simple: center the kernel matrix and solve one eigenvalue problem
  • GPLVM
    • It is a probabilistic model
    • It can handle missing values
    • The optimization is non-convex, so it does not guarantee a global optimum
    • The dimensionality of the latent space can be specified