KPCA
Rita Osadchy
Some slides are adapted from Schölkopf, Smola, Müller, and Precup
Dimensionality Reduction
Data representation
Inputs are real-valued vectors in a high-dimensional space.
Linear structure
Does the data live in a low-dimensional subspace?
Nonlinear structure
Does the data live on a low-dimensional submanifold?
PCA
Kernel PCA
Notations
Inputs (high dimensional): $x_1, x_2, \ldots, x_n$, points in $\mathbb{R}^D$. Outputs (low dimensional): $y_1, y_2, \ldots, y_n$, points in $\mathbb{R}^d$, with $d \ll D$.
Example: a map $\Phi: \mathbb{R}^2 \to \mathbb{R}^3$
$(x_1, x_2) \mapsto (x_1, x_2, x_1^2 + x_2^2)$
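As an illustrative sketch (NumPy assumed, not part of the original slides), this map sends points on concentric circles to points whose third coordinate is the squared radius, so the two circles become separable by a plane in $\mathbb{R}^3$:

```python
import numpy as np

def phi(x):
    """The map on the slide: (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2 + x2**2])

# Points on two concentric circles of radius 1 and 3.
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
inner = np.stack([np.cos(angles), np.sin(angles)], axis=1)
outer = 3 * np.stack([np.cos(angles), np.sin(angles)], axis=1)

# The third coordinate is the squared radius: 1 for the inner circle,
# 9 for the outer one, so the plane z = 5 separates the two circles.
print([round(phi(p)[2], 3) for p in inner])   # all 1.0
print([round(phi(p)[2], 3) for p in outer])   # all 9.0
```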
Kernel Trick
High-dimensional mappings can seriously increase computation time.
Can we get around this problem and still get the benefit of a high-dimensional feature space? Yes! Kernel trick:
$K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$
Given any algorithm that can be expressed solely in terms of dot products, this trick allows us to construct a nonlinear version of it.
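A minimal sketch of this identity (NumPy assumed): for the degree-2 homogeneous polynomial kernel $K(x, y) = (x^T y)^2$ on $\mathbb{R}^2$, the kernel value equals the dot product of the explicit feature map $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, but can be computed without ever forming $\Phi$:

```python
import numpy as np

def phi_poly2(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k_poly2(x, y):
    """Kernel trick: K(x, y) = (x^T y)^2, no explicit feature map needed."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(np.dot(phi_poly2(x), phi_poly2(y)))  # 1.0  (dot product in feature space)
print(k_poly2(x, y))                       # 1.0  (same value via the kernel)
```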
Popular Kernels
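The slide's list of kernels is not reproduced in this text; as a sketch, two of the most commonly used choices are the polynomial and the Gaussian (RBF) kernels (parameter names below are illustrative):

```python
import numpy as np

def polynomial_kernel(x, y, degree=3, c=1.0):
    """Polynomial kernel: K(x, y) = (x^T y + c)^degree."""
    return (np.dot(x, y) + c) ** degree

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma**2))
```

For the RBF kernel the implicit feature space is infinite-dimensional, which is exactly why working through the kernel rather than through $\Phi$ matters.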
Idea of kernel PCA: map the data into a feature space via $\Phi$ and extract principal components in that space (PCA). The result will be non-linear in the original data space!
Derivation
Suppose that the mean of the data in the feature space is zero:
$\frac{1}{n}\sum_{i=1}^{n} \Phi(x_i) = 0$
Covariance:
$C = \frac{1}{n}\sum_{i=1}^{n} \Phi(x_i)\,\Phi(x_i)^T$
Eigenvectors:
$C v = \lambda v$
Derivation Cont.
Eigenvectors can be expressed as a linear combination of the features:
$v = \sum_{i=1}^{n} \alpha_i \Phi(x_i)$
Proof:
$C v = \frac{1}{n}\sum_{i=1}^{n} \Phi(x_i)\,\Phi(x_i)^T v = \lambda v$
thus
$v = \frac{1}{\lambda n}\sum_{i=1}^{n} \Phi(x_i)\,\Phi(x_i)^T v = \frac{1}{\lambda n}\sum_{i=1}^{n} \big(\Phi(x_i)\cdot v\big)\,\Phi(x_i)$
using that
$x\,x^T v = (x \cdot v)\,x$
Derivation Cont.
So, from before we had
$v = \frac{1}{\lambda n}\sum_{i=1}^{n} \Phi(x_i)\,\Phi(x_i)^T v = \frac{1}{\lambda n}\sum_{i=1}^{n} \big(\Phi(x_i)\cdot v\big)\,\Phi(x_i)$
where $\Phi(x_i)\cdot v$ is just a scalar. This means that all solutions $v$ with $\lambda \neq 0$ lie in the span of $\Phi(x_1), \ldots, \Phi(x_n)$, i.e.,
$v = \sum_{i=1}^{n} \alpha_i \Phi(x_i)$
Derivation Cont.
By substituting this back into the eigenvalue equation (for the $j$-th eigenvector, $v_j = \sum_{l=1}^{n}\alpha_{jl}\Phi(x_l)$) we get:
$\frac{1}{n}\sum_{i=1}^{n} \Phi(x_i)\,\Phi(x_i)^T \sum_{l=1}^{n} \alpha_{jl}\Phi(x_l) = \lambda_j \sum_{l=1}^{n} \alpha_{jl}\Phi(x_l)$
We can rewrite it as
$\frac{1}{n}\sum_{i=1}^{n} \Phi(x_i) \sum_{l=1}^{n} \alpha_{jl}\,K(x_i, x_l) = \lambda_j \sum_{l=1}^{n} \alpha_{jl}\Phi(x_l)$
Derivation Cont.
By multiplying both sides by $\Phi(x_k)^T$ and plugging in the kernel we get:
$K^2 \alpha_j = n\lambda_j K \alpha_j$
We can remove a factor of $K$ from both sides (this only affects eigenvectors with zero eigenvalue, which will not be principal components anyway):
$K \alpha_j = n\lambda_j \alpha_j$
We have a normalization condition for the $\alpha_j$ vectors:
$v_j^T v_j = 1 \;\Rightarrow\; \sum_{k=1}^{n}\sum_{l=1}^{n} \alpha_{jl}\,\alpha_{jk}\,\Phi(x_l)^T\Phi(x_k) = 1 \;\Rightarrow\; \alpha_j^T K \alpha_j = 1$
Derivation Cont.
By multiplying $K\alpha_j = n\lambda_j \alpha_j$ by $\alpha_j^T$ and using the normalization condition we get:
$\lambda_j\, n\, \alpha_j^T \alpha_j = 1, \quad \forall j$
For a new point $x$, its projection onto the $j$-th principal component is:
$\Phi(x)^T v_j = \sum_{i=1}^{n} \alpha_{ji}\,\Phi(x)^T\Phi(x_i) = \sum_{i=1}^{n} \alpha_{ji}\,K(x, x_i)$
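A sketch of this step in NumPy (the function name and interface are assumptions, not from the slides): given a centered kernel matrix $K$, solve $K\alpha_j = n\lambda_j\alpha_j$ and rescale each $\alpha_j$ so that $\lambda_j n\,\alpha_j^T\alpha_j = 1$, i.e. so that each $v_j$ has unit norm in feature space:

```python
import numpy as np

def kpca_alphas(K_centered, d):
    """Solve K alpha_j = (n lambda_j) alpha_j for the top-d components.

    K_centered : (n, n) centered kernel matrix.
    Returns alphas of shape (n, d), scaled so that
    lambda_j * n * alpha_j^T alpha_j = 1 (i.e. each v_j has unit norm).
    """
    eigvals, eigvecs = np.linalg.eigh(K_centered)        # ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

    # eigh returns unit-norm eigenvectors u_j with K u_j = (n lambda_j) u_j;
    # dividing by sqrt(n lambda_j) gives alpha_j^T alpha_j = 1 / (n lambda_j).
    # In practice, components with non-positive eigenvalues should be skipped.
    alphas = eigvecs[:, :d] / np.sqrt(eigvals[:d])
    return alphas
```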
In general the features are not zero-mean, so we must center them:
$\tilde{\Phi}(x_i) = \Phi(x_i) - \frac{1}{n}\sum_{k=1}^{n}\Phi(x_k)$
The corresponding kernel is:
$\tilde{K}(x_i, x_j) = \tilde{\Phi}(x_i)^T \tilde{\Phi}(x_j)$
$= \Big(\Phi(x_i) - \frac{1}{n}\sum_{k=1}^{n}\Phi(x_k)\Big)^T \Big(\Phi(x_j) - \frac{1}{n}\sum_{k=1}^{n}\Phi(x_k)\Big)$
$= K(x_i, x_j) - \frac{1}{n}\sum_{k=1}^{n} K(x_i, x_k) - \frac{1}{n}\sum_{k=1}^{n} K(x_j, x_k) + \frac{1}{n^2}\sum_{l,k=1}^{n} K(x_l, x_k)$
In matrix form:
$\tilde{K} = K - \mathbf{1}_{1/n}K - K\,\mathbf{1}_{1/n} + \mathbf{1}_{1/n}K\,\mathbf{1}_{1/n}$
where $\mathbf{1}_{1/n}$ is the $n \times n$ matrix with all entries equal to $1/n$.
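A sketch of this centering in NumPy (the helper name is an assumption):

```python
import numpy as np

def center_kernel(K):
    """K_tilde = K - 1n K - K 1n + 1n K 1n, where 1n = ones((n, n)) / n."""
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n
```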
The low-dimensional representation of a new point $x$ is then:
$y_j = \sum_{i=1}^{n} \alpha_{ji}\,K(x, x_i), \quad j = 1, \ldots, d$
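A sketch of this projection for new points (NumPy assumed; `project`, the `kernel` callable, and the use of `alphas` from the eigenproblem sketch above are assumptions). The cross-kernel between new and training points is centered with the training-set statistics so that it is consistent with the centered $\tilde{K}$ used to compute the $\alpha_j$:

```python
import numpy as np

def project(X_new, X_train, K_train, alphas, kernel):
    """Compute y_j(x) = sum_i alpha_ji * K_tilde(x, x_i) for each new point x.

    K_train : uncentered (n, n) kernel matrix of the training data.
    alphas  : (n, d) coefficients from the kernel eigenproblem (on centered K).
    kernel  : callable kernel(x, y) -> scalar.
    """
    # Cross-kernel: K_cross[a, i] = K(x_new_a, x_train_i).
    K_cross = np.array([[kernel(xa, xi) for xi in X_train] for xa in X_new])

    # Center the cross-kernel consistently with the training kernel.
    col_means = K_train.mean(axis=0)                 # (1/n) sum_k K(x_i, x_k)
    total_mean = K_train.mean()                      # (1/n^2) sum_{l,k} K(x_l, x_k)
    row_means = K_cross.mean(axis=1, keepdims=True)  # (1/n) sum_k K(x, x_k)
    K_cross_centered = K_cross - row_means - col_means + total_mean

    return K_cross_centered @ alphas                 # shape (num_new, d)
```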
Example: KPCA
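The figure from the original slide is not reproduced here. As an illustrative sketch (assuming scikit-learn), kernel PCA with an RBF kernel maps two concentric circles to a representation where the classes are (nearly) linearly separated along the leading components, which linear PCA cannot achieve:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: the classes are not linearly separable in R^2.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Kernel PCA with an RBF kernel (gamma chosen ad hoc for this illustration).
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Z = kpca.fit_transform(X)

# In the KPCA coordinates the first component typically already separates
# the inner circle (label 1) from the outer circle (label 0).
print("inner mean on component 1:", Z[y == 1, 0].mean())
print("outer mean on component 1:", Z[y == 0, 0].mean())
```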
Properties of KPCA
Kernel PCA can give a good re-encoding of the data when it lies along a non-linear manifold.
The kernel matrix is $n \times n$, so kernel PCA will have difficulties if we have many data points (e.g., for $n = 10^5$ points, storing $K$ in double precision already takes about 80 GB).