Kernel Methods
George Lan
A. Russell Chandler III Chair Professor
H. Milton Stewart School of Industrial & Systems Engineering
Nonlinear regression
Want to fit a polynomial regression model
$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_d x^d + \epsilon$
Let $\tilde{x} = (1, x, x^2, \dots, x^d)^T$ and $\theta = (\theta_0, \theta_1, \theta_2, \dots, \theta_d)^T$; then the model is linear in the features:
$y = \theta^T \tilde{x} + \epsilon$
Problem of explicitly constructing features
Explicitly constructing a feature map $\phi(x): \mathbb{R}^n \mapsto F$ is problematic: the feature space can grow very large, very quickly.
E.g., the polynomial features of degree $d$ in $n$ variables:
$x_1^d, \; x_1^{d-1} x_2, \dots, x_1^{d-1} x_n, \; x_1^{d-2} x_2^2, \dots$
The total number of such features is
$\binom{d+n-1}{d} = \dfrac{(d+n-1)!}{d! \, (n-1)!}$
For $d = 6$ and $n = 100$, there are about 1.6 billion terms.
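As a quick sanity check of the count above (not from the original slides), a minimal Python sketch:

```python
from math import comb

d, n = 6, 100                      # polynomial degree and input dimension
num_features = comb(d + n - 1, d)  # (d + n - 1)! / (d! (n - 1)!)
print(num_features)                # 1609344100, i.e. about 1.6 billion monomials of degree d
```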
Can we avoid expanding the features?
Rather than computing the features explicitly and then taking an inner product, can we merge the two steps using a clever function $k(x_i, x_j)$?
E.g., polynomial with $d = 2$:
$\phi(x)^T \phi(y) = (x_1^2, \; x_1 x_2, \; x_2^2, \; x_2 x_1) \, (y_1^2, \; y_1 y_2, \; y_2^2, \; y_2 y_1)^T = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2$
$= (x_1 y_1 + x_2 y_2)^2 = (x^T y)^2$
Only $O(n)$ computation!
Polynomial kernel of degree $d$: $k(x, y) = (x^T y)^d = \phi(x)^T \phi(y)$
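To illustrate, a minimal NumPy sketch (not from the slides) that checks $\phi(x)^T \phi(y) = (x^T y)^2$ for the 2-dimensional, degree-2 case:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-dimensional input."""
    return np.array([v[0]**2, v[0]*v[1], v[1]**2, v[1]*v[0]])

def poly_kernel(x, y, d=2):
    """Polynomial kernel k(x, y) = (x^T y)^d, computed in O(n) time."""
    return (x @ y) ** d

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(phi(x) @ phi(y))    # explicit features, then inner product
print(poly_kernel(x, y))  # same value without ever building the features
```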
Typical kernels for vector data
Polynomial of degree $d$:
$k(x, y) = (x^T y)^d$
Polynomial of degree up to $d$:
$k(x, y) = (x^T y + c)^d$
Exponential kernel (infinite-degree polynomial):
$k(x, y) = \exp(s \cdot x^T y)$
Gaussian radial basis function (RBF) kernel:
$k(x, y) = \exp\left(-\dfrac{\|x - y\|^2}{2\sigma^2}\right)$
Laplace kernel:
$k(x, y) = \exp\left(-\dfrac{\|x - y\|}{\sigma}\right)$
Exponentiated distance (for a distance $d(x, y)$):
$k(x, y) = \exp\left(-\dfrac{d(x, y)^2}{\sigma^2}\right)$
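These kernels translate directly into code; a minimal NumPy sketch (parameter names s, c, sigma are my own defaults, not from the slides):

```python
import numpy as np

def poly_kernel(x, y, d=3):
    return (x @ y) ** d                      # polynomial of degree d

def poly_up_to_kernel(x, y, d=3, c=1.0):
    return (x @ y + c) ** d                  # polynomial of degree up to d

def exp_kernel(x, y, s=1.0):
    return np.exp(s * (x @ y))               # exponential ("infinite-degree polynomial")

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))   # Gaussian RBF

def laplace_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) / sigma)             # Laplace
```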
Feature space not unique!
E.g., polynomial with $d = 2$. One feature map:
$\phi(x)^T \phi(y) = (x_1^2, \; x_1 x_2, \; x_2^2, \; x_2 x_1) \, (y_1^2, \; y_1 y_2, \; y_2^2, \; y_2 y_1)^T = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (x^T y)^2$
A different (3-dimensional) feature map yields the same kernel:
$\tilde{\phi}(x)^T \tilde{\phi}(y) = (x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2) \, (y_1^2, \; \sqrt{2}\, y_1 y_2, \; y_2^2)^T = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (x^T y)^2$
What 𝑘(𝑥, 𝑦) can be called a kernel function?
$k(x, y)$ should be equivalent to first computing the features $\phi(x)$ and then taking an inner product: $k(x, y) = \phi(x)^T \phi(y)$
Take a dataset $D = \{x_1, x_2, x_3, \dots, x_n\}$
Compute the pairwise kernel values $k(x_i, x_j)$ and form the $n \times n$ kernel matrix (Gram matrix)
$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}$
$k(x, y)$ is a kernel function iff, for every such dataset, the matrix $K$ is positive semidefinite:
$\forall v \in \mathbb{R}^n, \; v^T K v \geq 0$
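To make the definition concrete, a minimal NumPy sketch (my own toy data) that forms the Gram matrix for the Gaussian RBF kernel and checks positive semidefiniteness via eigenvalues:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

X = np.random.randn(20, 2)   # a toy dataset of n = 20 points in R^2
n = X.shape[0]

# form the n x n Gram matrix K[i, j] = k(x_i, x_j)
K = np.array([[rbf_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

# K should be (numerically) positive semidefinite: all eigenvalues >= 0
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # True for a valid kernel such as the Gaussian RBF
```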
Support Vector Machines
Primal problem:
$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_j \xi_j$
$\text{s.t.} \;\; (w^T x^j + b)\, y^j \geq 1 - \xi_j, \quad \xi_j \geq 0, \;\; \forall j$
The inputs $x^j$ can be high-order polynomial features $\phi(x^j)$.
Lagrangian:
$L(w, b, \xi, \alpha, \beta) = \frac{1}{2} w^T w + C \sum_j \xi_j + \sum_j \alpha_j \left(1 - \xi_j - (w^T x^j + b)\, y^j\right) - \sum_j \beta_j \xi_j, \qquad \alpha_j \geq 0, \; \beta_j \geq 0$
Setting the derivatives of $L$ with respect to $w$ and $\xi$ to zero (together with the KKT conditions) gives
$w = \sum_j \alpha_j y^j x^j$
$b = y^k - w^T x^k$ for any $k$ such that $0 < \alpha_k < C$
SVM dual problem and kernelization
Plugging $w$ and $b$ back into the Lagrangian gives the dual problem
$\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^i y^j \, x^{i\,T} x^j$
$\text{s.t.} \;\; \sum_i \alpha_i y^i = 0, \quad 0 \leq \alpha_i \leq C$
The inner product $x^{i\,T} x^j$ can be replaced by $\phi(x^i)^T \phi(x^j)$, i.e., by $k(x^i, x^j)$.
The other steps also depend only on inner products:
$w^T x = \sum_j \alpha_j y^j \, x^{j\,T} x$
$b = y^k - w^T x^k$ for any $k$ such that $0 < \alpha_k < C$
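Once the dual variables $\alpha$ are available (e.g., from a QP solver), the kernelized steps above never touch $\phi$ explicitly. A minimal sketch of those evaluation steps, with all array and function names my own:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def svm_offset(X_train, y_train, alpha, C, kernel=rbf_kernel):
    """b = y_k - w^T phi(x_k) for any k with 0 < alpha_k < C (a margin support vector)."""
    k = next(i for i, a in enumerate(alpha) if 0 < a < C)
    wTx = sum(a * yj * kernel(xj, X_train[k]) for a, yj, xj in zip(alpha, y_train, X_train))
    return y_train[k] - wTx

def svm_decision(x, X_train, y_train, alpha, b, kernel=rbf_kernel):
    """f(x) = sum_j alpha_j y^j k(x^j, x) + b, i.e. the kernelized w^T phi(x) + b."""
    return sum(a * yj * kernel(xj, x) for a, yj, xj in zip(alpha, y_train, X_train)) + b
```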
Illustration of kernel SVM
Kernel SVM
implicitly maps the data to a new nonlinear feature space
finds a linear decision boundary in that new space
Ridge regression and matrix inversion lemma
Matrix inversion lemma (for $B \in \mathbb{R}^{n \times m}$):
$(B B^T + \lambda I)^{-1} B = B \, (B^T B + \lambda I)^{-1}$
Note that $X = (x^{(1)}, x^{(2)}, \dots, x^{(m)})$
Evaluate the ridge regression solution $\hat{\theta} = (X X^T + \lambda I)^{-1} X y$ at a new test point $x$:
$x^T \hat{\theta} = x^T (X X^T + \lambda I)^{-1} X y = x^T X \, (X^T X + \lambda I)^{-1} y$
This depends only on inner products between data points.
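A minimal NumPy sketch (dimensions chosen arbitrarily) that checks the lemma numerically and confirms the two forms of the ridge prediction agree:

```python
import numpy as np

n, m, lam = 5, 8, 0.1
rng = np.random.default_rng(0)
X = rng.standard_normal((n, m))     # columns are the m data points x^(i)
y = rng.standard_normal(m)
x = rng.standard_normal(n)          # a new test point

# primal form: theta_hat = (X X^T + lam I_n)^{-1} X y
theta = np.linalg.solve(X @ X.T + lam * np.eye(n), X @ y)

# dual form via the matrix inversion lemma: x^T X (X^T X + lam I_m)^{-1} y
pred_dual = x @ X @ np.linalg.solve(X.T @ X + lam * np.eye(m), y)

print(np.isclose(x @ theta, pred_dual))   # True: both depend only on inner products
```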
Kernel ridge regression
$f(x) = \theta^T x = y^T (X^T X + \lambda I_m)^{-1} X^T x$ depends only on inner products!
$X^T X = \begin{pmatrix} x_1^T x_1 & \cdots & x_1^T x_m \\ \vdots & \ddots & \vdots \\ x_m^T x_1 & \cdots & x_m^T x_m \end{pmatrix}, \qquad X^T x = \begin{pmatrix} x_1^T x \\ \vdots \\ x_m^T x \end{pmatrix}$
Kernel ridge regression: replace each inner product by a kernel function
$X^T X \rightarrow K = \left[k(x_i, x_j)\right]_{m \times m}$
$X^T x \rightarrow k_x = \left[k(x_i, x)\right]_{m \times 1}$
$f(x) = y^T (K + \lambda I_m)^{-1} k_x$
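Putting the pieces together, a minimal from-scratch kernel ridge regression sketch with the Gaussian RBF kernel (toy 1-d data of my own, not from the slides):

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def kernel_ridge_fit(X, y, lam=0.1, kernel=rbf_kernel):
    """Precompute c = (K + lam I_m)^{-1} y so that f(x) = c^T k_x."""
    m = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    return np.linalg.solve(K + lam * np.eye(m), y)

def kernel_ridge_predict(x, X, c, kernel=rbf_kernel):
    """f(x) = y^T (K + lam I_m)^{-1} k_x, with k_x[i] = k(x_i, x)."""
    k_x = np.array([kernel(xi, x) for xi in X])
    return c @ k_x

# tiny 1-d example
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.randn(30)
c = kernel_ridge_fit(X, y, lam=0.1)
print(kernel_ridge_predict(np.array([0.5]), X, c))
```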
Kernel ridge regression
Use the Gaussian RBF kernel
$k(x, y) = \exp\left(-\dfrac{\|x - y\|^2}{2\sigma^2}\right)$
[Figure: fitted curves for (large $\sigma$, large $\lambda$), (small $\sigma$, small $\lambda$), and (small $\sigma$, large $\lambda$)]
Use cross-validation to choose the parameters $\sigma$ and $\lambda$.
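One common way to do this cross-validation, assuming scikit-learn is available (its KernelRidge parameterizes the RBF width via gamma, which plays the role of $1/(2\sigma^2)$, and the regularizer via alpha, which plays the role of $\lambda$):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.randn(100)

# grid over the regularization strength (lambda) and the RBF width (gamma = 1 / (2 sigma^2))
param_grid = {"alpha": [1e-3, 1e-2, 1e-1, 1.0],
              "gamma": [0.1, 1.0, 10.0]}
search = GridSearchCV(KernelRidge(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```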
Principal component analysis
Given a set of $m$ centered observations $x^i \in \mathbb{R}^n$, PCA finds the direction that maximizes the variance.
$X = (x^1, x^2, \dots, x^m)$
$w^* = \operatorname{argmax}_{\|w\| = 1} \; \frac{1}{m} \sum_i (w^T x^i)^2 = \operatorname{argmax}_{\|w\| = 1} \; \frac{1}{m} w^T X X^T w$
With $C = \frac{1}{m} X X^T$, the direction $w^*$ can be found by solving the eigenvalue problem
$C w = \lambda w$
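For reference, the linear version as a short NumPy sketch (random data of my own; the slide assumes the observations are already centered):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))           # columns are the m observations
X -= X.mean(axis=1, keepdims=True)          # center the data

m = X.shape[1]
C = (X @ X.T) / m                           # covariance matrix C = (1/m) X X^T
eigvals, eigvecs = np.linalg.eigh(C)        # solve C w = lambda w (ascending order)
w_star = eigvecs[:, -1]                     # eigenvector with the largest eigenvalue
print(w_star)
```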
Alternative expression for PCA
The principal component lies in the span of the data:
$w = \sum_i \alpha_i x^i = X \alpha$
since $w = \frac{1}{\lambda} C w = \frac{1}{\lambda m} X X^T w = X \left(\frac{1}{\lambda m} X^T w\right) = X \alpha$ for any $\lambda > 0$.
Plugging this in, we have
$C w = \frac{1}{m} X X^T X \alpha = \lambda w = \lambda X \alpha$
Furthermore, for each data point $x^i$ the following relation holds, and it depends only on the inner product matrix:
$x^{i\,T} C w = \frac{1}{m} x^{i\,T} X X^T X \alpha = \lambda \, x^{i\,T} X \alpha, \quad \forall i$
In matrix form: $\frac{1}{m} X^T X X^T X \alpha = \lambda X^T X \alpha$
Kernel PCA
Key Idea: Replace inner product matrix by kernel matrix
"
PCA: 𝑋 ! 𝑋𝑋 ! 𝑋𝛼 = 𝜆𝑋 ! 𝑋𝛼
-
Kernel PCA:
𝑥. ↦ 𝜙 𝑥. , X ↦ Φ = 𝜙 𝑥" , … , 𝜙 𝑥. , 𝐾 = Φ ! Φ
Nonlinear principal component 𝑤 = Φ𝛼
" "
𝐾𝐾𝛼 = 𝜆𝐾𝛼, equivalent to 𝐾𝛼 = 𝜆 𝛼
- -
The solutions of the above two linear systems differ only for eigenvectors
of 𝐾 with zero eigenvalue.
! !
𝐾(" 𝐾𝛼 − 𝜆𝛼) = 0, " 𝐾𝛼 − 𝜆𝛼 can not belong to the null space of 𝐾 since
neither 𝐾𝛼 nor 𝛼 does (under the assumption that 𝐾𝛼 is nonzero.
Key computation: form an 𝑚 by 𝑚 kernel matrix 𝐾, and then
perform eigen-decomposition on 𝐾
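A minimal kernel PCA sketch along these lines; it solves $\frac{1}{m} K \alpha = \lambda \alpha$ directly, and for brevity omits feature-space centering of $K$ and the normalization of $\alpha$ (all data and names are my own):

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def kernel_pca(X, kernel=rbf_kernel, n_components=2):
    """Return the top alphas and eigenvalues of (1/m) K alpha = lambda alpha."""
    m = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    eigvals, eigvecs = np.linalg.eigh(K / m)          # ascending order
    return eigvecs[:, -n_components:], eigvals[-n_components:]

def kpca_project(x, X, alphas, kernel=rbf_kernel):
    """w^T phi(x) = sum_i alpha_i k(x_i, x), evaluated for each retained component."""
    k_x = np.array([kernel(xi, x) for xi in X])
    return alphas.T @ k_x

X = np.random.randn(50, 2)
alphas, lams = kernel_pca(X)
print(kpca_project(np.array([0.0, 0.0]), X, alphas))
```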
Kernel PCA example
Gaussian radial basis function (RBF) kernel
"
TIT !
exp − #U"
over 2 dimensional space
Eigen-vector evaluated at a test point 𝑥 is a function
𝑤 % 𝜙 𝑥 = ∑& 𝛼& < 𝜙 𝑥 & , 𝜙 𝑥 > = ∑& 𝛼& 𝑘(𝑥 & , 𝑥)
Canonical correlation analysis
Canonical correlation analysis
Given $D = \{(x^1, y^1), \dots, (x^m, y^m)\} \sim P(x, y)$
$X = (x^1, x^2, \dots, x^m)$
$Y = (y^1, y^2, \dots, y^m)$
Find two vectors $w_x$ and $w_y$ and project the two views onto them: $X^T w_x$ and $Y^T w_y$,
such that the correlation of the projected data is maximized:
$\rho = \max_{w_x, w_y} \operatorname{corr}(X^T w_x, Y^T w_y) = \max_{w_x, w_y} \dfrac{\langle X^T w_x, Y^T w_y \rangle}{\|X^T w_x\| \, \|Y^T w_y\|}$
Matrix form of CCA
Define the covariance matrix of 𝑥, 𝑦
$C = \mathbb{E}_{(x, y)} \begin{pmatrix} x \\ y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}^T = \begin{pmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{pmatrix}$
The optimization problem is equivalent to
$\rho = \max_{w_x, w_y} \dfrac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x} \, \sqrt{w_y^T C_{yy} w_y}}$
CCA as generalized eigenvalue problem
The optimality conditions say
$C_{xy} w_y = \lambda \, C_{xx} w_x$
$C_{yx} w_x = \lambda \, C_{yy} w_y$
with $\lambda = \dfrac{w_x^T C_{xy} w_y}{w_x^T C_{xx} w_x}$ (obtained by setting the gradient equal to zero).
Putting these conditions into matrix form:
$\begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix} = \lambda \begin{pmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix}$
This is a generalized eigenvalue problem $A w = \lambda B w$.
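A minimal sketch of solving this generalized eigenvalue problem with SciPy; the small ridge term added to the right-hand-side block matrix (to make it positive definite) and the synthetic two-view data are my own choices, not part of the slides:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
m = 200
X = rng.standard_normal((3, m))                        # columns are the x^i
Y = 0.5 * X[:2] + 0.1 * rng.standard_normal((2, m))    # a correlated second view y^i

Cxx, Cyy, Cxy = (X @ X.T) / m, (Y @ Y.T) / m, (X @ Y.T) / m
dx, dy = X.shape[0], Y.shape[0]

A = np.block([[np.zeros((dx, dx)), Cxy],
              [Cxy.T,              np.zeros((dy, dy))]])
B = np.block([[Cxx,                np.zeros((dx, dy))],
              [np.zeros((dy, dx)), Cyy]]) + 1e-8 * np.eye(dx + dy)

eigvals, eigvecs = eigh(A, B)        # generalized eigenvalue problem A w = lambda B w
w_x, w_y = eigvecs[:dx, -1], eigvecs[dx:, -1]
print(eigvals[-1])                   # largest eigenvalue corresponds to the top canonical correlation
```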
CCA in inner product format
Similar to PCA, the directions of projection lie in the span of the data $X = (x^1, x^2, \dots, x^m)$, $Y = (y^1, y^2, \dots, y^m)$:
$w_x = X \alpha, \quad w_y = Y \beta$
$C_{xy} = \frac{1}{m} X Y^T, \quad C_{xx} = \frac{1}{m} X X^T, \quad C_{yy} = \frac{1}{m} Y Y^T$
Earlier we had $\rho = \max_{w_x, w_y} \dfrac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x} \, \sqrt{w_y^T C_{yy} w_y}}$
Plugging in $w_x = X\alpha$ and $w_y = Y\beta$, the data appear only through inner products:
$\rho = \max_{\alpha, \beta} \dfrac{\alpha^T X^T X \, Y^T Y \beta}{\sqrt{\alpha^T X^T X \, X^T X \alpha} \, \sqrt{\beta^T Y^T Y \, Y^T Y \beta}}$
Kernel CCA
Replace inner product matrix by kernel matrix
$\rho = \max_{\alpha, \beta} \dfrac{\alpha^T K_x K_y \beta}{\sqrt{\alpha^T K_x K_x \alpha} \, \sqrt{\beta^T K_y K_y \beta}}$
where $K_x$ is the kernel matrix for the data $X$, with entries $K_x(i, j) = k(x_i, x_j)$ (and similarly for $K_y$).
Solve the generalized eigenvalue problem
$\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \lambda \begin{pmatrix} K_x K_x & 0 \\ 0 & K_y K_y \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$
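The kernelized version follows the same pattern; a minimal sketch with my own toy data. The small ridge term on the right-hand side is my own regularization choice (the blocks $K_x K_x$ and $K_y K_y$ are typically singular, and in practice kernel CCA is usually regularized anyway):

```python
import numpy as np
from scipy.linalg import eigh

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def gram(X, kernel=rbf_kernel):
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

rng = np.random.default_rng(0)
m = 60
X = rng.standard_normal((m, 2))                    # rows are the x_i
Y = X ** 2 + 0.1 * rng.standard_normal((m, 2))     # a nonlinearly related second view

Kx, Ky = gram(X), gram(Y)
eps = 1e-6

A = np.block([[np.zeros((m, m)), Kx @ Ky],
              [Ky @ Kx,          np.zeros((m, m))]])
B = np.block([[Kx @ Kx,          np.zeros((m, m))],
              [np.zeros((m, m)), Ky @ Ky]]) + eps * np.eye(2 * m)

eigvals, eigvecs = eigh(A, B)        # generalized eigenvalue problem over (alpha, beta)
alpha, beta = eigvecs[:m, -1], eigvecs[m:, -1]
print(eigvals[-1])                   # top kernel canonical correlation (up to regularization)
```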