New Schedule: Fall 2004 Pattern Recognition For Vision
[Schedule table: lecture dates from Oct. 21 to Dec. 8, with one to two weeks per topic.]
Classification
Bernd Heisele
Fall 2004
Overview
Introduction
Classification
• Linear, non-linear separation
• Two-class, multi-class problems
Two approaches:
• Density estimation, classify with Bayes decision: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA)
• Without density estimation: Support Vector Machines (SVM)
LDA Bayes Rule
Bayes Rule:
$$\underbrace{P(\omega \mid x)}_{\text{posterior}} = \frac{\overbrace{p(x \mid \omega)}^{\text{likelihood}}\ \overbrace{P(\omega)}^{\text{prior}}}{\underbrace{p(x)}_{\text{evidence}}}$$
$x$: random vector, $\omega$: class
Decide $\omega_1$ if $\dfrac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)} > 1$; otherwise $\omega_2$.
Likelihood Ratio:
$$\frac{p(x \mid \omega_1)\, P(\omega_1)}{p(x \mid \omega_2)\, P(\omega_2)} > 1$$
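As a concrete illustration (not from the slides), a minimal sketch of this decision rule, assuming Gaussian class-conditional densities and purely illustrative means, covariance, and priors:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class-conditional Gaussians and priors (illustrative values only).
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
Sigma = np.eye(2)                 # shared covariance, for simplicity
P1, P2 = 0.6, 0.4                 # prior probabilities P(w1), P(w2)

x = np.array([0.3, -0.2])         # point to classify

# Likelihood ratio test: decide w1 if p(x|w1)P(w1) / (p(x|w2)P(w2)) > 1.
ratio = (multivariate_normal.pdf(x, mu1, Sigma) * P1) / \
        (multivariate_normal.pdf(x, mu2, Sigma) * P2)
print("decide w1" if ratio > 1 else "decide w2")
```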
Gaussian: $p(x \mid \omega_i) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}}\; e^{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i)}$
With equal covariances $\Sigma_1 = \Sigma_2 = \Sigma$, the log of the likelihood ratio is linear in $x$:
$$f(x) = x^T \underbrace{\Sigma^{-1}(\mu_1 - \mu_2)}_{w} + \underbrace{\tfrac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1}(\mu_2 - \mu_1) + \ln\frac{P(\omega_1)}{P(\omega_2)}}_{b}$$
[Figure: decision boundary $f(x') = 0$ between the whitened class means $\mu_1'$ and $\mu_2'$.]
$$f(x) = x^T \Sigma^{-1}(\mu_1 - \mu_2) + \tfrac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1}(\mu_2 - \mu_1) + \ln\big(P(\omega_1)/P(\omega_2)\big)$$
$$\qquad = x'^T(\mu_1' - \mu_2') + \underbrace{\tfrac{1}{2}(\mu_1' + \mu_2')^T(\mu_2' - \mu_1') + \ln\big(P(\omega_1)/P(\omega_2)\big)}_{b},$$
in whitened coordinates $x' = \Sigma^{-1/2} x$, $\mu_i' = \Sigma^{-1/2} \mu_i$.
Density estimation:
$$\hat{\mu}_i = \frac{1}{N_i}\sum_{n=1}^{N_i} x_{i,n}, \qquad \hat{\Sigma} = \frac{1}{N_1 + N_2}\sum_{i=1}^{2}\sum_{n=1}^{N_i}\big(x_{i,n} - \hat{\mu}_i\big)\big(x_{i,n} - \hat{\mu}_i\big)^T$$
$$f(x) = \operatorname{sign}\!\big(x^T w + b\big), \qquad w = \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_2), \qquad b = \tfrac{1}{2}(\hat{\mu}_1 + \hat{\mu}_2)^T \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
Approximate $P(\omega_1)/P(\omega_2)$ by $N_1/N_2$.
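A minimal sketch of this estimation and classification step (my addition, not slide code); the function and variable names are illustrative:

```python
import numpy as np

def fit_lda(X1, X2):
    """Two-class LDA: class means, pooled covariance, then w and b."""
    N1, N2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled covariance estimate over both classes.
    S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N1 + N2)
    S_inv = np.linalg.inv(S)
    w = S_inv @ (mu1 - mu2)
    # Priors approximated by class frequencies, so ln(P1/P2) ~ ln(N1/N2).
    b = 0.5 * (mu1 + mu2) @ S_inv @ (mu2 - mu1) + np.log(N1 / N2)
    return w, b

def classify_lda(x, w, b):
    return np.sign(x @ w + b)     # +1 -> class 1, -1 -> class 2
```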
For unequal covariances (QDA) the decision function is quadratic, $f(x) = x^T A x + x^T w + w_0$, with
$$A = -\tfrac{1}{2}\big(\Sigma_1^{-1} - \Sigma_2^{-1}\big) \;\;\text{a matrix}, \qquad w = \Sigma_1^{-1}\mu_1 - \Sigma_2^{-1}\mu_2 \;\;\text{a vector}, \qquad w_0 = \dots \text{ well, the rest of it} \;\;\text{a scalar}.$$
($p$ is the dimension of $x$.)
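A hedged sketch of the full quadratic discriminant, filling in the scalar term $w_0$ with its standard form (the slide leaves it implicit); names are illustrative:

```python
import numpy as np

def qda_discriminant(x, mu1, S1, mu2, S2, P1, P2):
    """f(x) = x^T A x + x^T w + w0; decide w1 if f(x) > 0."""
    S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)
    A = -0.5 * (S1_inv - S2_inv)                    # matrix term
    w = S1_inv @ mu1 - S2_inv @ mu2                 # vector term
    w0 = (-0.5 * mu1 @ S1_inv @ mu1                 # "the rest of it": scalar term
          + 0.5 * mu2 @ S2_inv @ mu2
          - 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2))
          + np.log(P1 / P2))
    return x @ A @ x + x @ w + w0
```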
[Figure: three class means $\mu_1, \mu_2, \mu_3$ in the original space $X$ and their images $\mu_1', \mu_2', \mu_3'$ in the projected space $X'$, with projection directions $w_1, w_2$ and projected coordinates $y_1(\mu_1), y_2(\mu_1)$.]
Find the $n$-dimensional subspace that gives the best linear discrimination between the $k$ classes:
$$y = (w_1 \,|\, w_2 \,|\, \dots \,|\, w_n)^T x$$
also known as the Fisher Linear Discriminant.
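One common way to compute this projection (a sketch, assuming the usual within-class and between-class scatter matrices, which the extracted slides do not show explicitly) is the generalized eigenproblem below:

```python
import numpy as np
from scipy.linalg import eigh

def fisher_lda(Xs, n_dims):
    """Multi-class Fisher LDA. Xs: list of (N_i, d) arrays, one per class.
    Returns W of shape (d, n_dims) so that y = W.T @ x is the projection."""
    d = Xs[0].shape[1]
    mu = np.vstack(Xs).mean(axis=0)                 # overall mean
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))     # within/between-class scatter
    for X in Xs:
        mi = X.mean(axis=0)
        Sw += (X - mi).T @ (X - mi)
        Sb += len(X) * np.outer(mi - mu, mi - mu)
    # Generalized eigenproblem Sb v = lambda Sw v; keep the leading directions.
    vals, vecs = eigh(Sb, Sw)
    order = np.argsort(vals)[::-1]
    return vecs[:, order[:n_dims]]
```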
Computation
Advantages:
Problems:
• LDA is based on a single prototype per class (class center), which is often insufficient in practice.
SVM (linear, separable case): maximizing the margin $d$ leads to the following minimization. Fix the scale of the hyperplane so that the closest points satisfy $y_i(x_i^T w + b) = 1$, or alternatively: $w = w'/d$, $b = b'/d$.
$$\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{subject to } y_i(x_i^T w + b) \geq 1, \quad \text{where } d = \frac{1}{\|w\|}$$
Convex optimization problem with a quadratic objective function and linear constraints.
Primal:
$$\min_{w,b}\ \frac{\|w\|^2}{2} \quad \text{subject to: } y_i(x_i^T w + b) \geq 1$$
Dual:
$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{N}\alpha_i \alpha_k\, y_i y_k\, x_i^T x_k \quad \text{subject to: } \alpha_i \geq 0,\ \ \sum_{i=1}^{N}\alpha_i y_i = 0$$
The primal has a dense inequality constraint for every point in the training set. The dual has a single dense equality constraint and a set of box constraints, which makes it easier to solve than the primal.
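To make the dual concrete, a sketch (not from the course) that solves it numerically with cvxopt's generic QP solver, assuming separable data and that cvxopt is installed; the maximization is rewritten as the equivalent minimization of $\tfrac{1}{2}\alpha^T Q \alpha - \mathbf{1}^T\alpha$:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard_margin(X, y):
    """Solve the hard-margin dual with a generic QP solver.
    X: (N, d) data (assumed linearly separable), y: (N,) labels in {-1, +1}."""
    N = len(y)
    Q = (y[:, None] * X) @ (y[:, None] * X).T       # Q_ik = y_i y_k x_i^T x_k
    P, q = matrix(Q), matrix(-np.ones(N))           # min 1/2 a^T Q a - 1^T a
    G, h = matrix(-np.eye(N)), matrix(np.zeros(N))  # -a_i <= 0, i.e. a_i >= 0
    A, b = matrix(y.astype(float).reshape(1, -1)), matrix(0.0)  # sum a_i y_i = 0
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    w = ((alpha * y)[:, None] * X).sum(axis=0)      # w = sum_i a_i y_i x_i
    sv = alpha > 1e-6                               # support vectors: a_i > 0
    b0 = np.mean(y[sv] - X[sv] @ w)                 # from y_i (x_i^T w + b) = 1
    return alpha, w, b0
```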
$$\sum_{i=1}^{N}\alpha_i y_i = 0, \qquad \alpha_i \geq 0 \ \ \forall i, \qquad y_i(x_i^T w + b) - 1 \geq 0 \ \ \forall i$$
$$w = \sum_{i=1}^{N}\alpha_i y_i x_i \quad\Rightarrow\quad f(x) = \sum_{i=1}^{N}\alpha_i y_i\, x_i^T x + b$$
[Figure: separating hyperplane $f(x)=0$ with margin hyperplanes $f(x)=1$ and $f(x)=-1$; support vectors $x_i$ have $\alpha_i > 0$; margin width $M = 2d$.]
Non-separable case: introduce slack variables $\xi_i \geq 0$ that allow margin violations.
$$\min_{w,b}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{subject to: } y_i(x_i^T w + b) \geq 1 - \xi_i, \ \ \xi_i \geq 0$$
$$\max_{\alpha}\ L_D = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{N}\alpha_i\alpha_k\, y_i y_k\, x_i^T x_k \quad \text{subject to } 0 \leq \alpha_i \leq C,\ \ \sum_{i=1}^{N}\alpha_i y_i = 0$$
[Figure: soft-margin hyperplane $f(x)=0$ with margin hyperplanes $f(x)=\pm 1$ and margin width $M=2d$; some points $x_i$ lie inside the margin.]
$$f(x) = \sum_{i=1}^{N}\alpha_i y_i\, x_i^T x + b$$
$$\alpha_i = 0 \ \Rightarrow\ \xi_i = 0,\ y_i f(x_i) > 1$$
$$0 < \alpha_i < C \ \Rightarrow\ \xi_i = 0,\ y_i f(x_i) = 1 \quad \text{unbounded support vectors}$$
$$\alpha_i = C \ \Rightarrow\ \xi_i \geq 0,\ y_i f(x_i) \leq 1 \quad \text{bounded support vectors}$$
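A brief sketch (my addition) using scikit-learn's SVC on synthetic data to read off which training points end up as bounded ($\alpha_i = C$) or unbounded ($0 < \alpha_i < C$) support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

C = 1.0
clf = SVC(kernel='linear', C=C).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only.
alpha = np.abs(clf.dual_coef_).ravel()
bounded = np.isclose(alpha, C)        # alpha_i = C: inside the margin or misclassified
unbounded = ~bounded                  # 0 < alpha_i < C: exactly on the margin
print(bounded.sum(), "bounded and", unbounded.sum(), "unbounded support vectors")
```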
SVM—Non-linear (NL)
Non-linear mapping: $x' = \Phi(x)$
[Figure: input space $(x_1, x_2)$ and feature space $(x_1', x_2')$.]
$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{N}\alpha_i\alpha_k\, y_i y_k\, x_i'^T x_k' \quad \text{subject to } 0 \leq \alpha_i \leq C,\ \ \sum_{i=1}^{N}\alpha_i y_i = 0$$
$$f(x) = \sum_{i=1}^{N}\alpha_i y_i\, \Phi(x)^T\Phi(x_i) + b = \sum_{i=1}^{N}\alpha_i y_i\, K(x, x_i) + b, \quad \text{using } w = \sum_i \alpha_i y_i\, x_i'.$$
$$K(u, v) = \sum_n \lambda_n\, \phi_n(u)\, \phi_n(v)$$
MLP: $\tanh(u^T v - \theta)$
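A short sketch of the kernelized decision function with a few common kernel choices; kernel parameters and names are illustrative, not from the slides:

```python
import numpy as np

# A few common Mercer kernels (the tanh "MLP" kernel is only conditionally valid).
def k_linear(u, v):               return u @ v
def k_poly(u, v, p=2, c=1.0):     return (u @ v + c) ** p
def k_rbf(u, v, gamma=0.5):       return np.exp(-gamma * np.sum((u - v) ** 2))

def decision(x, sv_X, sv_y, alpha, b, kernel=k_rbf):
    """f(x) = sum_i alpha_i y_i K(x, x_i) + b, summed over the support vectors."""
    return sum(a * yi * kernel(x, xi)
               for a, yi, xi in zip(alpha, sv_y, sv_X)) + b
```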
SVM—Polynomial Kernel
$$K(u - v) = \sum_{k=0}^{\infty} \lambda_k\, e^{\,j 2\pi \kappa_k u}\, e^{-j 2\pi \kappa_k v} \qquad \forall \kappa_k \in \mathbb{Z} \ \ \big([-\infty, \infty]^d\big)$$
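For the polynomial kernel named in the heading, a small numerical check (my addition) that $K(u, v) = (u^T v + 1)^2$ equals the inner product of explicit degree-2 feature maps:

```python
import numpy as np

def phi2(x):
    """Explicit feature map whose inner product equals (u^T v + 1)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

u, v = np.array([0.5, -1.0]), np.array([2.0, 0.3])
assert np.isclose((u @ v + 1.0) ** 2, phi2(u) @ phi2(v))  # kernel == explicit map
```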
[Figure: two multi-class schemes for classes A, B, C, D. One arranges pairwise/group decisions in a tree (A or B or C or D, then A or B / C or D, down to A, B, C, D); the other trains one-vs-all classifiers A vs. {B,C,D}, B vs. {A,C,D}, C vs. {A,B,D}, D vs. {A,B,C}.]
One-vs-all: Training: k classifiers; Classification: k evaluations.
One-vs-one (pairwise): Training: k(k-1)/2 classifiers; Classification: k-1 evaluations (along a tree/DAG).
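A sketch (not from the slides) of the one-vs-all scheme: train $k$ binary classifiers, one per class against the rest, and at classification time evaluate all $k$ and pick the largest decision value; the classifier type (LinearSVC) is an illustrative choice:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, y, classes, C=1.0):
    """Train one binary SVM per class (k classifiers in total)."""
    return {c: LinearSVC(C=C).fit(X, np.where(y == c, 1, -1)) for c in classes}

def classify_one_vs_all(x, classifiers):
    """Evaluate all k classifiers and return the class with the largest margin."""
    scores = {c: clf.decision_function(x.reshape(1, -1))[0]
              for c, clf in classifiers.items()}
    return max(scores, key=scores.get)
```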
[Slide fragments: leave-one-out error and support vectors; smooth function.]
Regularization—Reproducing Kernel Hilbert Space (RKHS)
Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$:
Reproducing property: $f(x) = \big\langle K(x, y), f(y)\big\rangle_{\mathcal{H}}$
Inner product, for $f = \sum_n a_n \phi_n$ and $g = \sum_n b_n \phi_n$: $\ \langle f, g\rangle_{\mathcal{H}} \equiv \sum_n \dfrac{a_n b_n}{\lambda_n}$
Norm: $\ \|f\|_{\mathcal{H}}^2 = \langle f, f\rangle_{\mathcal{H}} = \sum_n \dfrac{a_n^2}{\lambda_n}$
Regularized learning in the RKHS:
$$\min_{f \in \mathcal{H}}\ \sum_{i=1}^{N} V\big(y_i, f(x_i)\big) + \lambda\, \|f\|_{\mathcal{H}}^2$$
With the hinge loss $V\big(y, f(x)\big) = \big(1 - y f(x)\big)_+$ this corresponds to the SVM problem $\min_{w,b}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i$.
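As a sketch of how this functional can be evaluated in practice (my addition), write $f$ as a kernel expansion $f(x) = \sum_j c_j K(x, x_j)$, so that $f(x_i) = (Kc)_i$ and $\|f\|_{\mathcal{H}}^2 = c^T K c$:

```python
import numpy as np

def regularized_risk(c, K, y, lam):
    """sum_i V(y_i, f(x_i)) + lam * ||f||_H^2 for f = sum_j c_j K(., x_j),
    using f(x_i) = (K c)_i and ||f||_H^2 = c^T K c, with the hinge loss as V."""
    f = K @ c
    hinge = np.maximum(0.0, 1.0 - y * f).sum()   # V(y, f) = (1 - y f)_+
    return hinge + lam * c @ K @ c
```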
• SVMs are binary classifiers; they are not efficient for problems with a large number of classes.