Introduction to Machine Learning and Data Mining (Học máy và Khai phá dữ liệu)
Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology
2021
2
Contents
¡ Introduction to Machine Learning & Data Mining
¡ Unsupervised learning
¡ Supervised learning
¨ Support Vector Machines
¡ Practical advice
3
Support Vector Machines (1)
¡ The Support Vector Machine (SVM) (máy vectơ hỗ trợ) was
proposed by Vapnik and his colleagues in the 1970s, and
became famous and popular in the 1990s.
¡ Originally, SVM is a method for linear classification: it finds a
hyperplane (also called a linear classifier) that separates the
two classes of data.
¡ For non-linear classification, where no hyperplane separates
the data well, kernel functions (hàm nhân) are used.
¨ A kernel function implicitly transforms the data into
another space, in which the data become linearly separable.
¡ The method is often called linear SVM when no kernel function is
used (in fact, linear SVM uses a linear kernel).
4
Support Vector Machines (2)
¡ SVM is supported by a strong theory that explains its performance.
¡ It can work well on very high-dimensional problems.
¡ It is now one of the most popular and strongest methods.
¡ For text categorization, linear SVM performs very well.
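¨ As a concrete illustration, a minimal scikit-learn sketch of a linear SVM for a
two-class text categorization task (the chosen newsgroup categories and the value
of C are only example choices):
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Two example categories; the data is downloaded on first use
    cats = ['rec.autos', 'sci.space']
    train = fetch_20newsgroups(subset='train', categories=cats)
    test = fetch_20newsgroups(subset='test', categories=cats)

    # TF-IDF features + linear SVM (C chosen arbitrarily here)
    model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
    model.fit(train.data, train.target)
    print('test accuracy:', model.score(test.data, test.target))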
5
1. SVM: the linearly separable case
¡ Problem representation:
¨ Training data D = {(x1, y1), (x2, y2), …, (xr, yr)} with r instances.
¨ Each xi is a vector in an n-dimensional space,
e.g., xi = (xi1, xi2, …, xin)T. Each dimension represents an attribute.
¨ Bold characters denote vectors.
¨ yi is a class label in {-1, 1}: '1' denotes the positive class, '-1' the negative class.
[Liu, 2006]
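¨ In code, such a training set is typically stored as an r×n matrix plus a label
vector; a tiny NumPy sketch with made-up values:
    import numpy as np

    # r = 4 instances, n = 2 attributes (toy values for illustration)
    X = np.array([[ 1.0,  2.0],
                  [ 2.0,  3.0],
                  [-1.0, -1.5],
                  [-2.0, -0.5]])   # row i is the vector xi
    y = np.array([1, 1, -1, -1])   # labels yi in {-1, +1}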
8
Hyperplane with max margin
¡ SVM selects the hyperplane with max margin.
(i.e., SVM finds the separating hyperplane that has the largest margin)
¡ It can be proven that the max-margin hyperplane yields the smallest
generalization-error bound among all possible separating hyperplanes.
[Liu, 2006]
9
Marginal hyperplanes
¡ Assume that the two classes in our data can be separated
clearly by a hyperplane.
¡ Let (x+, 1) in the positive class and (x-, -1) in the negative class
be the instances closest to the separating hyperplane H0:
〈w ⋅ x〉 + b = 0
¡ We define two parallel marginal hyperplanes as follows:
¨ H+ crosses x+ and is parallel to H0: 〈w ⋅ x+〉 + b = 1
¨ H- crosses x- and is parallel to H0: 〈w ⋅ x-〉 + b = -1
¨ No data point lies between these two marginal hyperplanes,
i.e., every training instance satisfies:
〈w ⋅ xi〉 + b ≥ 1, if yi = 1
〈w ⋅ xi〉 + b ≤ -1, if yi = -1    [Eq.3]
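¨ Using the class labels, the two conditions in [Eq.3] can be written as the
single constraint: yi (〈w ⋅ xi〉 + b) ≥ 1, ∀i = 1..r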
10
The margin (1)
¡ Margin (mức lề) is defined as the distance between the two
marginal hyperplanes.
¨ Denote d+ the distance from H0 to H+.
¨ Denote d- the distance from H0 to H-.
¨ (d+ + d-) is the margin.
¡ Remember that the distance from a point xi to the
hyperplane H0 (〈w ⋅ x〉 + b = 0) is computed as:
|〈w ⋅ xi〉 + b| / ||w||    [Eq.4]
¨ Where:
||w|| = √〈w ⋅ w〉 = √(w1² + w2² + … + wn²)    [Eq.5]
11
The margin (2)
¡ So the distance d+ from x+ to H0 is
d+ = |〈w ⋅ x+〉 + b| / ||w|| = |1| / ||w|| = 1 / ||w||    [Eq.6]
¡ Similarly, the distance d- from x- to H0 is
d- = |〈w ⋅ x-〉 + b| / ||w|| = |-1| / ||w|| = 1 / ||w||    [Eq.7]
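¨ Hence the margin is d+ + d- = 2 / ||w||, so maximizing the margin is
equivalent to the primal problem referred to below as [Eq.10]:
Minimize 〈w ⋅ w〉 / 2
Such that yi (〈w ⋅ xi〉 + b) ≥ 1, ∀i = 1..r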
¡ Solving the constrained problem [Eq.10] is equivalent to finding
(w*, b*) = argmin_{w,b} max_{α≥0} { (1/2)〈w ⋅ w〉 - ∑_{i=1}^{r} αi [yi (〈w ⋅ xi〉 + b) - 1] }
17
SVM: learning with max margin (4)
¡ The primal problem [Eq.10] can be derived by solving
max_{α≥0} L(w, b, α) = max_{α≥0} { (1/2)〈w ⋅ w〉 - ∑_{i=1}^{r} αi [yi (〈w ⋅ xi〉 + b) - 1] }
while its dual counterpart comes from
min_{w,b} L(w, b, α) = min_{w,b} { (1/2)〈w ⋅ w〉 - ∑_{i=1}^{r} αi [yi (〈w ⋅ xi〉 + b) - 1] }
¡ At the optimum, the Lagrange multipliers satisfy
αi ≥ 0    [Eq.15]
αi (yi (〈w ⋅ xi〉 + b) - 1) = 0    [Eq.16]
¡ The last equation [Eq.16] comes from a nice result of duality
theory (complementary slackness).
¨ Note: any αi > 0 implies that the associated point xi lies on a
marginal hyperplane (H+ or H-).
¨ Such a boundary point is called a support vector.
¨ A non-support vector corresponds to αi = 0.
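¨ For instance, after fitting scikit-learn's SVC one can inspect which training
points become support vectors (toy data with made-up values):
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1, 1, -1, -1])

    clf = SVC(kernel='linear', C=1e6)   # a very large C approximates the hard margin
    clf.fit(X, y)

    print(clf.support_)           # indices of support vectors (points with alpha_i > 0)
    print(clf.support_vectors_)   # the support vectors themselves
    print(clf.dual_coef_)         # the products alpha_i * y_i for each support vector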
19
SVM: learning with max margin (5)
¡ In general, the KKT conditions do not guarantee the
optimality of the solution.
¡ Fortunately, due to the convexity of the primal problem
[Eq.10], the KKT conditions are both necessary and
sufficient for global optimality: any solution satisfying
all KKT conditions provides the globally optimal classifier.
¨ Convex optimization is 'easy' in the sense that we can always
find a good solution with a provable guarantee.
¨ There are many algorithms in the literature, but most are
iterative.
¡ In fact, it is hard to derive an efficient algorithm for problem
[Eq.10] directly. Therefore, its dual problem is preferable.
20
SVM: the dual form (1)
¡ Remember that the dual counterpart of [Eq.10] is
min_{w,b} L(w, b, α) = min_{w,b} { (1/2)〈w ⋅ w〉 - ∑_{i=1}^{r} αi [yi (〈w ⋅ xi〉 + b) - 1] }
¡ When the two classes are not linearly separable, slack variables ξi
are introduced and the learning problem becomes:
Minimize 〈w ⋅ w〉 / 2 + C ∑_{i=1}^{r} ξi
Such that yi (〈w ⋅ xi〉 + b) ≥ 1 - ξi, ∀i = 1..r
          ξi ≥ 0, ∀i = 1..r
¡ This problem is called Soft-margin SVM.
¡ It is equivalent to minimizing the following function:
(1/r) ∑_{i=1}^{r} max(0, 1 - yi (〈w ⋅ xi〉 + b)) + λ ||w||²
¨ max(0, 1 - yi (〈w ⋅ xi〉 + b)) is called the Hinge loss.
¨ Some popular losses: squared error, cross entropy, hinge.
¨ λ > 0 is a constant.
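¨ A small NumPy sketch of this objective (the helper name and the value of λ are
illustrative only):
    import numpy as np

    def soft_margin_objective(w, b, X, y, lam=0.1):
        """(1/r) * sum_i max(0, 1 - yi*(<w, xi> + b)) + lam * ||w||^2."""
        margins = y * (X @ w + b)                # yi*(<w, xi> + b) for every instance
        hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss per instance
        return hinge.mean() + lam * np.dot(w, w)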
29
The new optimization problem
¡ Its Lagrange function is
L = (1/2)〈w ⋅ w〉 + C ∑_{i=1}^{r} ξi - ∑_{i=1}^{r} αi [yi (〈w ⋅ xi〉 + b) - 1 + ξi] - ∑_{i=1}^{r} μi ξi    [Eq.22]
¨ Where αi (≥ 0) and μi (≥ 0) are the Lagrange multipliers.
30
Karush-Kuhn-Tucker conditions (1)
∂L_P/∂w = w - ∑_{i=1}^{r} αi yi xi = 0    [Eq.23]
∂L_P/∂b = - ∑_{i=1}^{r} αi yi = 0    [Eq.24]
∂L_P/∂ξi = C - αi - μi = 0, ∀i = 1..r    [Eq.25]
31
Karush-Kuhn-Tucker conditions (2)
yi (〈w ⋅ xi〉 + b) - 1 + ξi ≥ 0, ∀i = 1..r    [Eq.26]
ξi ≥ 0    [Eq.27]
αi ≥ 0    [Eq.28]
μi ≥ 0    [Eq.29]
αi (yi (〈w ⋅ xi〉 + b) - 1 + ξi) = 0    [Eq.30]
μi ξi = 0    [Eq.31]
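¨ Substituting [Eq.23]-[Eq.25] into the Lagrangian [Eq.22] eliminates w, b, ξi
and μi (the terms C∑ξi, -∑αiξi and -∑μiξi cancel because C - αi - μi = 0),
leaving the dual objective below:
L_D(α) = ∑_{i=1}^{r} αi - (1/2) ∑_{i,j=1}^{r} αi αj yi yj 〈xi ⋅ xj〉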
32
The dual problem
¡ Maximize L_D(α) = ∑_{i=1}^{r} αi - (1/2) ∑_{i,j=1}^{r} αi αj yi yj 〈xi ⋅ xj〉
¨ Such that    [Eq.32]
∑_{i=1}^{r} αi yi = 0
0 ≤ αi ≤ C, ∀i = 1..r
¡ At the optimal solution:
If αi = 0 then yi (〈w ⋅ xi〉 + b) ≥ 1, and ξi = 0
If 0 < αi < C then yi (〈w ⋅ xi〉 + b) = 1, and ξi = 0
If αi = C then yi (〈w ⋅ xi〉 + b) ≤ 1, and ξi ≥ 0
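¨ From the dual solution, the primal solution is recovered via [Eq.23]
and any support vector xk with 0 < αk < C:
w* = ∑_{xi ∈ SV} αi yi xi,    b* = yk - 〈w* ⋅ xk〉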
39
Non-linear SVM: transformation
¡ Our idea is to map the input x to a new representation,
using a non-linear mapping
φ: X ⟶ F
x ⟼ φ(x)
¡ In the feature space, the original training data
{(x1, y1), (x2, y2), …, (xr, yr)} are represented by
{(φ(x1), y1), (φ(x2), y2), …, (φ(xr), yr)}
40
Non-linear SVM: transformation
¡ Consider the input space to be 2-dimensional, and we
choose the following map
φ: X ⟶ F
(x1, x2) ⟼ (x1², x2², √2·x1x2)
¡ So instance x = (2, 3) will have the representation in the
feature space as
φ(x) = (4, 9, 6√2) ≈ (4, 9, 8.49)
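¨ A quick numerical check of this mapping (phi is an illustrative helper name):
    import numpy as np

    def phi(x):
        """Explicit feature map (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
        x1, x2 = x
        return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

    print(phi((2, 3)))   # -> [4.  9.  8.485...]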
41
Non-linear SVM: learning & prediction
¡ Training problem:
Minimize L_P = 〈w ⋅ w〉 / 2 + C ∑_{i=1}^{r} ξi    [Eq.34]
Such that yi (〈w ⋅ φ(xi)〉 + b) ≥ 1 - ξi, ∀i = 1..r
          ξi ≥ 0, ∀i = 1..r
¡ The dual problem:
Maximize L_D = ∑_{i=1}^{r} αi - (1/2) ∑_{i,j=1}^{r} αi αj yi yj 〈φ(xi) ⋅ φ(xj)〉    [Eq.35]
Such that ∑_{i=1}^{r} αi yi = 0
          0 ≤ αi ≤ C, ∀i = 1..r
¡ Classifier:
f(z) = 〈w*, φ(z)〉 + b* = ∑_{xi ∈ SV} αi yi 〈φ(xi), φ(z)〉 + b*    [Eq.36]
42
Non-linear SVM: difficulties
¡ How to find the mapping?
¨ An intractable problem
¡ The curse of dimensionality
¨ As the dimensionality increases, the volume of the space
increases so fast that the available data become sparse.
¨ This sparsity is problematic.
¨ Increasing the dimensionality will require significantly more
training data.
(No matter how much data we collect, it is still tiny compared to the volume of the space it lives in.)
43
Non-linear SVM: Kernel functions
¡ An explicit form of the transformation is not necessary.
¡ The dual problem:
Maximize L_D = ∑_{i=1}^{r} αi - (1/2) ∑_{i,j=1}^{r} αi αj yi yj 〈φ(xi) ⋅ φ(xj)〉
Such that ∑_{i=1}^{r} αi yi = 0
          0 ≤ αi ≤ C, ∀i = 1..r
¡ Classifier: f(z) = 〈w*, φ(z)〉 + b* = ∑_{xi ∈ SV} αi yi 〈φ(xi), φ(z)〉 + b*
¡ Both need only inner products in the feature space. For the example
map above:
〈x, z〉² = (x1z1 + x2z2)²
        = x1²z1² + 2x1z1x2z2 + x2²z2²
        = 〈(x1², x2², √2·x1x2), (z1², z2², √2·z1z2)〉
        = 〈φ(x), φ(z)〉 = K(x, z)
¨ Where φ(x) = (x1², x2², √2·x1x2).
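¨ A small numeric check that K(x, z) = 〈x, z〉² coincides with the inner product
after the explicit mapping (the vectors are made-up examples):
    import numpy as np

    def phi(x):
        return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

    x = np.array([2.0, 3.0])
    z = np.array([1.0, -1.0])
    print(np.dot(x, z) ** 2)        # K(x, z) = <x, z>^2, computed in the input space
    print(np.dot(phi(x), phi(z)))   # the same value via the explicit feature map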
¡ Sigmoid kernel:
K(x, z) = tanh(β〈x ⋅ z〉 - λ) = (1 - e^(-2(β〈x⋅z〉 - λ))) / (1 + e^(-2(β〈x⋅z〉 - λ))); where β, λ ∈ ℝ
¡ Scikit-learn in Python:
http://scikit-learn.org/stable/modules/svm.html
¡ SVMlight:
http://www.cs.cornell.edu/people/tj/svm_light/index.html
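¨ For example, the sigmoid kernel above corresponds to scikit-learn's SVC with
kernel='sigmoid', where gamma plays the role of β and coef0 the role of -λ
(the numeric values below are arbitrary, toy-data choices):
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1, 1, -1, -1])

    # scikit-learn's sigmoid kernel is tanh(gamma * <x, z> + coef0)
    clf = SVC(kernel='sigmoid', gamma=0.5, coef0=-1.0, C=1.0)
    clf.fit(X, y)
    print(clf.predict([[1.5, 2.5]]))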
48
References
¡ B. Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage
Data. Springer, 2006.
¡ C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern
Recognition. Data Mining and Knowledge Discovery, 2(2): 121-167, 1998.
¡ C. Cortes and V. Vapnik. Support-Vector Networks. Machine Learning,
20(3): 273-297, 1995.
49
Exercises
¡ What is the main difference between SVM and KNN?
¡ How many support vectors are there in the worst case?
Why?