Introduction to Machine Learning and Data Mining (Học máy và Khai phá dữ liệu)
Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology
2021
2
Contents
¡ Introduction to Machine Learning & Data Mining
¡ Unsupervised learning
¡ Supervised learning
¨ Support Vector Machines
¡ Practical advice
3
Support Vector Machines (1)
¡ The Support Vector Machine (SVM) (máy vectơ hỗ trợ) was
proposed by Vapnik and his colleagues in the 1970s, and
became famous and popular in the 1990s.
¡ Originally, SVM is a method for linear classification: it finds a
hyperplane (also called a linear classifier) that separates the
two classes of data.
¡ For non-linear classification, where no hyperplane separates
the data well, kernel functions (hàm nhân) are used.
¨ A kernel function implicitly transforms the data into
another space, in which the data become linearly separable.
¡ The method is often called linear SVM when no kernel function is
used (in fact, linear SVM uses a linear kernel).
4
Support Vector Machines (2)
¡ SVM is supported by a strong theory that explains its performance.
¡ It can work well on very high-dimensional problems.
¡ It is now one of the most popular and strongest methods.
¡ For text categorization, linear SVM performs very well.
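¨ As a concrete illustration, a minimal scikit-learn sketch of a linear SVM for a
two-class text categorization task (the chosen newsgroup categories and the value
of C are only example choices):
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Two example categories; the data is downloaded on first use
    cats = ['rec.autos', 'sci.space']
    train = fetch_20newsgroups(subset='train', categories=cats)
    test = fetch_20newsgroups(subset='test', categories=cats)

    # TF-IDF features + linear SVM (C chosen arbitrarily here)
    model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
    model.fit(train.data, train.target)
    print('test accuracy:', model.score(test.data, test.target))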
5
1. SVM: the linearly separable case
¡ Problem representation:
¨ Training data D = {(x1, y1), (x2, y2), …, (xr, yr)} with r instances.
¨ Each xi is a vector in an n-dimensional space,
e.g., xi = (xi1, xi2, …, xin)T. Each dimension represents an attribute.
¨ Bold characters denote vectors.
¨ yi is a class label in {-1, 1}: '1' denotes the positive class, '-1' the negative class.
[Liu, 2006]
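¨ In code, such a training set is typically stored as an r×n matrix plus a label
vector; a tiny NumPy sketch with made-up values:
    import numpy as np

    # r = 4 instances, n = 2 attributes (toy values for illustration)
    X = np.array([[ 1.0,  2.0],
                  [ 2.0,  3.0],
                  [-1.0, -1.5],
                  [-2.0, -0.5]])   # row i is the vector xi
    y = np.array([1, 1, -1, -1])   # labels yi in {-1, +1}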
8
Hyperplane with max margin
¡ SVM selects the hyperplane with max margin.
(i.e., SVM finds the separating hyperplane that has the largest margin)
¡ It can be proven that the max-margin hyperplane yields the smallest
generalization-error bound among all possible separating hyperplanes.
[Liu, 2006]
9
Marginal hyperplanes
¡ Assume that the two classes in our data can be separated
clearly by a hyperplane.
¡ Let (x+, 1) in the positive class and (x-, -1) in the negative class
be the instances closest to the separating hyperplane H0:
〈w ⋅ x〉 + b = 0
¡ We define two parallel marginal hyperplanes as follows:
¨ H+ crosses x+ and is parallel to H0: 〈w ⋅ x+〉 + b = 1
¨ H- crosses x- and is parallel to H0: 〈w ⋅ x-〉 + b = -1
¨ No data point lies between these two marginal hyperplanes,
i.e., every training instance satisfies:
〈w ⋅ xi〉 + b ≥ 1, if yi = 1
〈w ⋅ xi〉 + b ≤ -1, if yi = -1    [Eq.3]
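¨ Using the class labels, the two conditions in [Eq.3] can be written as the
single constraint: yi (〈w ⋅ xi〉 + b) ≥ 1, ∀i = 1..r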
10
The margin (1)
¡ Margin (mức lề) is defined as the distance between the two
marginal hyperplanes.
¨ Denote d+ the distance from H0 to H+.
¨ Denote d- the distance from H0 to H-.
¨ (d+ + d-) is the margin.
¡ Remember that the distance from a point xi to the
hyperplane H0 (〈w ⋅ x〉 + b = 0) is computed as:
|〈w ⋅ xi〉 + b| / ||w||    [Eq.4]
¨ Where:
||w|| = √〈w ⋅ w〉 = √(w1² + w2² + … + wn²)    [Eq.5]
11
The margin (2)
¡ So the distance d+ from x+ to H0 is
d+ = |〈w ⋅ x+〉 + b| / ||w|| = |1| / ||w|| = 1 / ||w||    [Eq.6]
¡ Similarly, the distance d- from x- to H0 is
d- = |〈w ⋅ x-〉 + b| / ||w|| = |-1| / ||w|| = 1 / ||w||    [Eq.7]
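¨ Hence the margin is d+ + d- = 2 / ||w||, so maximizing the margin is
equivalent to the primal problem referred to below as [Eq.10]:
Minimize 〈w ⋅ w〉 / 2
Such that yi (〈w ⋅ xi〉 + b) ≥ 1, ∀i = 1..r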
¡ Solving the constrained problem [Eq.10] is equivalent to finding
(w*, b*) = argmin_{w,b} max_{α≥0} { (1/2)〈w ⋅ w〉 - ∑_{i=1}^{r} αi [yi (〈w ⋅ xi〉 + b) - 1] }
17
SVM: learning with max margin (4)
¡ The primal problem [Eq.10] can be derived by solving
max_{α≥0} L(w, b, α) = max_{α≥0} { (1/2)〈w ⋅ w〉 - ∑_{i=1}^{r} αi [yi (〈w ⋅ xi〉 + b) - 1] }
while its dual counterpart comes from
min_{w,b} L(w, b, α) = min_{w,b} { (1/2)〈w ⋅ w〉 - ∑_{i=1}^{r} αi [yi (〈w ⋅ xi〉 + b) - 1] }
¡ At the optimum, the Lagrange multipliers satisfy
αi ≥ 0    [Eq.15]
αi (yi (〈w ⋅ xi〉 + b) - 1) = 0    [Eq.16]
¡ The last equation [Eq.16] comes from a nice result of duality
theory (complementary slackness).
¨ Note: any αi > 0 implies that the associated point xi lies on a
marginal hyperplane (H+ or H-).
¨ Such a boundary point is called a support vector.
¨ A non-support vector corresponds to αi = 0.
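¨ For instance, after fitting scikit-learn's SVC one can inspect which training
points become support vectors (toy data with made-up values):
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1, 1, -1, -1])

    clf = SVC(kernel='linear', C=1e6)   # a very large C approximates the hard margin
    clf.fit(X, y)

    print(clf.support_)           # indices of support vectors (points with alpha_i > 0)
    print(clf.support_vectors_)   # the support vectors themselves
    print(clf.dual_coef_)         # the products alpha_i * y_i for each support vector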
19
SVM: learning with max margin (5)
¡ In general, the KKT conditions do not guarantee the
optimality of the solution.
¡ Fortunately, due to the convexity of the primal problem
[Eq.10], the KKT conditions are both necessary and
sufficient for global optimality: any solution satisfying
all KKT conditions provides the globally optimal classifier.
¨ Convex optimization is 'easy' in the sense that we can always
find a good solution with a provable guarantee.
¨ There are many algorithms in the literature, but most are
iterative.
¡ In fact, it is hard to derive an efficient algorithm for problem
[Eq.10] directly. Therefore, its dual problem is preferable.
20
SVM: the dual form (1)
¡ Remember that the dual counterpart of [Eq.10] is
min_{w,b} L(w, b, α) = min_{w,b} { (1/2)〈w ⋅ w〉 - ∑_{i=1}^{r} αi [yi (〈w ⋅ xi〉 + b) - 1] }
¡ When the two classes are not linearly separable, slack variables ξi
are introduced and the learning problem becomes:
Minimize 〈w ⋅ w〉 / 2 + C ∑_{i=1}^{r} ξi
Such that yi (〈w ⋅ xi〉 + b) ≥ 1 - ξi, ∀i = 1..r
          ξi ≥ 0, ∀i = 1..r
¡ This problem is called Soft-margin SVM.
¡ It is equivalent to minimizing the following function:
(1/r) ∑_{i=1}^{r} max(0, 1 - yi (〈w ⋅ xi〉 + b)) + λ ||w||²
¨ max(0, 1 - yi (〈w ⋅ xi〉 + b)) is called the Hinge loss.
¨ Some popular losses: squared error, cross entropy, hinge.
¨ λ > 0 is a constant.
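¨ A small NumPy sketch of this objective (the helper name and the value of λ are
illustrative only):
    import numpy as np

    def soft_margin_objective(w, b, X, y, lam=0.1):
        """(1/r) * sum_i max(0, 1 - yi*(<w, xi> + b)) + lam * ||w||^2."""
        margins = y * (X @ w + b)                # yi*(<w, xi> + b) for every instance
        hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss per instance
        return hinge.mean() + lam * np.dot(w, w)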
29
The new optimization problem
¡ Its Lagrange function is
L = (1/2)〈w ⋅ w〉 + C ∑_{i=1}^{r} ξi - ∑_{i=1}^{r} αi [yi (〈w ⋅ xi〉 + b) - 1 + ξi] - ∑_{i=1}^{r} μi ξi    [Eq.22]
¨ Where αi (≥ 0) and μi (≥ 0) are the Lagrange multipliers.
30
Karush-Kuhn-Tucker conditions (1)
∂L_P/∂w = w - ∑_{i=1}^{r} αi yi xi = 0    [Eq.23]
∂L_P/∂b = - ∑_{i=1}^{r} αi yi = 0    [Eq.24]
∂L_P/∂ξi = C - αi - μi = 0, ∀i = 1..r    [Eq.25]
31
Karush-Kuhn-Tucker conditions (2)
yi (〈w ⋅ xi〉 + b) - 1 + ξi ≥ 0, ∀i = 1..r    [Eq.26]
ξi ≥ 0    [Eq.27]
αi ≥ 0    [Eq.28]
μi ≥ 0    [Eq.29]
αi (yi (〈w ⋅ xi〉 + b) - 1 + ξi) = 0    [Eq.30]
μi ξi = 0    [Eq.31]
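¨ Substituting [Eq.23]-[Eq.25] into the Lagrangian [Eq.22] eliminates w, b, ξi
and μi (the terms C∑ξi, -∑αiξi and -∑μiξi cancel because C - αi - μi = 0),
leaving the dual objective below:
L_D(α) = ∑_{i=1}^{r} αi - (1/2) ∑_{i,j=1}^{r} αi αj yi yj 〈xi ⋅ xj〉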
32
The dual problem
¡ Maximize L_D(α) = ∑_{i=1}^{r} αi - (1/2) ∑_{i,j=1}^{r} αi αj yi yj 〈xi ⋅ xj〉
¨ Such that    [Eq.32]
∑_{i=1}^{r} αi yi = 0
0 ≤ αi ≤ C, ∀i = 1..r
¡ At the optimal solution:
If αi = 0 then yi (〈w ⋅ xi〉 + b) ≥ 1, and ξi = 0
If 0 < αi < C then yi (〈w ⋅ xi〉 + b) = 1, and ξi = 0
If αi = C then yi (〈w ⋅ xi〉 + b) ≤ 1, and ξi ≥ 0
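¨ From the dual solution, the primal solution is recovered via [Eq.23]
and any support vector xk with 0 < αk < C:
w* = ∑_{xi ∈ SV} αi yi xi,    b* = yk - 〈w* ⋅ xk〉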
39
Non-linear SVM: transformation
¡ Our idea is to map the input x to a new representation,
using a non-linear mapping
φ: X ⟶ F
x ⟼ φ(x)
¡ In the feature space, the original training data
{(x1, y1), (x2, y2), …, (xr, yr)} are represented by
{(φ(x1), y1), (φ(x2), y2), …, (φ(xr), yr)}
40
Non-linear SVM: transformation
¡ Consider the input space to be 2-dimensional, and we
choose the following map
φ: X ⟶ F
(x1, x2) ⟼ (x1², x2², √2·x1x2)
¡ So instance x = (2, 3) will have the representation in the
feature space as
φ(x) = (4, 9, 6√2) ≈ (4, 9, 8.49)
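¨ A quick numerical check of this mapping (phi is an illustrative helper name):
    import numpy as np

    def phi(x):
        """Explicit feature map (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
        x1, x2 = x
        return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

    print(phi((2, 3)))   # -> [4.  9.  8.485...]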
41
Non-linear SVM: learning & prediction
¡ Training problem:
Minimize L_P = 〈w ⋅ w〉 / 2 + C ∑_{i=1}^{r} ξi    [Eq.34]
Such that yi (〈w ⋅ φ(xi)〉 + b) ≥ 1 - ξi, ∀i = 1..r
          ξi ≥ 0, ∀i = 1..r
¡ The dual problem:
Maximize L_D = ∑_{i=1}^{r} αi - (1/2) ∑_{i,j=1}^{r} αi αj yi yj 〈φ(xi) ⋅ φ(xj)〉    [Eq.35]
Such that ∑_{i=1}^{r} αi yi = 0
          0 ≤ αi ≤ C, ∀i = 1..r
¡ Classifier:
f(z) = 〈w*, φ(z)〉 + b* = ∑_{xi ∈ SV} αi yi 〈φ(xi), φ(z)〉 + b*    [Eq.36]
42
Non-linear SVM: difficulties
¡ How to find the mapping?
¨ An intractable problem
¡ The curse of dimensionality
¨ As the dimensionality increases, the volume of the space
increases so fast that the available data become sparse.
¨ This sparsity is problematic.
¨ Increasing the dimensionality will require significantly more
training data.
(No matter how much data we collect, it is still tiny compared to the volume of the space it lives in.)
43
Non-linear SVM: Kernel functions
¡ An explicit form of the transformation is not necessary.
¡ The dual problem:
Maximize L_D = ∑_{i=1}^{r} αi - (1/2) ∑_{i,j=1}^{r} αi αj yi yj 〈φ(xi) ⋅ φ(xj)〉
Such that ∑_{i=1}^{r} αi yi = 0
          0 ≤ αi ≤ C, ∀i = 1..r
¡ Classifier: f(z) = 〈w*, φ(z)〉 + b* = ∑_{xi ∈ SV} αi yi 〈φ(xi), φ(z)〉 + b*
¡ Both need only inner products in the feature space. For the example
map above:
〈x, z〉² = (x1z1 + x2z2)²
        = x1²z1² + 2x1z1x2z2 + x2²z2²
        = 〈(x1², x2², √2·x1x2), (z1², z2², √2·z1z2)〉
        = 〈φ(x), φ(z)〉 = K(x, z)
¨ Where φ(x) = (x1², x2², √2·x1x2).
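¨ A small numeric check that K(x, z) = 〈x, z〉² coincides with the inner product
after the explicit mapping (the vectors are made-up examples):
    import numpy as np

    def phi(x):
        return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

    x = np.array([2.0, 3.0])
    z = np.array([1.0, -1.0])
    print(np.dot(x, z) ** 2)        # K(x, z) = <x, z>^2, computed in the input space
    print(np.dot(phi(x), phi(z)))   # the same value via the explicit feature map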
¡ Sigmoid kernel:
K(x, z) = tanh(β〈x ⋅ z〉 - λ) = (1 - e^(-2(β〈x⋅z〉 - λ))) / (1 + e^(-2(β〈x⋅z〉 - λ))); where β, λ ∈ ℝ
¡ Scikit-learn in Python:
http://scikit-learn.org/stable/modules/svm.html
¡ SVMlight:
http://www.cs.cornell.edu/people/tj/svm_light/index.html
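¨ For example, the sigmoid kernel above corresponds to scikit-learn's SVC with
kernel='sigmoid', where gamma plays the role of β and coef0 the role of -λ
(the numeric values below are arbitrary, toy-data choices):
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1, 1, -1, -1])

    # scikit-learn's sigmoid kernel is tanh(gamma * <x, z> + coef0)
    clf = SVC(kernel='sigmoid', gamma=0.5, coef0=-1.0, C=1.0)
    clf.fit(X, y)
    print(clf.predict([[1.5, 2.5]]))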
48
References
¡ B. Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage
Data. Springer, 2006.
¡ C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern
Recognition. Data Mining and Knowledge Discovery, 2(2): 121-167, 1998.
¡ C. Cortes and V. Vapnik. Support-Vector Networks. Machine Learning,
20(3): 273-297, 1995.
49
Exercises
¡ What is the main difference between SVM and KNN?
¡ How many support vectors are there in the worst case?
Why?