Introduction to Machine Learning and Data Mining
(Học máy và Khai phá dữ liệu)

Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology

2021

Contents
¡ Introduction to Machine Learning & Data Mining
¡ Unsupervised learning
¡ Supervised learning
¨ Support Vector Machines

¡ Practical advice
Support Vector Machines (1)
¡ Support Vector Machines (SVM) (máy vectơ hỗ trợ) were proposed by
Vapnik and his colleagues in the 1970s, and became famous and
popular in the 1990s.
¡ Originally, SVM is a method for linear classification: it finds a
hyperplane (also called a linear classifier) that separates the
two classes of data.
¡ For non-linear classification, where no hyperplane separates
the data well, kernel functions (hàm nhân) are used.
¨ Kernel functions transform the data into another space, in
which the data become linearly separable.
¡ We speak of linear SVM when no kernel function is used
(in fact, linear SVM uses a linear kernel).
Support Vector Machines (2)
¡ SVM is supported by a strong theory that explains its performance.
¡ It can work well on very high-dimensional problems.
¡ It is one of the most popular and most effective methods.
¡ For text categorization, linear SVM performs very well.
1. SVM: the linearly separable case
¡ Problem representation:
¨ Training data D = {(x1, y1), (x2, y2), …, (xr, yr)} with r instances.
¨ Each xi is a vector in an n-dimensional space, i.e.,
xi = (xi1, xi2, …, xin)^T. Each dimension represents an attribute.
¨ Bold characters denote vectors.
¨ yi is a class label in {-1, 1}: '1' denotes the positive class and
'-1' the negative class.

¡ Linear separability assumption: there exists a hyperplane
(a linear function) that separates the two classes well.
Linear SVM
¡ SVM finds a hyperplane of the form:
f(x) = \langle w \cdot x \rangle + b     [Eq.1]
¨ w is the weight vector; b is a real number (the bias).
¨ ⟨w · x⟩, also written ⟨w, x⟩, denotes the inner product of the two vectors.

¡ Such that for each xi:

y_i = \begin{cases} 1 & \text{if } \langle w \cdot x_i \rangle + b \ge 0 \\ -1 & \text{if } \langle w \cdot x_i \rangle + b < 0 \end{cases}     [Eq.2]
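
¡ For concreteness, here is a minimal NumPy sketch of [Eq.1] and [Eq.2];
the values of w, b, and x are made up for illustration:

import numpy as np

w = np.array([2.0, 1.0])      # hypothetical weight vector
b = -3.0                      # hypothetical bias

def classify(x):
    """Predict the label of x according to [Eq.2]."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([3.0, 1.0])))   # <w, x> + b = 4 >= 0, so prints 1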
Separating hyperplane
¡ The hyperplane (H0) which separates the positive class from the
negative class has the form:

\langle w \cdot x \rangle + b = 0

¡ It is also known as the decision boundary (or decision surface).
¡ But there might be infinitely many separating hyperplanes.
Which one should we choose?

[Figure omitted: several possible separating hyperplanes; from Liu, 2006]
Hyperplane with max margin
¡ SVM selects the hyperplane with the maximum margin.
¡ Statistical learning theory shows that the maximum-margin hyperplane
minimizes an upper bound on the generalization error, among all
possible separating hyperplanes.

[Figure omitted: the maximum-margin separating hyperplane; from Liu, 2006]
Marginal hyperplanes
¡ Assume that the two classes in our data can be separated
clearly by a hyperplane.
¡ Denote by (x+, 1) the point in the positive class and by (x-, -1) the
point in the negative class that are closest to the separating
hyperplane H0 (⟨w · x⟩ + b = 0).
¡ We define two parallel marginal hyperplanes as follows:
¨ H+ crosses x+ and is parallel to H0: ⟨w · x+⟩ + b = 1
¨ H- crosses x- and is parallel to H0: ⟨w · x-⟩ + b = -1
¨ No data point lies between these two marginal hyperplanes,
i.e., every training point satisfies:

\langle w \cdot x_i \rangle + b \ge 1 \text{ if } y_i = 1; \qquad \langle w \cdot x_i \rangle + b \le -1 \text{ if } y_i = -1     [Eq.3]
The margin (1)
¡ The margin (mức lề) is defined as the distance between the two
marginal hyperplanes.
¨ Denote by d+ the distance from H0 to H+.
¨ Denote by d- the distance from H0 to H-.
¨ (d+ + d-) is the margin.
¡ Recall that the distance from a point xi to the hyperplane H0
(⟨w · x⟩ + b = 0) is computed as:

\frac{|\langle w \cdot x_i \rangle + b|}{\|w\|}     [Eq.4]

¨ Where:

\|w\| = \sqrt{\langle w \cdot w \rangle} = \sqrt{w_1^2 + w_2^2 + \dots + w_n^2}     [Eq.5]
The margin (2)
¡ So the distance d+ from x+ to H0 is

d_+ = \frac{|\langle w \cdot x_+ \rangle + b|}{\|w\|} = \frac{|1|}{\|w\|} = \frac{1}{\|w\|}     [Eq.6]

¡ Similarly, the distance d- from x- to H0 is

d_- = \frac{|\langle w \cdot x_- \rangle + b|}{\|w\|} = \frac{|-1|}{\|w\|} = \frac{1}{\|w\|}     [Eq.7]

¡ As a result, the margin is:

\text{margin} = d_+ + d_- = \frac{2}{\|w\|}     [Eq.8]
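
¡ Continuing the small sketch above (same hypothetical w and b), the
distance [Eq.4] and the margin [Eq.8] can be computed as:

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane <w, x> + b = 0, as in [Eq.4]."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

x_plus = np.array([1.0, 2.0])                 # lies on H+: <w, x_plus> + b = 1
print(distance_to_hyperplane(x_plus, w, b))   # 1 / ||w||, as in [Eq.6]
print(2.0 / np.linalg.norm(w))                # the margin 2 / ||w||, as in [Eq.8]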
SVM: learning with max margin (1)
¡ SVM learns a classifier H0 with a maximum margin, i.e., the
hyperplane that has the greatest margin among all
possible hyperplanes.

¡ This learning principle can be formulated as the following
quadratic optimization problem:
¨ Find w and b that maximize

\text{margin} = \frac{2}{\|w\|}

¨ and satisfy the following conditions for every training instance xi:

\begin{cases} \langle w \cdot x_i \rangle + b \ge 1 & \text{if } y_i = 1 \\ \langle w \cdot x_i \rangle + b \le -1 & \text{if } y_i = -1 \end{cases}
SVM: learning with max margin (2)
¡ Learning SVM is equivalent to the following minimization problem:
¨ Minimize

\frac{\langle w \cdot w \rangle}{2}     [Eq.9]

¨ Subject to

\begin{cases} \langle w \cdot x_i \rangle + b \ge 1 & \text{if } y_i = 1 \\ \langle w \cdot x_i \rangle + b \le -1 & \text{if } y_i = -1 \end{cases}

¡ Note that it can be reformulated as:
¨ Minimize

\frac{\langle w \cdot w \rangle}{2}     [Eq.10]  (P)

¨ Subject to

y_i (\langle w \cdot x_i \rangle + b) \ge 1, \quad \forall i = 1..r

¡ This is a constrained optimization problem.
Constrained optimization (1)
¡ Consider the problem:
Minimize f(x) subject to g(x) = 0
¡ Necessary condition: a solution x0 will satisfy

\begin{cases} \left.\dfrac{\partial}{\partial x}\big(f(x) + \alpha\, g(x)\big)\right|_{x=x_0} = 0 \\ g(x) = 0 \end{cases}

¨ Where α is a Lagrange multiplier.
¡ In the case of many constraints (gi(x) = 0 for i = 1…r), a
solution x0 will satisfy:

\begin{cases} \left.\dfrac{\partial}{\partial x}\Big(f(x) + \sum_{i=1}^{r} \alpha_i\, g_i(x)\Big)\right|_{x=x_0} = 0 \\ g_i(x) = 0 \end{cases}
Constrained optimization (2)
¡ Consider the problem with inequality constraints:
Minimize f(x) subject to gi(x) ≤ 0
¡ Necessary condition: a solution x0 will satisfy

\begin{cases} \left.\dfrac{\partial}{\partial x}\Big(f(x) + \sum_{i=1}^{r} \alpha_i\, g_i(x)\Big)\right|_{x=x_0} = 0 \\ g_i(x) \le 0 \end{cases}

¨ Where each αi ≥ 0 is a Lagrange multiplier.
¡ The function

L = f(x) + \sum_{i=1}^{r} \alpha_i\, g_i(x)

is known as the Lagrange function.
¨ x is called the primal variable (biến gốc)
¨ α is called the dual variable (biến đối ngẫu)
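
¡ For concreteness, a tiny worked example: minimize f(x) = x^2 subject to
g(x) = 1 - x ≤ 0.

L(x, \alpha) = x^2 + \alpha (1 - x), \qquad \alpha \ge 0

\frac{\partial L}{\partial x} = 2x - \alpha = 0 \;\Rightarrow\; x = \frac{\alpha}{2}; \qquad \alpha\,(1 - x) = 0

¨ α = 0 would give x = 0, which violates 1 - x ≤ 0; hence α > 0,
so x = 1 and α = 2.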
SVM: learning with max margin (3)
¡ The Lagrange function for problem [Eq.10] is

L(w, b, \alpha) = \frac{1}{2} \langle w \cdot w \rangle - \sum_{i=1}^{r} \alpha_i \big[ y_i (\langle w \cdot x_i \rangle + b) - 1 \big]     [Eq.11a]

¨ Where each αi ≥ 0 is a Lagrange multiplier.
¡ Solving [Eq.10] is equivalent to the following minimax problem:

\arg\min_{w,b} \max_{\alpha \ge 0} L(w, b, \alpha) = \arg\min_{w,b} \max_{\alpha \ge 0} \Big\{ \frac{1}{2} \langle w \cdot w \rangle - \sum_{i=1}^{r} \alpha_i \big[ y_i (\langle w \cdot x_i \rangle + b) - 1 \big] \Big\}     [Eq.11b]
SVM: learning with max margin (4)
¡ The primal problem [Eq.10] can be derived by solving:

\max_{\alpha \ge 0} L(w, b, \alpha) = \max_{\alpha \ge 0} \Big\{ \frac{1}{2} \langle w \cdot w \rangle - \sum_{i=1}^{r} \alpha_i \big[ y_i (\langle w \cdot x_i \rangle + b) - 1 \big] \Big\}

¡ Its dual problem (đối ngẫu) can be derived by solving:

\min_{w,b} L(w, b, \alpha) = \min_{w,b} \Big\{ \frac{1}{2} \langle w \cdot w \rangle - \sum_{i=1}^{r} \alpha_i \big[ y_i (\langle w \cdot x_i \rangle + b) - 1 \big] \Big\}

¡ It is known that the optimal solution to [Eq.10] satisfies a set of
conditions called the Karush-Kuhn-Tucker (KKT) conditions.
SVM: Karush-Kuhn-Tucker

\frac{\partial L}{\partial w} = w - \sum_{i=1}^{r} \alpha_i y_i x_i = 0     [Eq.12]

\frac{\partial L}{\partial b} = - \sum_{i=1}^{r} \alpha_i y_i = 0     [Eq.13]

y_i (\langle w \cdot x_i \rangle + b) - 1 \ge 0, \quad \forall i = 1..r     [Eq.14]

\alpha_i \ge 0     [Eq.15]

\alpha_i \big( y_i (\langle w \cdot x_i \rangle + b) - 1 \big) = 0     [Eq.16]

¡ The last equation [Eq.16] comes from a nice result in duality theory
(complementary slackness).
¨ Note: αi > 0 implies that the associated point xi lies on a marginal
hyperplane (H+ or H-).
¨ Such a boundary point is called a support vector.
¨ A non-support vector corresponds to αi = 0.
SVM: learning with max margin (5)
¡ In general, the KKT conditions do not guarantee the optimality of a
solution.
¡ Fortunately, due to the convexity of the primal problem [Eq.10], the
KKT conditions are both necessary and sufficient for global optimality.
This means any vector satisfying all the KKT conditions provides the
globally optimal classifier.
¨ Convex optimization is 'easy' in the sense that we can always find a
good solution with a provable guarantee.
¨ There are many algorithms in the literature, but most are iterative.
¡ In fact, it is hard to derive an efficient algorithm directly for problem
[Eq.10]. Therefore, its dual problem is preferable.
SVM: the dual form (1)
¡ Recall that the dual counterpart of [Eq.10] is

\min_{w,b} L(w, b, \alpha) = \min_{w,b} \Big\{ \frac{1}{2} \langle w \cdot w \rangle - \sum_{i=1}^{r} \alpha_i \big[ y_i (\langle w \cdot x_i \rangle + b) - 1 \big] \Big\}

¡ By taking the gradient of L(w, b, α) with respect to (w, b) and setting it
to zero, we obtain the following dual function:

L_D(\alpha) = \sum_{i=1}^{r} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{r} \alpha_i \alpha_j y_i y_j \langle x_i \cdot x_j \rangle     [Eq.17]
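
¡ The step from [Eq.11a] to [Eq.17]: substituting the stationarity conditions
[Eq.12] and [Eq.13] into L gives

\langle w \cdot w \rangle = \sum_{i,j=1}^{r} \alpha_i \alpha_j y_i y_j \langle x_i \cdot x_j \rangle, \qquad \sum_{i=1}^{r} \alpha_i y_i \langle w \cdot x_i \rangle = \sum_{i,j=1}^{r} \alpha_i \alpha_j y_i y_j \langle x_i \cdot x_j \rangle, \qquad b \sum_{i=1}^{r} \alpha_i y_i = 0

\Rightarrow\; L = \sum_{i=1}^{r} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{r} \alpha_i \alpha_j y_i y_j \langle x_i \cdot x_j \rangle = L_D(\alpha)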
SVM: the dual form (2)
¡ Solving problem [Eq.10] is equivalent to solving its dual problem below:
¨ Maximize

L_D(\alpha) = \sum_{i=1}^{r} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{r} \alpha_i \alpha_j y_i y_j \langle x_i \cdot x_j \rangle     [Eq.18]  (D)

¨ Subject to

\sum_{i=1}^{r} \alpha_i y_i = 0, \qquad \alpha_i \ge 0, \ \forall i = 1..r

¡ The constraints in (D) are much simpler than those of the primal
problem, so deriving an efficient method for this problem may be easier.
¨ However, existing algorithms for this problem are still iterative and
complicated, so we will not discuss any of them in detail.
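
¡ A minimal sketch of solving the dual (D) as a quadratic program, assuming
the cvxopt package is available; the data X, y below are made up:

import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])  # toy inputs
y = np.array([1.0, 1.0, -1.0, -1.0])                                # toy labels
r = len(y)

# (D): maximize sum(a) - 1/2 a'Pa with P_ij = y_i y_j <x_i, x_j>.
# cvxopt minimizes 1/2 a'Pa + q'a, hence q = -1.
P = matrix(np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(r))  # small ridge for stability
q = matrix(-np.ones(r))
G = matrix(-np.eye(r))            # -a_i <= 0, i.e., a_i >= 0
h = matrix(np.zeros(r))
A_eq = matrix(y.reshape(1, -1))   # sum_i a_i y_i = 0
b_eq = matrix(0.0)

alpha = np.ravel(solvers.qp(P, q, G, h, A_eq, b_eq)['x'])
print(alpha)                      # a_i > 0 only for the support vectors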
SVM: the optimal classifier
¡ Once the dual problem is solved for α, we can recover the optimal
solution to problem [Eq.10] by using the KKT conditions.
¡ Let SV be the set of all support vectors
¨ SV is a subset of the training data.
¨ αi > 0 indicates that xi is a support vector.
¡ We can compute w* by using [Eq.12]:

w^* = \sum_{i=1}^{r} \alpha_i y_i x_i = \sum_{x_i \in SV} \alpha_i y_i x_i \qquad \text{(since } \alpha_j = 0 \text{ for any } x_j \notin SV\text{)}

¡ To find b*, we take any index k such that αk > 0:
¨ Then yk(⟨w* · xk⟩ + b*) - 1 = 0, due to [Eq.16];
¨ Hence b* = yk - ⟨w* · xk⟩   (using 1/yk = yk, since yk ∈ {-1, 1}).
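
¡ Continuing the cvxopt sketch above, w* and b* can be recovered as:

SV = alpha > 1e-6                        # support vectors have alpha_i > 0
w_star = (alpha * y)[SV] @ X[SV]         # w* = sum_i alpha_i y_i x_i
k = int(np.argmax(SV))                   # any index with alpha_k > 0
b_star = y[k] - np.dot(w_star, X[k])     # b* = y_k - <w*, x_k>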
SVM: classifying new instances
¡ The decision boundary is

f(x) = \langle w^* \cdot x \rangle + b^* = \sum_{x_i \in SV} \alpha_i y_i \langle x_i \cdot x \rangle + b^* = 0     [Eq.19]

¡ For a new instance z, we compute:

\operatorname{sign}(\langle w^* \cdot z \rangle + b^*) = \operatorname{sign}\Big( \sum_{x_i \in SV} \alpha_i y_i \langle x_i \cdot z \rangle + b^* \Big)     [Eq.20]

¨ If the result is 1, z is assigned to the positive class; otherwise z is
assigned to the negative class.
¡ Note that this classification rule
¨ Depends only on the support vectors.
¨ Only needs to compute a few inner products.
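
¡ [Eq.20] in NumPy, reusing alpha, b_star, X, y, and SV from the sketches above:

def predict(z):
    """Classify a new instance z using only the support vectors, as in [Eq.20]."""
    score = np.sum(alpha[SV] * y[SV] * (X[SV] @ z)) + b_star
    return 1 if score >= 0 else -1

print(predict(np.array([2.5, 2.0])))     # +1 for this made-up point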
2. Soft-margin SVM
¡ What if the two classes are not linearly separable?
¨ Linear separability is an idealized situation that rarely holds in practice.
¨ Data are often noisy or erroneous, which can make the two classes overlap.
¡ In the linearly separable case we solved:
¨ Minimize

\frac{\langle w \cdot w \rangle}{2}

¨ Subject to

y_i (\langle w \cdot x_i \rangle + b) \ge 1, \quad \forall i = 1..r

¡ In the presence of noise or overlap, these constraints may never be
satisfied simultaneously.
¨ That means we cannot solve for w* and b*.
Example of inseparability
¡ Noisy points xa and xb are mis-labeled.
Relaxing the constraints
¡ To cope with noise/errors, we relax the margin constraints by
introducing slack variables ξi (≥ 0):

\langle w \cdot x_i \rangle + b \ge 1 - \xi_i \ \text{ if } y_i = 1; \qquad \langle w \cdot x_i \rangle + b \le -1 + \xi_i \ \text{ if } y_i = -1

¨ For a noisy/erroneous point xi we have ξi > 1;
¨ Otherwise ξi = 0.
¡ Therefore, for the non-separable case we have the following conditions:

y_i (\langle w \cdot x_i \rangle + b) \ge 1 - \xi_i \quad \text{and} \quad \xi_i \ge 0, \qquad \text{for all } i = 1 \ldots r
Penalty on noises/errors
¡ We should include information about the noise/errors in the objective
function when learning
¨ Otherwise, the resulting classifier easily overfits the data.
¡ A penalty term is used, so that learning minimizes

\frac{\langle w \cdot w \rangle}{2} + C \sum_{i=1}^{r} \xi_i^k

¨ Where C (> 0) is the penalty constant (hằng số phạt).
¨ The greater C is, the heavier the penalty on noise/errors.
¡ k = 1 is often used in practice, since it keeps the optimization problem
simple to solve.
The new optimization problem
¡ Minimize

\frac{\langle w \cdot w \rangle}{2} + C \sum_{i=1}^{r} \xi_i     [Eq.21]

¨ Subject to

\begin{cases} y_i (\langle w \cdot x_i \rangle + b) \ge 1 - \xi_i, & \forall i = 1..r \\ \xi_i \ge 0, & \forall i = 1..r \end{cases}

¡ This problem is called soft-margin SVM.
¡ It is equivalent to minimizing the following function:

\frac{1}{r} \sum_{i=1}^{r} \max\big(0,\, 1 - y_i (\langle w \cdot x_i \rangle + b)\big) + \lambda \|w\|^2

¨ max(0, 1 - yi(⟨w · xi⟩ + b)) is called the hinge loss.
¨ Some popular losses: squared error, cross entropy, hinge.
¨ λ > 0 is a constant.
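
¡ The unconstrained hinge-loss objective above in NumPy (a sketch; w, b,
and lam are placeholders, and X, y are the data from the earlier sketches):

def soft_margin_objective(w, b, X, y, lam):
    """(1/r) * sum of hinge losses + lam * ||w||^2."""
    margins = y * (X @ w + b)                  # y_i (<w, x_i> + b) for every i
    hinge = np.maximum(0.0, 1.0 - margins)     # hinge loss per instance
    return hinge.mean() + lam * np.dot(w, w)

print(soft_margin_objective(w_star, b_star, X, y, lam=0.01))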
The new optimization problem
¡ Its Lagrange function is

L = \frac{1}{2} \langle w \cdot w \rangle + C \sum_{i=1}^{r} \xi_i - \sum_{i=1}^{r} \alpha_i \big[ y_i (\langle w \cdot x_i \rangle + b) - 1 + \xi_i \big] - \sum_{i=1}^{r} \mu_i \xi_i     [Eq.22]

¨ Where αi (≥ 0) and μi (≥ 0) are Lagrange multipliers.
Karush-Kuhn-Tucker conditions (1)

\frac{\partial L_P}{\partial w} = w - \sum_{i=1}^{r} \alpha_i y_i x_i = 0     [Eq.23]

\frac{\partial L_P}{\partial b} = - \sum_{i=1}^{r} \alpha_i y_i = 0     [Eq.24]

\frac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \mu_i = 0, \quad \forall i = 1..r     [Eq.25]
Karush-Kuhn-Tucker conditions (2)

y_i (\langle w \cdot x_i \rangle + b) - 1 + \xi_i \ge 0, \quad \forall i = 1..r     [Eq.26]

\xi_i \ge 0     [Eq.27]

\alpha_i \ge 0     [Eq.28]

\mu_i \ge 0     [Eq.29]

\alpha_i \big( y_i (\langle w \cdot x_i \rangle + b) - 1 + \xi_i \big) = 0     [Eq.30]

\mu_i \xi_i = 0     [Eq.31]
The dual problem
¡ Maximize

L_D(\alpha) = \sum_{i=1}^{r} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{r} \alpha_i \alpha_j y_i y_j \langle x_i \cdot x_j \rangle

¨ Subject to

\sum_{i=1}^{r} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \ \forall i = 1..r     [Eq.32]

¡ Note that neither ξi nor μi appears in the dual problem.
¡ This problem is almost identical to [Eq.18] in the linearly separable case.
¡ The only difference is the additional constraint αi ≤ C.
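
¡ In the earlier cvxopt sketch, only the inequality constraints change: we now
need 0 ≤ αi ≤ C (the value of C below is arbitrary):

C = 1.0                                              # hypothetical penalty constant
G = matrix(np.vstack([-np.eye(r), np.eye(r)]))       # encodes -a_i <= 0 and a_i <= C
h = matrix(np.hstack([np.zeros(r), C * np.ones(r)]))
alpha = np.ravel(solvers.qp(P, q, G, h, A_eq, b_eq)['x'])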
Soft-margin SVM: the optimal classifier
¡ Once the dual problem is solved for α, we can recover the optimal
solution to problem [Eq.21].
¡ Let SV be the set of all support/noisy vectors
¨ SV is a subset of the training data.
¨ αi > 0 indicates that xi is a support/noisy vector.
¡ We can compute w* by using [Eq.23]:

w^* = \sum_{i=1}^{r} \alpha_i y_i x_i = \sum_{x_i \in SV} \alpha_i y_i x_i \qquad \text{(since } \alpha_j = 0 \text{ for any } x_j \notin SV\text{)}

¡ To find b*, we take an index k such that C > αk > 0:
¨ Then ξk = 0, due to [Eq.25] and [Eq.31];
¨ And yk(⟨w* · xk⟩ + b*) - 1 = 0, due to [Eq.30].
¨ Hence, b* = yk - ⟨w* · xk⟩
Some notes
¡ From [Eq.25]–[Eq.31] we conclude that

\alpha_i = 0 \;\Rightarrow\; y_i (\langle w \cdot x_i \rangle + b) \ge 1 \text{ and } \xi_i = 0
0 < \alpha_i < C \;\Rightarrow\; y_i (\langle w \cdot x_i \rangle + b) = 1 \text{ and } \xi_i = 0
\alpha_i = C \;\Rightarrow\; y_i (\langle w \cdot x_i \rangle + b) < 1 \text{ and } \xi_i > 0

¡ The classifier can be expressed as a linear combination of a few
training points.
¨ Most training points lie outside the margin area: αi = 0
¨ The support vectors lie on the marginal hyperplanes: 0 < αi < C
¨ The noisy/erroneous points are associated with αi = C
¡ Hence the optimal classifier is a very sparse combination of the
training data.
Soft-margin SVM: classifying new instances
¡ The decision boundary is

f(x) = \langle w^* \cdot x \rangle + b^* = \sum_{x_i \in SV} \alpha_i y_i \langle x_i \cdot x \rangle + b^* = 0     [Eq.19]

¡ For a new instance z, we compute:

\operatorname{sign}(\langle w^* \cdot z \rangle + b^*) = \operatorname{sign}\Big( \sum_{x_i \in SV} \alpha_i y_i \langle x_i \cdot z \rangle + b^* \Big)     [Eq.20]

¨ If the result is 1, z is assigned to the positive class; otherwise z is
assigned to the negative class.
¡ Note: it is important to choose a good value of C, since it significantly
affects the performance of SVM.
¨ We often use a validation set (or cross-validation) to choose a value for C,
as sketched below.
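
¡ A common way to choose C is cross-validation over a grid of candidate
values, assuming scikit-learn is available (the values below are arbitrary):

from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_train, y_train = make_blobs(n_samples=100, centers=2, random_state=0)  # synthetic data

search = GridSearchCV(SVC(kernel='linear'),
                      param_grid={'C': [0.01, 0.1, 1, 10, 100]},
                      cv=5)                     # 5-fold cross-validation
search.fit(X_train, y_train)
print(search.best_params_['C'])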
Linear SVM: summary
¡ Classification is based on a separating hyperplane.
¡ Such a hyperplane is represented as a combination of some support
vectors.
¡ Determining the support vectors reduces to solving a quadratic
programming problem.
¡ In the dual problem and in the separating hyperplane, inner products
can be used in place of the original training data.
¨ This opens the door to learning a nonlinear classifier.
3. Non-linear SVM
¡ Consider the case in which our data are not linearly separable
¨ This often happens in practice
¡ How about using a non-linear function?
¡ Idea of non-linear SVM:
¨ Step 1: transform the input into another space, which often has more
dimensions, so that the projected data are linearly separable
¨ Step 2: use linear SVM in the new space
Non-linear SVM
¡ Input space: the initial representation of the data
¡ Feature space: the new space obtained after the transformation φ(x)

[Figure omitted: the mapping φ from the input space to the feature space]
Non-linear SVM: transformation
¡ The idea is to map the input x to a new representation, using a
non-linear mapping

\phi: X \to F, \qquad x \mapsto \phi(x)

¡ In the feature space, the original training data
{(x1, y1), (x2, y2), …, (xr, yr)} are represented by
{(φ(x1), y1), (φ(x2), y2), …, (φ(xr), yr)}
Non-linear SVM: transformation
¡ Consider a 2-dimensional input space, and choose the following map:

\phi: X \to F, \qquad (x_1, x_2) \mapsto (x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2)

¡ So the instance x = (2, 3) is represented in the feature space as
φ(x) = (4, 9, 8.49)
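
¡ This mapping in NumPy (the function name phi is only for illustration):

import numpy as np

def phi(x):
    """Feature map (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

print(phi(np.array([2.0, 3.0])))    # [4.  9.  8.485...]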
Non-linear SVM: learning & prediction
¡ Training problem:
¨ Minimize

L_P = \frac{\langle w \cdot w \rangle}{2} + C \sum_{i=1}^{r} \xi_i     [Eq.34]

¨ Subject to

\begin{cases} y_i (\langle w \cdot \phi(x_i) \rangle + b) \ge 1 - \xi_i, & \forall i = 1..r \\ \xi_i \ge 0, & \forall i = 1..r \end{cases}

¡ The dual problem:
¨ Maximize

L_D = \sum_{i=1}^{r} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{r} \alpha_i \alpha_j y_i y_j \langle \phi(x_i) \cdot \phi(x_j) \rangle     [Eq.35]

¨ Subject to

\sum_{i=1}^{r} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \ \forall i = 1..r

¡ Classifier:

f(z) = \langle w^* \cdot \phi(z) \rangle + b^* = \sum_{x_i \in SV} \alpha_i y_i \langle \phi(x_i) \cdot \phi(z) \rangle + b^*     [Eq.36]
Non-linear SVM: difficulties
¡ How to find the mapping?
¨ An intractable problem in general
¡ The curse of dimensionality
¨ As the dimensionality increases, the volume of the space grows so
fast that the available data become sparse.
¨ This sparsity is problematic.
¨ Increasing the dimensionality requires significantly more training data.
¨ (However large the collected dataset is, it is still tiny compared with
the volume of the space.)
Non-linear SVM: Kernel functions
¡ An explicit form of the transformation is not necessary.
¡ The dual problem:
¨ Maximize

L_D = \sum_{i=1}^{r} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{r} \alpha_i \alpha_j y_i y_j \langle \phi(x_i) \cdot \phi(x_j) \rangle

¨ Subject to

\sum_{i=1}^{r} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \ \forall i = 1..r

¡ Classifier:

f(z) = \langle w^* \cdot \phi(z) \rangle + b^* = \sum_{x_i \in SV} \alpha_i y_i \langle \phi(x_i) \cdot \phi(z) \rangle + b^*

¡ Both require only the inner product ⟨φ(x), φ(z)⟩.
¡ Kernel trick: non-linear SVM can be trained and applied by replacing
these inner products with evaluations of a kernel function

K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle     [Eq.37]
Kernel functions: example
¡ Polynomial kernel:

K(x, z) = \langle x \cdot z \rangle^d

¡ Consider the polynomial kernel with degree d = 2. For any vectors
x = (x1, x2) and z = (z1, z2):

\langle x \cdot z \rangle^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = \big\langle (x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2) \cdot (z_1^2,\ z_2^2,\ \sqrt{2}\, z_1 z_2) \big\rangle = \langle \phi(x) \cdot \phi(z) \rangle = K(x, z)

¨ Where φ(x) = (x1², x2², √2 x1x2).

¡ Therefore this polynomial kernel equals the inner product of the two
vectors φ(x) and φ(z) in the feature space.
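
¡ A quick numerical check of this identity, reusing the phi function defined
earlier (the test vectors are arbitrary):

x = np.array([2.0, 3.0])
z = np.array([1.0, -1.0])

poly_kernel = np.dot(x, z) ** 2       # K(x, z) = <x, z>^2
explicit = np.dot(phi(x), phi(z))     # <phi(x), phi(z)>
print(poly_kernel, explicit)          # both equal 1.0 here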
Kernel functions: popular choices
¡ Polynomial:

K(x, z) = (\langle x \cdot z \rangle + \theta)^d, \qquad \text{where } \theta \in \mathbb{R},\ d \in \mathbb{N}

¡ Gaussian radial basis function (RBF):

K(x, z) = e^{-\frac{\|x - z\|^2}{2\sigma^2}}, \qquad \text{where } \sigma > 0

¡ Sigmoid:

K(x, z) = \tanh(\beta \langle x \cdot z \rangle - \lambda), \qquad \text{where } \beta, \lambda \in \mathbb{R}

¡ What conditions ensure that a function is a valid kernel?
Mercer's theorem.
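
¡ These kernels as exposed by scikit-learn's SVC (a sketch on synthetic data;
the hyperparameter values are arbitrary):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X_train, y_train = make_moons(n_samples=200, noise=0.2, random_state=0)

for clf in (SVC(kernel='poly', degree=2, coef0=1.0, C=1.0),     # (<x,z> + coef0)^degree
            SVC(kernel='rbf', gamma=0.5, C=1.0),                # gamma = 1 / (2*sigma^2)
            SVC(kernel='sigmoid', gamma=0.5, coef0=-1.0, C=1.0)):
    clf.fit(X_train, y_train)
    print(clf.kernel, clf.score(X_train, y_train))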
SVM: summary
¡ SVM works with real-valued attributes
¨ Any nominal attribute needs to be transformed into a real-valued one
¡ The learning formulation of SVM focuses on 2 classes
¨ How about a classification problem with more than 2 classes?
¨ One-vs-the-rest, one-vs-one: a multiclass problem can be solved by
reducing it to several 2-class problems (see the sketch after this list)
¡ The decision function is simple, but may be hard to interpret
¨ This is more serious if we use kernel functions
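
¡ Both reductions are available in scikit-learn (a sketch on synthetic
3-class data):

from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X_train, y_train = make_blobs(n_samples=300, centers=3, random_state=0)

ovr = OneVsRestClassifier(LinearSVC()).fit(X_train, y_train)  # one binary SVM per class
ovo = OneVsOneClassifier(LinearSVC()).fit(X_train, y_train)   # one binary SVM per pair
print(ovr.score(X_train, y_train), ovo.score(X_train, y_train))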
SVM: some packages
¡ LibSVM:
•http://www.csie.ntu.edu.tw/~cjlin/libsvm/

¡ Linear SVM for large datasets:


•http://www.csie.ntu.edu.tw/~cjlin/liblinear/
•http://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html

¡ Scikit-learn in Python:
•http://scikit-learn.org/stable/modules/svm.html

¡ SVMlight:
•http://www.cs.cornell.edu/people/tj/svm_light/index.html
References
¡ B. Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage
Data. Springer, 2006.
¡ C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern
Recognition. Data Mining and Knowledge Discovery, 2(2): 121-167, 1998.
¡ C. Cortes and V. Vapnik. Support-Vector Networks. Machine Learning,
20(3): 273-297, 1995.
Exercises
¡ What is the main difference between SVM and KNN?
¡ How many support vectors are there in the worst case? Why?
¡ What is the meaning of the constant C in SVM? Compare the role of C
in SVM with that of λ in ridge regression.
