CHAPTER V: SUPPORT VECTOR MACHINE (SVM)
This chapter introduces one of the most widely used methods in data mining: SVM. Highly
regarded for its performance, this method often serves as a benchmark against which other
methods are compared.
5.1 INTRODUCTION TO SVM:
SVM (Support Vector Machine) is a binary classification method based on supervised
learning. It was introduced by Cortes and Vapnik in 1995 [12]. The method searches for a
linear separator of the data in a given space. As a two-class classification problem, it uses a
training dataset to estimate the model parameters. It relies on so-called kernel functions,
which allow an optimal separation of the data.
Initially, the method was proposed by its authors for two-class classification, but it was
later generalised to n classes.
5.2 BASIC CONCEPTS:
For two given classes of examples, the aim of SVM is to find a classifier that will
separate the data and maximise the distance between the two classes. With SVM, this
classifier is a linear classifier called a hyperplane.
In the figure below, we need to find a hyperplane that separates two classes of points. Class 1
is represented by plus symbols. Class 2 is represented by round symbols.
Fig 5.1 SVM goal: separate the data
5.2.1 Hyperplane, support vectors, margin
The closest points, which are the only ones used to determine the hyperplane, are called
support vectors. They are highlighted in the figure. Clearly, we must have at least one support
vector for each class.
Fig 5.2 Hyperplane and support vectors
Obviously, there is a multitude of valid hyperplanes, but the remarkable property of SVMs is
that this hyperplane must be optimal. We are therefore going to look for the hyperplane that
passes "in the middle" of the points of the two classes of examples. Intuitively, this is like
looking for the "safest" hyperplane. If an example has not been described perfectly, a small
variation will not change its classification if its distance from the hyperplane is large.
Formally, this amounts to looking for a hyperplane whose minimum distance to the training
examples is the maximum. This distance is called the "margin" between the hyperplane and
the examples. The optimal separating hyperplane is the one that maximises the margin. As we
are looking to maximise this margin, we will talk about separators with a wide margin.
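To make these notions concrete, here is a minimal sketch (assuming the scikit-learn library is available; the data and parameter values are purely illustrative) that fits a linear SVM on a small two-dimensional dataset and reads off the support vectors, the separating hyperplane and the margin width.

# Minimal sketch, assuming scikit-learn: fit a linear SVM and inspect
# the support vectors, the separating hyperplane and the margin.
import numpy as np
from sklearn.svm import SVC

# Illustrative toy data: two well-separated classes in the plane.
X = np.array([[1, 2], [2, 1], [2, 2], [6, 5], [7, 7], [6, 6]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # very large C: (almost) hard margin
clf.fit(X, y)

w = clf.coef_[0]                    # normal vector of the hyperplane
b = clf.intercept_[0]               # bias term
print("Support vectors:", clf.support_vectors_)
print("Hyperplane: %.2f*x + %.2f*y + %.2f = 0" % (w[0], w[1], b))
print("Margin width:", 2 / np.linalg.norm(w))

The support vectors printed here are the points closest to the hyperplane, exactly as in figure 5.2.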
5.2.2 Why maximise the margin?
Intuitively, having a wider margin provides greater confidence when classifying a new
example. Moreover, the classifier that leaves the widest margin on the training data can also
be expected to classify new examples best. In figure 5.3, the right-hand side shows that with
the optimal hyperplane, a new example remains well classified even though it falls within the
margin. On the left, with a smaller margin, the same example is misclassified.
Fig 5.3 SVM objective: find the best separating hyperplane giving the largest margin (left: hyperplane with low margin; right: best separating hyperplane)
In general, the classification of a new unknown example is given by its position relative to the
optimal hyperplane.
5.2.3 Example of SVM use
To illustrate the principle of SVM, consider the following set containing data from two
classes C1 and C2 (table 5.1). Let us carry out the following tasks:
Represent the data on a plane.
Find, approximately, the optimal separating hyperplane (a straight line).
Give its equation.
Consider the new data point (7, 4). Will it be classified in C1 or C2?
Table 5.1 Dataset for SVM
X Y Class
1 2 C2
2 4 C2
3 3 C2
4 2 C2
4 4 C2
6 6 C1
8 4 C1
6 8 C1
7 8 C1
5 9 C1
9 9 C1
Figure 5.4 shows the graphical representation of the data. Approximately, we can see that the
support vectors on the C1 side are the points (6, 6) and (8, 4). On the C2 side, the support
vector is the point (4, 4). The equation of the separator is y = -x + 10.
From these considerations, we can conclude that the new point (7, 4) will be assigned to class
C1: since 7 + 4 = 11 > 10, it lies on the same side of the separator as the points of C1.
It is important to note that in this simple example, we have designated the support
vectors visually so that the equation of the separator can easily be written down. However, as
we will see in the mathematical formulation of the SVM problem, finding the equation of the
optimal separator requires solving a constrained optimisation problem.
Fig 5.4 Example of the use of SVM.
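The same example can be checked numerically. The following sketch (assuming scikit-learn; the hard-margin setting is approximated with a very large C) trains a linear SVM on the data of table 5.1 and classifies the new point (7, 4).

# Minimal sketch, assuming scikit-learn: reproduce the example of table 5.1.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 4], [3, 3], [4, 2], [4, 4],           # class C2
              [6, 6], [8, 4], [6, 8], [7, 8], [5, 9], [9, 9]])  # class C1
y = np.array(["C2"] * 5 + ["C1"] * 6)

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.support_vectors_)   # expected to contain (4, 4), (6, 6) and (8, 4)
print(clf.predict([[7, 4]]))  # expected: ['C1']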
5.3 LINEARITY AND NON-LINEARITY
SVM models cover both the linearly separable case and the non-linearly separable case. The
former is the simplest, since a linear classifier can be found directly. However, in most real-
world problems no linear separation of the data is possible, and the maximum-margin
classifier cannot be used as such, because it only works when the classes of the training data
are linearly separable.
Figure 5.5 summarises this linearity problem. In the figure on the left, the data is linearly
separable and a linear classifier exists. In the figure on the right, the data is not linearly
separable and such a classifier cannot be found directly; the data must first be transformed,
as explained below.
Fig 5.5 Linearly separable data (left) vs. non-linearly separable data (right)
5.3.1 Solving linear cases:
In cases where the data is linearly separable, finding the optimal hyperplane with SVM
reduces to solving a constrained quadratic programming problem.
A linear model corresponds to the following general equation:
f(x) = w.x + b (5.1)
The separating hyperplane (decision frontier) therefore has the equation:
w.x + b = 0 (5.2)
The distance of a point x of the plane from the separating hyperplane is given by:
d(x) = |w.x + b| / ||w|| (5.3)
where ||w|| is the norm of the vector w.
The optimal hyperplane is the one for which the distance to the nearest points (the margin) is
maximum. Let x1 and x2 be the nearest points (support vectors) of the two classes, with
f(x1) = +1 and f(x2) = -1.
We can therefore write:
(w.x1) + b = +1 and (w.x2) + b = -1 (5.4)
So, we have:
w.(x1 - x2) = 2 (5.5)
If we divide both sides of the previous equation by the norm of w, we obtain:
w.(x1 - x2) / ||w|| = 2 / ||w|| (5.6)
We can therefore deduce that maximising the margin is equivalent to minimising ||w|| under
certain constraints:
minimise (1/2)||w||²  subject to  yi (w.xi + b) ≥ 1 for every training example xi of label yi (5.7)
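As an illustration, problem (5.7) can be handed to a generic constrained optimiser on a tiny toy dataset. The sketch below assumes NumPy and SciPy are available and is only meant to show the structure of the problem; real SVM implementations use dedicated quadratic-programming or SMO solvers instead.

# Minimal sketch, assuming NumPy and SciPy: solve the hard-margin primal
# (5.7), i.e. minimise (1/2)||w||^2 subject to yi (w.xi + b) >= 1.
import numpy as np
from scipy.optimize import minimize

# Illustrative linearly separable data.
X = np.array([[1., 2.], [2., 1.], [6., 6.], [7., 5.]])
y = np.array([-1., -1., 1., 1.])

def objective(p):                  # p = (w1, w2, b)
    w = p[:2]
    return 0.5 * np.dot(w, w)

constraints = [{"type": "ineq",
                "fun": lambda p, xi=xi, yi=yi: yi * (np.dot(p[:2], xi) + p[2]) - 1}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))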
5.3.2 Solving non-linear cases:
To overcome the drawbacks of non-linearly separable cases, the idea behind SVMs is
to change the data space. The non-linear transformation of the data can allow a linear
separation of the examples in a new space. We therefore have a change of dimension. This
new dimension is called the "re-description space". Intuitively, the higher the dimension of
the re-description space, the higher the probability of finding a separating hyperplane between
the examples. This is illustrated by figure 5.6.
Fig 5.6 Transformation of non-linear separable data to another space.
We therefore have a transformation of a non-linear separation problem in the representation
space into a linear separation problem in a higher-dimensional re-description space. This
non-linear transformation is performed using a kernel function; commonly used kernels
include the polynomial, Gaussian, sigmoid and Laplacian kernels.
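In libraries such as scikit-learn, changing the kernel amounts to changing a single parameter. The sketch below is illustrative (the dataset and kernel choices are arbitrary, and the Laplacian kernel is not built in, so it would have to be supplied as a custom kernel); it compares a few kernels on concentric-circle data that is not linearly separable.

# Minimal sketch, assuming scikit-learn: the same SVM with different kernels.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Non-linearly separable data: two concentric circles.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):   # rbf = Gaussian kernel
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))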
Example: Let us take the XOR function, which produces data that is not linearly separable
(table 5.2).
Table 5.2: XOR function giving non-linearly separable data
X Y Class
0 0 0
1 0 1
0 1 1
1 1 0
Figure 5.7 shows the data from the XOR function on a plane. Result 0 (class 0) is represented
by an empty circle. Result 1 (class 1) is represented by a solid circle.
Fig 5.7 XOR function giving non-linear separable data.
The data resulting from the XOR function is not linearly separable. Therefore, we cannot
directly find a linear separator to apply SVM. To do this, we apply a polynomial
transformation function that maps the pair (x, y) to the triplet (x, y, x * y). Table 5.3
summarises this transformation.
Table 5.3: Using the polynomial function to transform the data
X Y X*Y Class
0 0 0 0
1 0 0 1
0 1 0 1
1 1 1 0
The transformation applied takes us from a 2-dimensional space to a 3-dimensional space.
Let's represent the data from this transformation (figure 5.8). We can see visually that the data
has become linearly separable in the new space. We can now apply SVM.
Fig 5.8 Linearly separable data after transformation into a new space.
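This transformation can be checked in a few lines. The sketch below (assuming scikit-learn) adds the feature x*y by hand, as in table 5.3, and fits a linear SVM in the resulting three-dimensional space.

# Minimal sketch, assuming scikit-learn: XOR becomes linearly separable
# after the transformation (x, y) -> (x, y, x*y).
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
y = np.array([0, 1, 1, 0])                       # XOR labels

X3 = np.column_stack([X, X[:, 0] * X[:, 1]])     # add the feature x*y
clf = SVC(kernel="linear", C=1e6).fit(X3, y)
print(clf.predict(X3))                           # expected: [0 1 1 0]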
5.4 MULTI-CLASS SVM
The SVM method was originally introduced to classify data belonging to two classes.
But later it was generalised to n classes. Figure 5.9 shows a multi-class case.
Fig 5.9 How to use SVM in the case of multi-class data?
There are several ways of adapting two-class SVMs to the multi-class case, including the one-
versus-all approach and the one-versus-one approach. The choice will depend on the size of
the problem.
1. The one versus all approach: consists of training a two-class SVM using the elements
of one class against all the others. This amounts to solving c SVM problems (where c is
the number of classes), each of size n (the number of training examples).
2. The one versus one approach: consists of training c(c-1)/2 SVMs, one for each pair of
classes, then deciding the winning class either by a majority vote or by post-processing
the results using a posteriori probability estimates.
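Both strategies are available as generic wrappers in libraries such as scikit-learn. The sketch below (the iris dataset is used purely as an illustrative multi-class example) builds one-versus-all and one-versus-one classifiers on top of a binary linear SVM.

# Minimal sketch, assuming scikit-learn: the two multi-class strategies.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                # 3 classes, so c = 3

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # c binary SVMs
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # c(c-1)/2 binary SVMs
print("one-versus-all accuracy:", ovr.score(X, y))
print("one-versus-one accuracy:", ovo.score(X, y))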
5.5 CRITICISM OF SVM
The main advantages of SVM are:
High-dimensional efficiency: SVMs are efficient in high-dimensional spaces, such as
those encountered in image classification and bioinformatics.
Efficiency with a small number of samples: they continue to perform well even when
the number of samples is relatively small compared with the number of dimensions.
Versatility: By selecting the appropriate kernel, SVMs can be adapted to a wide
variety of problems.
The main disadvantage of SVM is the cost in computation time, especially if the dataset is
large, the number of attributes is high, the data are multi-class and they are not linearly
separable.
CONCLUSION OF THE CHAPTER
In this chapter, we have presented SVM, which is considered to be one of the most powerful
methods in data mining. We have described its principle, which is to search for an optimal
data separator. We explained the transformation that needs to be made (using a kernel
function) when the data is not linearly separable. We end the chapter with the most commonly
used approaches for handling multi-class data. The following are theoretical and practical
exercises on the use of SVM.