Support vector machine and
Support vector regression
Support Vector Machine
The support vector machine (SVM) is a supervised classification technique.
Consider three candidate separating hyperplanes: H1 does not separate the classes. H2 does, but
only with a small margin. H3 separates them with the maximal margin.
This maximum-margin separator is determined
by a subset of the datapoints.
Datapoints in this subset are called “support vectors”.
It will be useful computationally if only a small fraction of the datapoints are support
vectors, because we use the support vectors to decide which side of the separator a test
case is on.
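As a quick illustration of this point, scikit-learn (used again later in this section) exposes the support vectors of a fitted classifier directly; the tiny two-cluster data set below is made up purely for illustration:
import numpy as np
from sklearn import svm
# Made-up 2-D data: two roughly separated clusters
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = svm.SVC(kernel='linear')
clf.fit(X, y)
print(clf.support_vectors_)    # the support vectors themselves
print(clf.n_support_)          # number of support vectors per class
print(clf.predict([[4, 4]]))   # prediction depends only on the support vectors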
Maximum Margin Hyperplane in SVM
• In the case of support vector machines, a data point is viewed as a p-dimensional
vector, and we want to know whether we can separate such points with a (p−1)-dimensional
hyperplane. This is called a linear classifier.
• So we choose the hyperplane so that the distance from it to the nearest data point on
each side is maximized. If such a hyperplane exists, it is known as the maximum-
margin hyperplane and the linear classifier it defines is known as a maximum-margin
classifier.
Linear SVM formulation
The margin $\mathrm{width} = \frac{2}{\lVert \mathbf{w} \rVert}$ needs to be maximized; equivalently, $\lVert \mathbf{w} \rVert^2$ should be as small as possible.
Constraints:
$\mathbf{w} \cdot x_i + b \ge +1$ for positive cases
$\mathbf{w} \cdot x_i + b \le -1$ for negative cases
This is an optimization problem!
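Writing the labels as $y_i \in \{+1, -1\}$, the two constraints can be combined and the whole problem stated as a single quadratic program (a standard restatement of the conditions above):
$$
\min_{\mathbf{w},\, b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2
\quad \text{subject to} \quad
y_i\,(\mathbf{w} \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n.
$$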
Linear SVM formulation
• In the context of a linear SVM, the decision boundary can be represented as a hyperplane
defined by the equation $\mathbf{w} \cdot x + b = 0$, where $x$ is the input feature vector (a small numeric sketch follows this list).
• $\mathbf{w}$ (weights or coefficients): a vector of weights associated with the features in
the input data. The weights determine the orientation of the decision boundary and how much
each feature contributes to the decision-making process.
• $b$ (bias or intercept): a scalar value that shifts the decision boundary away from the origin,
providing additional flexibility for the model to achieve better separation between the classes.
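A minimal sketch of how $\mathbf{w}$ and $b$ are used at prediction time (the parameter and input values here are made up for illustration): a test point is classified by the sign of $\mathbf{w} \cdot x + b$.
import numpy as np
# Hypothetical learned parameters, for illustration only
w = np.array([0.4, -1.2])    # weights: set the orientation of the hyperplane
b = 0.5                      # bias: shifts the hyperplane away from the origin
x = np.array([2.0, 1.0])     # a test point
score = np.dot(w, x) + b     # which side of the hyperplane the point falls on
label = +1 if score >= 0 else -1
print(score, label)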
Hard Margin: If the training data is linearly separable,
we can select two parallel hyperplanes that separate
the two classes of data, so that the distance between
them is as large as possible.
Linear SVM formulation
To extend SVM to cases in which the data are not linearly separable,
the hinge loss function $L\big(y_i f(x_i)\big)$ is helpful:
$L\big(y_i f(x_i)\big) = \max\big(0,\ 1 - y_i f(x_i)\big)$
The hinge loss penalizes instances that are misclassified or that violate the margin
constraint.
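A minimal sketch of the hinge loss in code (the labels and decision values below are made up):
import numpy as np
y = np.array([1, -1, 1, 1])            # true labels y_i
f = np.array([2.3, -0.4, 0.6, -1.0])   # decision values f(x_i) = w.x_i + b
hinge = np.maximum(0, 1 - y * f)       # zero when the margin is satisfied, linear otherwise
print(hinge)                           # [0.  0.6 0.4 2. ]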
For linear SVM, the primal form is fine!
$\mathbf{w} \cdot x_i + b \ge +1 - \xi_i$ for positive cases
$\mathbf{w} \cdot x_i + b \le -1 + \xi_i$ for negative cases
with $\xi_i \ge 0$ for all $i$,
and $\frac{\lVert \mathbf{w} \rVert^2}{2} + C \sum_{i=1}^{n} \xi_i$ as small as possible.
$\xi_i$ is the slack variable.
$C$ is the hyperparameter.
Explanation- slack variable
• In SVMs, slack variables 𝜉𝑖 are introduced
to allow for some degree of
misclassification when training the model,
making it possible to create a soft-margin
SVM. The concept of slack variables allows
the SVM to handle cases where classes are
not perfectly separable.
Definition: for an instance $i$, $\xi_i \ge 0$ represents the amount by which the margin is
violated. Essentially, slack variables measure how much an observation fails to meet the
margin constraint.
Physical interpretation of C
A smaller C puts less weight on the slack penalty, so more margin violations are tolerated and the margin widens; a larger C puts more weight on the slack penalty, so fewer violations are tolerated and the margin shrinks.
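A quick sketch of this effect with scikit-learn on synthetic, slightly overlapping data (the exact numbers will vary): smaller C values leave more points inside or beyond the margin (more support vectors) and give a wider margin $2/\lVert \mathbf{w} \rVert$.
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs
# Synthetic, slightly overlapping 2-D data
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
for C in [0.01, 1, 100]:
    clf = svm.SVC(kernel='linear', C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_)          # margin width 2/||w||
    print(C, len(clf.support_), round(margin, 3))   # C, number of support vectors, margin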
How algorithm works
For linear SVM, the primal form is fine!
$\mathbf{w} \cdot x_i + b \ge +1 - \xi_i$ for positive cases
$\mathbf{w} \cdot x_i + b \le -1 + \xi_i$ for negative cases
with $\xi_i \ge 0$ for all $i$,
and $\frac{\lVert \mathbf{w} \rVert^2}{2} + C \sum_{i=1}^{n} \xi_i$ as small as possible
($\xi_i$ is the slack variable, $C$ is the hyperparameter).
One method is gradient descent: whenever a training point $(x_i, y_i)$ violates the margin, update
$\mathbf{w} \mathrel{+}= y_i \cdot lr \cdot x_i$
$b \mathrel{+}= y_i \cdot lr$
where $lr$ is the learning rate. Iteratively evaluate the optimum $\mathbf{w}$ and $b$ (example Python code is attached).
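The attached code is not reproduced here, but a minimal sketch of such an iterative scheme, assuming the update above is applied only to points that violate the margin (and, as on the slide, omitting the regularization step on $\mathbf{w}$), might look like this:
import numpy as np

def train_linear_svm(X, y, lr=0.01, epochs=1000):
    # y must contain labels in {-1, +1}
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    for _ in range(epochs):
        for i in range(n):
            # Margin violation: y_i * (w.x_i + b) < 1
            if y[i] * (np.dot(w, X[i]) + b) < 1:
                w += lr * y[i] * X[i]   # w += y_i * lr * x_i
                b += lr * y[i]          # b += y_i * lr
    return w, b

# Tiny made-up example
X = np.array([[2.0, 3.0], [1.0, 1.5], [6.0, 5.0], [7.0, 8.0]])
y = np.array([-1, -1, 1, 1])
w, b = train_linear_svm(X, y)
print(w, b, np.sign(X @ w + b))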
The kernel trick
[Figure: a decision boundary that is curved in the low-dimensional feature space becomes planar (a hyperplane) in the high-dimensional feature space.]
• For many mappings $\phi$ from a low-dimensional space to a high-dimensional space, there is a simple operation on two vectors in the low-dimensional space that can be used to compute the scalar product of their two images in the high-dimensional space:
$K(x^a, x^b) = \phi(x^a) \cdot \phi(x^b)$
The kernel function is simply a function of the low-dimensional feature data: letting the kernel do the work replaces computing the scalar product of $\phi(x^a)$ and $\phi(x^b)$ in the obvious way.
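A small sketch of this idea for the degree-2 polynomial kernel in two dimensions, where the explicit map $\phi(x) = (x_1^2,\ \sqrt{2}x_1x_2,\ x_2^2,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ 1)$ is known, so we can check that the cheap low-dimensional kernel evaluation equals the scalar product of the high-dimensional images:
import numpy as np

def phi(x):
    # Explicit map into the 6-D feature space of the degree-2 polynomial kernel
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

def poly_kernel(a, b, p=2):
    return (np.dot(a, b) + 1) ** p     # K(x, y) = (x.y + 1)^p

xa = np.array([1.0, 2.0])
xb = np.array([3.0, 0.5])
print(np.dot(phi(xa), phi(xb)))   # scalar product in the high-D space: 25.0
print(poly_kernel(xa, xb))        # same value, computed in the low-D space: 25.0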
Different Kernel functions
• Linear Kernel: No mapping is needed as the data is already assumed
to be linearly separable.
• Polynomial Kernel: Maps inputs into a polynomial feature space,
enhancing the classifier’s ability to capture interactions between
features.
• Radial Basis Function (RBF) Kernel: Also known as the Gaussian
kernel, it is useful for capturing complex regions by considering the
distance between points in the input space.
• Sigmoid Kernel: Mimics the behavior of neural networks by using a
sigmoid function as the kernel.
Different Kernel functions
Polynomial: $K(x, y) = (x \cdot y + 1)^p$
Gaussian radial basis function: $K(x, y) = e^{-\lVert x - y \rVert^2 / 2\sigma^2}$
Neural net (sigmoid): $K(x, y) = \tanh(\kappa\, x \cdot y - \delta)$
The parameters ($p$, $\sigma$, $\kappa$, $\delta$) must be chosen by the user.
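These kernels are easy to write down directly in code; a brief sketch (the parameter values below are arbitrary examples):
import numpy as np

def polynomial_kernel(x, y, p=3):
    return (np.dot(x, y) + 1) ** p

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=0.1, delta=0.0):
    return np.tanh(kappa * np.dot(x, y) - delta)

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))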
How algorithm works
The kernel trick is fundamentally a feature of the dual formulation of the Support Vector
Machine (SVM) and cannot be applied directly to the primal form. Hence the dual-form SVM is
often used. The dual formulation is derived from the Lagrangian of the primal problem:
maximize $W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$.
Remember: $K(x_i, x_k)$ is the transformation (kernel) function. Our aim is to evaluate the $\alpha_i$ using optimization. All other parameters, such as $b$, are then evaluated from the $\alpha_i$ using the training data:
$b = y_k - \sum_{i=1}^{n} \alpha_i y_i K(x_i, x_k)$
Finally, the function $f(x)$ is evaluated for prediction:
$f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b$
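As a sanity-check sketch with scikit-learn: the fitted classifier exposes exactly these dual quantities (dual_coef_ holds $\alpha_i y_i$ for the support vectors, intercept_ holds $b$), so the prediction formula above can be reproduced by hand and compared with decision_function. The data below is synthetic.
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = svm.SVC(kernel='rbf', gamma=0.5, C=1.0).fit(X, y)

x_new = X[:3]                                              # a few query points
K = rbf_kernel(x_new, clf.support_vectors_, gamma=0.5)     # K(x, x_i) over the support vectors
f_manual = K @ clf.dual_coef_.ravel() + clf.intercept_     # sum_i alpha_i y_i K(x_i, x) + b
print(f_manual)
print(clf.decision_function(x_new))                        # matches f_manual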
SVM Implementation using Python
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
# Load the breast cancer dataset and split it into training and test sets
cancer = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109)
SVM Implementation using Python
# Generating Model
# Create an SVM Classifier using a linear kernel
clf = svm.SVC(kernel='linear')
# Train the model using the training sets
clf.fit(X_train, y_train)
# Predict the response for the test dataset
y_pred = clf.predict(X_test)
# Evaluating the Model
# Model Accuracy: how often is the classifier correct?
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Model Precision: what percentage of tuples labeled positive are actually positive?
precision = metrics.precision_score(y_test, y_pred)
print("Precision:", precision)
# Model Recall: what percentage of positive tuples are labelled as such?
recall = metrics.recall_score(y_test, y_pred)
print("Recall:", recall)
Support vector regression
Support vector regression (SVR) is a
type of support vector machine (SVM)
that is used for regression tasks. It
tries to find a function that best
predicts the continuous output value
for a given input value.
Minimization problem, subject to the $\varepsilon$-insensitive constraints:
$y_i - \big(\mathbf{w}^T \phi(x_i) + b\big) \le \varepsilon + \xi_i$ for $i = 1, \dots, m$
$\big(\mathbf{w}^T \phi(x_i) + b\big) - y_i \le \varepsilon + \xi_i^*$ for $i = 1, \dots, m$
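The objective minimized under these constraints (together with $\xi_i, \xi_i^* \ge 0$) is the usual soft-margin one:
$$
\min_{\mathbf{w},\, b,\, \xi,\, \xi^*}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{m} \big(\xi_i + \xi_i^*\big).
$$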
[Figure: SVR fit showing the hyperplane, the decision boundaries of the ε-tube, and the support vectors.]
In the case of NON-LINEAR SVR:
• The kernel functions transform the data into a higher-dimensional feature space in which a
linear regression function can be fitted (a brief sketch follows).
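A brief sketch of kernelized SVR with scikit-learn (the noisy 1-D sine data below is synthetic and only for illustration):
import numpy as np
from sklearn.svm import SVR

# Synthetic noisy sine data
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 60)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(60)

# The RBF kernel maps the inputs implicitly to a high-dimensional space
svr = SVR(kernel='rbf', C=10.0, epsilon=0.1, gamma=0.5)
svr.fit(X, y)
print(svr.predict([[1.0], [2.5], [4.0]]))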
Primal formulation
You can evaluate the $\alpha$ values using an iterative mechanism, which is applicable for small data sets (please refer to the attached code).
Dual formulation (derived from the Lagrangian of the primal problem), used for large datasets (not important):
$\max_{\alpha_i,\, \alpha_i^*} \left[ \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} K(x_i, x_j)\, (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \right]$
subject to $\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0$, $\alpha_i \ge 0$, $\alpha_i^* \ge 0$.
SVR from scratch (not important!)
import numpy as np

# Sample dataset: each row is [x1, x2, y]
data = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]])
X = data[:, 0:2]   # Input features (x1, x2)
Y = data[:, 2]     # Target variable (y)

# Step 1: Choose a kernel function (linear kernel)
def kernel_function(x, y):
    return np.dot(x, y)   # Linear kernel

# Step 2: Compute the kernel matrix K
n = X.shape[0]
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = kernel_function(X[i, :], X[j, :])
print('Kernel Matrix K:')
print(K)
SVR from scratch (not important!)
# Step 3: Set up parameters for SVR
C = 1          # Regularization parameter
epsilon = 0.1  # Epsilon in the epsilon-insensitive loss function

# Placeholder values for the Lagrange multipliers alpha and alpha*
alpha = np.array([1, 1, 1, 1])       # Placeholder values for alpha
alpha_star = np.array([0, 0, 0, 0])  # Placeholder values for alpha*

# Support vectors are the points with non-zero (alpha - alpha*)
support_vector_indices = np.where((alpha - alpha_star) != 0)[0]

# Calculate the bias term b (here we use a simple average for demonstration)
b_values = []
for i in support_vector_indices:
    b_value = Y[i] - np.sum((alpha - alpha_star) * K[i, :])
    b_values.append(b_value)
b = np.mean(b_values)   # Average of the computed b values over the support vectors
SVR from scratch (not important!)
# Step 4: Compute the regression function f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b
def regression_function(x):
    return np.sum((alpha - alpha_star) * kernel_function(X, x)) + b

# Step 5: Make predictions with the regression function
new_data_point = np.array([2, 3])   # New input data point (x1=2, x2=3)
predicted_y = regression_function(new_data_point)
print(f'Predicted y for input ({new_data_point[0]:.1f}, '
      f'{new_data_point[1]:.1f}) is: {predicted_y:.2f}')