Supervised Learning
1. Introduction
Supervised Learning is a machine learning approach where models are trained using
labeled data. It involves mapping input features to known output labels, allowing the model
to learn from examples and make predictions.
2. How Supervised Learning Works
Step 1: Collect a dataset containing input-output pairs (features and labels).
Step 2: Split the dataset into training and testing sets.
Step 3: Train a model using the training dataset.
Step 4: Evaluate the model's accuracy using the testing dataset.
Step 5: Deploy the model for real-world predictions.
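The workflow above can be sketched in a few lines with scikit-learn. This is only an illustrative example: the synthetic regression dataset and the LinearRegression model are assumptions made here, not part of any specific application.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Step 1: a dataset of input-output pairs (synthetic, for illustration)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)

# Step 2: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: train a model on the training set
model = LinearRegression()
model.fit(X_train, y_train)

# Step 4: evaluate on the testing set
print("R^2 on test data:", r2_score(y_test, model.predict(X_test)))

# Step 5: use the trained model for new predictions
print("Prediction for a new sample:", model.predict(X_test[:1]))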
3. Types of Supervised Learning
Classification
Classification is used when the output variable is categorical. The model predicts discrete
labels.
Examples:
Email spam detection (Spam or Not Spam)
Disease diagnosis (Positive or Negative)
Sentiment analysis (Positive, Neutral, or Negative)
Regression
Regression is used when the output variable is continuous. The model predicts numerical
values.
Examples:
House price prediction based on size and location
Stock market price prediction
Sales forecasting for businesses
4. Common Supervised Learning Algorithms
Linear Regression – Used for predicting continuous values.
Logistic Regression – Used for binary classification problems.
Decision Trees – Used for both classification and regression tasks.
Random Forest – An ensemble method improving decision tree predictions.
Support Vector Machines (SVM) – Effective for classification tasks.
Neural Networks – Used for complex pattern recognition and deep learning applications.
5. Applications of Supervised Learning
Medical Diagnosis – Detecting diseases based on patient data.
Fraud Detection – Identifying fraudulent transactions in banking systems.
Self-Driving Cars – Recognizing objects and making navigation decisions.
Speech Recognition – Converting spoken words into text.
Recommendation Systems – Suggesting products based on user preferences.
6. Challenges of Supervised Learning
Requires Labeled Data – Collecting and labeling data can be time-consuming and expensive.
Overfitting – Models may perform well on training data but poorly on unseen data.
Computational Cost – Training complex models requires high computational power.
Classification and Its Use Cases
1. Introduction
Classification is a supervised learning technique used in machine learning where the goal is
to categorize data into predefined classes or labels. It is widely used in real-world
applications where decisions need to be made based on input features.
2. How Classification Works
Classification algorithms learn from labeled training data to identify patterns. Once trained,
the model can predict the class of new, unseen data.
Steps in classification:
Data Collection – Gather labeled data for training.
Data Preprocessing – Clean and transform data for analysis.
Model Training – Train a classification algorithm on labeled data.
Evaluation – Test model accuracy using a validation dataset.
Prediction – Use the model to classify new data instances.
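As a rough illustration of these steps, the following sketch (assuming scikit-learn; the Iris dataset and LogisticRegression are arbitrary choices) collects labeled data, preprocesses it, trains a classifier, evaluates it, and predicts on unseen instances:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data collection: a labeled dataset
X, y = load_iris(return_X_y=True)

# Data preprocessing: split and scale the features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model training
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)

# Evaluation
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Prediction on a new, unseen instance
print("Predicted class:", clf.predict(X_test[:1]))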
3. Types of Classification
Binary Classification
Binary classification deals with two possible outcomes. It is commonly used in decision-making scenarios where an instance must belong to one of two categories.
Examples:
Email classification: Spam or Not Spam
Loan approval: Approved or Rejected
Disease detection: Positive or Negative
Multi-Class Classification
Multi-class classification deals with more than two categories. Each instance is assigned to
one of multiple predefined classes.
Examples:
Handwritten digit recognition (0-9)
Sentiment analysis (Positive, Neutral, Negative)
Object recognition in images (Car, Bike, Truck, etc.)
Multi-Label Classification
Multi-label classification assigns multiple labels to a single instance. Unlike traditional
classification, an instance can belong to more than one category.
Examples:
Tagging multiple topics in news articles
Identifying multiple objects in an image
Predicting multiple diseases from a patient's symptoms
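A minimal multi-label sketch, assuming scikit-learn and a synthetic dataset; MultiOutputClassifier simply wraps one binary classifier per label, so each instance can receive several labels at once:
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic multi-label data: each sample can carry any subset of 4 labels
X, Y = make_multilabel_classification(n_samples=300, n_features=10, n_classes=4, random_state=42)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# One binary classifier per label
clf = MultiOutputClassifier(LogisticRegression(max_iter=500))
clf.fit(X_train, Y_train)

# Each prediction is a vector of 0/1 flags, one per label
print(clf.predict(X_test[:3]))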
4. Common Classification Algorithms
Logistic Regression – Used for binary classification.
Decision Tree – Splits data based on feature conditions.
Random Forest – An ensemble method combining multiple decision trees.
Support Vector Machines (SVM) – Finds the optimal boundary between classes.
Naïve Bayes – Based on probability distribution and Bayes' theorem.
K-Nearest Neighbors (KNN) – Classifies based on the nearest data points.
Neural Networks – Used for deep learning and complex patterns.
5. Use Cases of Classification
Healthcare
Disease detection and medical diagnosis (e.g., cancer detection).
Predicting patient readmission risks.
Finance
Fraud detection in banking transactions.
Loan approval decisions based on applicant data.
E-commerce
Product recommendation based on user behavior.
Customer sentiment analysis from reviews.
Social Media
Spam detection in comments and messages.
Automated content moderation and filtering.
Autonomous Vehicles
Object detection for road safety.
Lane detection for self-driving cars.
Decision Tree
1. Introduction
A Decision Tree is a supervised learning algorithm used for classification and regression
tasks. It models decisions based on feature values, following a tree-like structure where
each node represents a decision point and branches lead to possible outcomes.
2. How Decision Trees Work
Decision trees split the dataset into smaller subsets based on feature values. The process
continues until a stopping criterion is met (e.g., maximum depth or pure classification at leaf
nodes).
Steps in Decision Tree construction:
Choose the Best Split – Identify the feature that best splits the data.
Split the Data – Divide the dataset based on feature thresholds.
Repeat for Subsets – Continue splitting until a stopping condition is met.
Assign Output Labels – The final leaf nodes represent classification results.
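A short sketch of these steps using scikit-learn's DecisionTreeClassifier; the Iris dataset, the Gini criterion, and the max_depth=3 stopping condition are illustrative choices only:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# criterion selects the splitting measure; max_depth is the stopping condition
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Inspect the chosen splits and the labels assigned at the leaf nodes
print(export_text(tree, feature_names=iris.feature_names))
print("Test accuracy:", tree.score(X_test, y_test))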
3. Components of a Decision Tree
Root Node
The starting point of the tree that represents the entire dataset.
Decision Nodes
Intermediate nodes where decisions are made based on feature conditions.
Leaf Nodes
The final nodes that represent predicted output labels.
4. Common Algorithms for Decision Trees
ID3 (Iterative Dichotomiser 3) – Uses entropy and information gain to decide splits.
CART (Classification and Regression Trees) – Uses Gini Index for classification.
CHAID (Chi-square Automatic Interaction Detection) – Uses chi-square tests to split data.
5. Advantages of Decision Trees
Easy to understand and interpret.
Requires little data preprocessing.
Can handle both numerical and categorical data.
Works well for small to medium-sized datasets.
6. Disadvantages of Decision Trees
Prone to overfitting.
Less effective on large datasets without pruning.
Sensitive to small variations in data.
7. Use Cases of Decision Trees
Healthcare
Disease diagnosis and prediction.
Finance
Loan approval and credit risk assessment.
Marketing
Customer segmentation for targeted advertising.
Decision Tree Induction Algorithm:
A Decision Tree is a supervised learning algorithm widely used for classification and
regression tasks. It represents data in a tree structure where internal nodes represent
decision rules based on attributes, branches represent outcomes, and leaf nodes represent
class labels or numerical predictions.
Steps of Decision Tree Induction Algorithm
1. Input:
• A training dataset consisting of multiple attributes and their corresponding class labels.
• A list of attributes { A1, A2, ..., An }.
• A splitting criterion (e.g., Information Gain, Gain Ratio, Gini Index) to determine the best
attribute for partitioning.
2. Check for Base Cases:
• If all instances belong to the same class, return a leaf node with that class label.
• If no attributes are left for splitting, return a leaf node with the majority class label from
the dataset.
• If the dataset is empty, return a default class label (usually the most frequent class in the
original dataset).
3. Select the Best Splitting Attribute:
• Compute a measure of impurity reduction for each attribute using:
- Entropy & Information Gain (ID3 Algorithm)
- Gain Ratio (C4.5 Algorithm)
- Gini Index (CART Algorithm)
• Choose the attribute with the highest impurity reduction as the best splitting criterion.
4. Split the Dataset Based on the Chosen Attribute:
• Partition the dataset into subsets based on unique values of the selected attribute.
• Create a decision node representing the selected attribute.
• Assign each subset to one of the branches of the decision tree.
5. Recursively Build the Tree:
• Apply the same process recursively to each subset until a stopping condition is met:
- All instances in a subset belong to the same class (pure node).
- No further attributes are available for splitting.
- A predefined depth limit is reached to prevent overfitting.
6. Output:
• A fully constructed decision tree that can be used for classification or regression.
• The tree can classify new instances by traversing from the root to the appropriate leaf
node based on attribute values.
Selecting the Best Splitting Attribute
A good split should maximize class separation. Different decision tree algorithms use
different splitting criteria to achieve this, including:
1. Information Gain (ID3 Algorithm): Measures the reduction in entropy after a split.
2. Gain Ratio (C4.5 Algorithm): Adjusts Information Gain by considering the number of
possible splits.
3. Gini Index (CART Algorithm): Measures impurity based on probability distributions.
Entropy & Information Gain (ID3 Algorithm)
The ID3 (Iterative Dichotomiser 3) Algorithm is a decision tree algorithm that uses
Information Gain to determine the best attribute for splitting at each step. It is based on the
concept of Entropy, which measures the impurity in a dataset.
1. Entropy
Entropy measures the impurity or disorder in a dataset. A lower entropy value indicates a
more homogeneous dataset, while a higher entropy value signifies greater disorder.
Formula for Entropy:
Entropy(S) = − Σ pᵢ log₂ pᵢ
Where:
• S is the dataset
• pᵢ is the proportion of class i in S
• The summation runs over all possible classes in the dataset
Example Calculation:
Consider a dataset with 10 instances:
• 6 belong to Class A
• 4 belong to Class B
Entropy(S) = − [(6/10) log₂ (6/10) + (4/10) log₂ (4/10)]
≈ 0.971 (indicating a relatively high impurity)
2. Information Gain
Information Gain measures how much entropy is reduced after splitting the dataset on a
given attribute. The attribute with the highest Information Gain is chosen for splitting.
Formula for Information Gain:
IG(S, A) = Entropy(S) − Σ (|Sᵥ| / |S|) × Entropy(Sᵥ)
Where:
• S is the original dataset
• A is the attribute used for splitting
• Sᵥ represents the subsets created by splitting on attribute A
• |Sᵥ| / |S| is the proportion of instances in subset Sᵥ
Example Calculation:
If we split on an attribute A that divides S into two subsets:
• Subset 1: 5 instances, Entropy = 0.722
• Subset 2: 5 instances, Entropy = 0.971
IG(S, A) = 0.971 − [(5/10) × 0.722 + (5/10) × 0.971]
≈ 0.124 (indicating that the attribute reduces entropy by about 0.124)
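The entropy and Information Gain values above can be reproduced with a small Python sketch; the entropy helper below is written just for this example:
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Entropy of the parent dataset: 6 instances of Class A, 4 of Class B
parent = entropy([6, 4])                      # ~0.971

# The split from the example: two subsets of 5 instances with the given entropies
subset_entropies = [0.722, 0.971]
weights = [5 / 10, 5 / 10]
info_gain = parent - sum(w * e for w, e in zip(weights, subset_entropies))

print(round(parent, 3), round(info_gain, 3))  # 0.971 0.124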
3. ID3 Algorithm Steps
1. Calculate Entropy of the dataset.
2. For each attribute, calculate Information Gain.
3. Select the attribute with the highest Information Gain for splitting.
4. Recursively apply steps 1–3 on each subset until a stopping condition is met (pure nodes
or no remaining attributes).
5. Build the final decision tree.
4. Conclusion
• Entropy measures disorder in a dataset.
• Information Gain selects the best attribute for decision tree splitting.
• The ID3 Algorithm builds a tree by choosing attributes that maximize Information Gain at
each step.
Gain Ratio (C4.5 Algorithm)
The C4.5 Algorithm is an improved version of the ID3 Algorithm. It uses Gain Ratio instead
of Information Gain to address the bias towards attributes with many distinct values. Gain
Ratio normalizes Information Gain by considering the intrinsic information of the attribute.
1. Gain Ratio
Gain Ratio is an improved measure that accounts for the number of branches an attribute
creates. It adjusts Information Gain to prevent bias toward attributes with more distinct
values.
Formula for Gain Ratio:
GainRatio(A) = Information Gain(A) / Split Information(A)
Where:
• A is the attribute being evaluated.
• Information Gain(A) is the reduction in entropy after splitting on A.
• Split Information(A) measures how broadly the attribute splits the dataset.
2. Split Information
Split Information quantifies how evenly an attribute splits the data. It is calculated as
follows:
Formula for Split Information:
SplitInfo(A) = − Σ (|Sᵥ| / |S|) log₂ (|Sᵥ| / |S|)
Where:
• |Sᵥ| is the size of each subset after splitting.
• |S| is the total number of instances in the dataset.
• The summation runs over all possible values of attribute A.
3. Example Calculation
Consider splitting a dataset on Attribute A:
• Information Gain(A) = 0.15
• Split Information(A) = 0.8
Gain Ratio(A) = 0.15 / 0.8 = 0.1875
A higher Gain Ratio indicates a better attribute for splitting.
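A small sketch of the same idea in Python; the attribute that splits 10 instances into subsets of sizes 4, 3 and 3 with an Information Gain of 0.15 is hypothetical and is only meant to show how Split Information and Gain Ratio are computed:
import math

def split_info(subset_sizes):
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s > 0)

def gain_ratio(information_gain, subset_sizes):
    return information_gain / split_info(subset_sizes)

# Hypothetical attribute splitting 10 instances into subsets of 4, 3 and 3,
# with an assumed Information Gain of 0.15
sizes = [4, 3, 3]
print(round(split_info(sizes), 3))          # ~1.571: how evenly the attribute splits the data
print(round(gain_ratio(0.15, sizes), 3))    # ~0.095: the resulting Gain Ratio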
4. Advantages of Gain Ratio
• Prevents bias toward attributes with many distinct values.
• Normalizes Information Gain to ensure fair attribute selection.
• Leads to better decision trees with meaningful splits.
5. Conclusion
• Gain Ratio improves upon Information Gain by considering Split Information.
• It helps in selecting attributes more effectively.
• Used in C4.5 to build efficient decision trees.
Gini Index in the CART Algorithm
Introduction
The Gini Index (or Gini Impurity) is a metric used in the Classification and Regression Tree
(CART) algorithm to measure the impurity of a dataset. It determines how often a randomly
chosen element would be incorrectly classified if it were randomly labeled based on the
distribution of class labels.
Definition of Gini Index
The Gini Index is calculated using the formula:
Gini(D) = 1 − Σ (pᵢ)²
where:
- pᵢ is the probability of a data point belonging to class i.
- The summation runs over all C classes in the dataset.
A lower Gini Index indicates a purer node, meaning that most instances belong to a single
class.
Gini Index in Decision Trees
The CART algorithm uses the Gini Index to split nodes by selecting the feature that
minimizes the Gini Impurity:
1. Calculate the Gini Impurity for the dataset before splitting.
2. Compute the Gini Impurity for each possible split.
3. Select the split with the lowest weighted Gini Impurity.
4. Repeat until a stopping criterion is met (e.g., pure nodes, max depth, or min samples per
leaf).
Example Calculation
Dataset:
Instance   Feature   Class
1          A         Yes
2          B         Yes
3          A         No
4          B         No
5          A         Yes
Step 1: Compute Gini Before Splitting
Classes: Yes (3), No (2)
Gini(D) = 1 - (3/5)^2 - (2/5)^2 = 1 - 0.36 - 0.16 = 0.48
Step 2: Compute Gini for Splitting on Feature
If we split on Feature A vs. B:
- Left Node (A): Yes (2), No (1) → Gini = 1 − (2/3)² − (1/3)² ≈ 0.444
- Right Node (B): Yes (1), No (1) → Gini = 0.50
- Weighted Gini = (3/5 × 0.444) + (2/5 × 0.50) ≈ 0.467
Since 0.467 < 0.48, splitting reduces impurity, so the split is beneficial.
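The numbers in this example can be checked with a short Python sketch; the gini helper below is written just for this calculation:
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Before splitting: 3 "Yes", 2 "No"
parent = gini([3, 2])                        # 0.48

# Split on the feature: A -> (2 Yes, 1 No), B -> (1 Yes, 1 No)
left, right = gini([2, 1]), gini([1, 1])
weighted = (3 / 5) * left + (2 / 5) * right

print(round(parent, 3), round(left, 3), round(right, 3), round(weighted, 3))  # 0.48 0.444 0.5 0.467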
Advantages of Gini Index
- Faster to compute than other impurity measures like entropy.
- Works well with binary splits, as used in CART.
- Provides a clear impurity measure for node splitting.
Disadvantages of Gini Index
- Biased toward multi-class splits, as it tends to favor attributes with more distinct values.
- May require pruning to prevent overfitting.
Random Forest Algorithm
Introduction
Random Forest is a powerful ensemble learning algorithm used for both classification and
regression tasks. It builds multiple decision trees during training and merges their outputs
to improve accuracy and reduce overfitting.
How Random Forest Works
1. Bootstrap Sampling: Random subsets of the training data are selected with replacement
(bagging).
2. Feature Randomness: A subset of features is randomly selected for each decision tree to
introduce diversity.
3. Decision Trees Construction: Each tree is grown independently and predicts an outcome.
4. Aggregation:
- Classification: The majority vote from all trees is the final prediction.
- Regression: The average of all tree predictions is the final result.
Example of Random Forest in Classification
If five decision trees classify a sample as follows:
- Tree 1: Yes
- Tree 2: No
- Tree 3: Yes
- Tree 4: Yes
- Tree 5: No
The final prediction is 'Yes' (majority vote).
Key Parameters in Random Forest
- n_estimators: Number of trees in the forest.
- max_depth: Maximum depth of each tree.
- min_samples_split: Minimum samples needed to split a node.
- max_features: Maximum number of features used for splitting.
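A brief sketch showing these parameters in scikit-learn's RandomForestClassifier; the Iris dataset and the specific parameter values are illustrative, not recommended settings:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The key parameters listed above, with example values
forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=5,           # maximum depth of each tree
    min_samples_split=2,   # minimum samples needed to split a node
    max_features="sqrt",   # features considered at each split
    random_state=42,
)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))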
Advantages of Random Forest
- High accuracy due to ensemble learning.
- Handles missing values well.
- Prevents overfitting by averaging multiple trees.
- Works well with large datasets.
Disadvantages
- Slower compared to individual decision trees.
- Less interpretability, as multiple trees make it complex.
- Consumes more memory due to multiple trees.
Applications of Random Forest
- Medical Diagnosis (e.g., detecting diseases from patient data).
- Finance (e.g., fraud detection).
- E-commerce (e.g., customer recommendation systems).
- Image Classification (e.g., object recognition).
Conclusion
Random Forest is a highly effective ensemble method that enhances prediction accuracy
and prevents overfitting. It is widely used in various fields due to its robustness and
versatility.
Confusion Matrix
Introduction
A Confusion Matrix is a performance measurement tool used in classification problems. It
helps evaluate the accuracy of a model by comparing actual versus predicted classifications.
Structure of a Confusion Matrix
A typical confusion matrix for a binary classification problem looks like this:
Actual \ Predicted     Predicted Positive (P)    Predicted Negative (N)
Actual Positive (P)    True Positive (TP)        False Negative (FN)
Actual Negative (N)    False Positive (FP)       True Negative (TN)
Key Metrics Derived from Confusion Matrix
1. Accuracy: Measures the overall correctness of the model.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
2. Precision (Positive Predictive Value): Measures how many predicted positives were
actually correct.
Precision = TP / (TP + FP)
3. Recall (Sensitivity or True Positive Rate): Measures how many actual positives were
correctly identified.
Recall = TP / (TP + FN)
4. F1 Score: The harmonic mean of Precision and Recall.
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
5. Specificity (True Negative Rate): Measures how well the model identifies negative
instances.
Specificity = TN / (TN + FP)
Example Calculation
Suppose we have a binary classification model with the following confusion matrix:
Actual \ Predicted     Predicted Positive (P)    Predicted Negative (N)
Actual Positive (P)    50                        10
Actual Negative (N)    5                         35
From this:
- Accuracy = (50 + 35) / (50 + 10 + 5 + 35) = 85%
- Precision = 50 / (50 + 5) = 0.91
- Recall = 50 / (50 + 10) = 0.83
- F1 Score = 0.87
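These metrics can be reproduced with a few lines of Python from the counts above (in practice, scikit-learn's confusion_matrix and classification_report compute them directly from predictions):
# Counts taken from the example confusion matrix above
TP, FN, FP, TN = 50, 10, 5, 35

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)

print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, "
      f"Recall: {recall:.2f}, F1: {f1:.2f}, Specificity: {specificity:.2f}")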
Advantages of Confusion Matrix
- Provides a detailed analysis of model performance.
- Helps identify errors such as false positives and false negatives.
- Works well for imbalanced datasets.
Disadvantages
- Can be difficult to interpret for multi-class problems.
- Doesn't account for cost-sensitive classification.
Conclusion
The Confusion Matrix is a powerful tool for understanding the strengths and weaknesses of
a classification model. It helps determine how well a model predicts different classes and
provides key performance metrics.
Naïve Bayes Algorithm
Introduction
Naïve Bayes is a probabilistic classification algorithm based on Bayes' Theorem. It assumes
that features are independent (hence 'naïve') and is particularly useful for text
classification, spam filtering, and sentiment analysis.
Bayes' Theorem
Bayes' Theorem describes the probability of an event based on prior knowledge of
conditions that might be related to the event. It is given by:
P(A | B) = (P(B | A) × P(A)) / P(B)
Where:
P(A | B) = Posterior Probability (probability of class A given feature B)
P(B | A) = Likelihood (probability of feature B given class A)
P(A) = Prior Probability (probability of class A occurring)
P(B) = Evidence (probability of feature B occurring)
Assumption of Naïve Bayes
The key assumption of Naïve Bayes is conditional independence, meaning that the presence
of one feature does not affect the probability of another feature given the class.
Types of Naïve Bayes Classifiers
1. Gaussian Naïve Bayes (for continuous data, assumes normal distribution).
2. Multinomial Naïve Bayes (for text classification, works with word frequency counts).
3. Bernoulli Naïve Bayes (for binary features, such as spam detection).
Example Calculation
Suppose we want to classify an email as Spam (S) or Not Spam (¬S) based on the presence
of the word 'Offer'.
Given:
P(S) = 0.3 (30% of emails are spam)
P(¬S) = 0.7 (70% of emails are not spam)
P(Offer | S) = 0.8 (80% of spam emails contain 'Offer')
P(Offer | ¬S) = 0.2 (20% of non-spam emails contain 'Offer')
We calculate:
P(S | Offer) = (P(Offer | S) × P(S)) / P(Offer)
P(S | Offer) = (0.8 × 0.3) / ((0.8 × 0.3) + (0.2 × 0.7))
P(S | Offer) = 0.24 / 0.38 = 0.63
Since P(S | Offer) = 0.63 is greater than 0.5, we classify the email as Spam.
Advantages of Naïve Bayes
Fast and efficient for large datasets.
Performs well with high-dimensional data.
Works well for text classification problems.
Disadvantages
Assumes independence, which may not always hold true.
Performs poorly when features are highly correlated.
Less accurate for complex datasets compared to deep learning models.
Applications
Spam Detection (e.g., Gmail spam filters).
Sentiment Analysis (e.g., analyzing customer reviews).
Medical Diagnosis (e.g., predicting diseases from symptoms).
Document Categorization (e.g., classifying news articles).
How Naïve Bayes Works
Introduction
Naïve Bayes is a classification algorithm based on Bayes' Theorem, which calculates the
probability of a class given a set of features. It assumes that all features are independent of
each other (hence 'naïve').
1. Understanding Bayes' Theorem
Bayes’ Theorem is given by:
P(A | B) = (P(B | A) × P(A)) / P(B)
Where:
P(A | B) → Probability of class A (e.g., spam) given feature B (e.g., the word 'offer').
P(B | A) → Probability of feature B occurring in class A.
P(A) → Prior probability of class A (how common spam emails are).
P(B) → Probability of feature B occurring across all data.
2. Applying Bayes’ Theorem to Classification
To classify a new data point (e.g., a new email), Naïve Bayes calculates the probability of
each possible class and picks the one with the highest probability.
For multiple features (x₁, x₂, ..., xₙ ), the probability of a class C is:
P(C | x₁, x₂, ..., xₙ ) ∝ P(x₁ | C) × P(x₂ | C) × ... × P(xₙ | C) × P(C)
3. Example: Spam Email Classification
Let’s classify an email as Spam or Not Spam based on whether it contains the words 'offer'
and 'win'.
Given probabilities based on past emails:
P(Spam) = 0.3, P(Not Spam) = 0.7
P(Offer | Spam) = 0.8, P(Win | Spam) = 0.7
P(Offer | Not Spam) = 0.2, P(Win | Not Spam) = 0.1
For Spam:
P(Spam | Offer, Win) ∝ P(Offer | Spam) × P(Win | Spam) × P(Spam)
= 0.8 × 0.7 × 0.3 = 0.168
For Not Spam:
P(Not Spam | Offer, Win) ∝ P(Offer | Not Spam) × P(Win | Not Spam) × P(Not Spam)
= 0.2 × 0.1 × 0.7 = 0.014
Since 0.168 > 0.014, the email is classified as Spam.
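The arithmetic of this example can be checked with a couple of lines of Python:
# Unnormalized class scores for an email containing both "offer" and "win",
# using the probabilities listed above
p_spam     = 0.8 * 0.7 * 0.3   # P(Offer|Spam) * P(Win|Spam) * P(Spam)       = 0.168
p_not_spam = 0.2 * 0.1 * 0.7   # P(Offer|Not Spam) * P(Win|Not Spam) * P(Not Spam) = 0.014

prediction = "Spam" if p_spam > p_not_spam else "Not Spam"
print(p_spam, p_not_spam, prediction)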
4. Key Assumption: Feature Independence
Naïve Bayes assumes that features (e.g., words in an email) are independent of each other.
This is often not true in real life, but the algorithm still works well for many applications,
especially in text classification.
5. Types of Naïve Bayes Models
1. Gaussian Naïve Bayes → Assumes continuous features follow a normal distribution.
2. Multinomial Naïve Bayes → Used for text classification (word frequency counts).
3. Bernoulli Naïve Bayes → Used for binary data (word presence/absence in text).
6. Applications of Naïve Bayes
Spam Detection (e.g., Gmail filters).
Sentiment Analysis (e.g., classifying positive vs. negative reviews).
Medical Diagnosis (e.g., predicting diseases based on symptoms).
Text Classification (e.g., news categorization).
Implementing Naïve Bayes Classifier in Python
Introduction
This document provides a Python implementation of the Naïve Bayes classifier using the Iris
dataset. The classifier is trained and tested using scikit-learn.
Code Implementation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize the Naïve Bayes classifier
nb_classifier = GaussianNB()
# Train the model
nb_classifier.fit(X_train, y_train)
# Make predictions
y_pred = nb_classifier.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=data.target_names)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", report)
Explanation
1. Load Data: The Iris dataset is used, which contains different flower species.
2. Train-Test Split: The dataset is split into 70% training and 30% testing.
3. Model Training: The Naïve Bayes classifier (`GaussianNB`) is used.
4. Predictions & Evaluation: The model is tested, and its accuracy is measured.
Support Vector Machine (SVM)
Introduction
Support Vector Machine (SVM) is a supervised learning algorithm widely used for
classification and regression tasks. SVM aims to find the optimal hyperplane that best
separates different classes in a dataset. It is particularly effective for high-dimensional data
and is robust against overfitting.
How SVM Works
SVM works by mapping data into a high-dimensional space and finding a hyperplane that
separates the data points into distinct categories. It does this by maximizing the margin
between the closest points from each class, called support vectors.
Hyperplane and Support Vectors
A hyperplane is a decision boundary that separates different classes. In a 2D space, it's a
straight line, while in higher dimensions, it becomes a plane or a more complex shape. The
support vectors are the data points that are closest to the hyperplane, and they determine
its position.
Soft Margin vs. Hard Margin SVM
1. Hard Margin SVM: Used when data is perfectly separable. It requires that all points be
correctly classified without any errors.
2. Soft Margin SVM: Used when data has some overlap. It allows some misclassifications by introducing a penalty parameter C.
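A short sketch of the effect of the penalty parameter C, assuming scikit-learn; the synthetic overlapping dataset and the specific C values are illustrative only:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Overlapping classes, so a soft margin is required
X, y = make_classification(n_samples=300, n_features=5, flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# A large C penalizes misclassification heavily (closer to a hard margin);
# a small C tolerates more misclassification (softer margin).
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    print(f"C={C}: test accuracy = {clf.score(X_test, y_test):.2f}")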
Applications of SVM
SVM is widely used in various fields, including:
- Text Classification (e.g., spam detection, sentiment analysis)
- Image Recognition (e.g., facial recognition, object detection)
- Bioinformatics (e.g., protein classification, gene expression analysis)
Grid Search vs Random Search
Introduction
Hyperparameter tuning is an essential step in machine learning to optimize model
performance. Two popular techniques for this are Grid Search and Random Search. This
document compares both methods and provides an example implementation in Python.
1. Grid Search
Grid Search is an exhaustive search technique where all possible combinations of
hyperparameters are tried and evaluated. It is best suited for small search spaces where
computational cost is manageable.
Advantages of Grid Search
✅ Systematic & Exhaustive: It checks all possible combinations, ensuring the best result.
✅ Effective for Small Datasets: Works well when the number of hyperparameters is limited.
Disadvantages of Grid Search
❌ Computationally Expensive: For large search spaces, training time increases significantly.
❌ Inefficient for Large Datasets: Can be impractical due to high time complexity.
2. Random Search
Random Search selects hyperparameter combinations randomly instead of trying all
possibilities. This method is efficient for large search spaces and can quickly find good
parameter combinations.
Advantages of Random Search
✅ Faster than Grid Search: It requires fewer computations.
✅ Efficient for High-Dimensional Search Spaces: It can find near-optimal solutions quickly.
Disadvantages of Random Search
❌ Less Systematic: Since it is random, it might miss the best combination.
❌ Inconsistent Performance: Different runs may give different results.
3. Comparison Table: Grid Search vs Random Search
Feature                              Grid Search            Random Search
Approach                             Exhaustive search      Random selection
Time Complexity                      High (slow)            Lower (faster)
Computational Cost                   Expensive              Less expensive
Efficiency in Large Search Spaces    Inefficient            More efficient
Best for Small Datasets?             ✅ Yes                 ✅ Yes
Best for Large Datasets?             ❌ No                  ✅ Yes
Guaranteed Best Result?              ✅ Yes (within the grid)  ❌ No
4. Example: Implementing Grid Search vs Random Search in Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Load the dataset
data = load_iris()
X, y = data.data, data.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Define hyperparameter grid for Grid Search
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
# Grid Search
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best Parameters (Grid Search): {grid_search.best_params_}")
# Define hyperparameter distributions for Random Search
param_dist = {'C': [0.1, 1, 10, 100], 'kernel': ['linear', 'rbf']}
# Random Search
random_search = RandomizedSearchCV(SVC(), param_distributions=param_dist,
                                   n_iter=5, cv=5, random_state=42)
random_search.fit(X_train, y_train)
print(f"Best Parameters (Random Search): {random_search.best_params_}")
Implementation of Support Vector Machine (SVM) for Classification
Introduction
Support Vector Machine (SVM) is a supervised learning algorithm widely used for
classification tasks. It works by finding the optimal hyperplane that separates different
classes in the dataset. This document provides a step-by-step implementation of SVM for
classification using Python.
1. Dataset and Preprocessing
In this implementation, we use the Iris dataset, which is a well-known dataset for
classification problems. We split the data into training and testing sets and scale the
features for better model performance.
2. Python Implementation of SVM for Classification
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load the Iris dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target
# Split dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train an SVM classifier with an RBF kernel
svm_classifier = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_classifier.fit(X_train, y_train)
# Make predictions
y_pred = svm_classifier.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=iris.target_names)
# Print results
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", report)
# Plot decision boundary (only meaningful in 2D, so a separate SVM is fitted
# on the first two standardized features just for this visualization)
def plot_decision_boundary(X, y, model):
    h = .02  # Step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', cmap=plt.cm.Paired)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('SVM Decision Boundary')
    plt.show()

# Fit a 2-feature SVM and plot its decision boundary
svm_2d = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_2d.fit(X_train[:, :2], y_train)
plot_decision_boundary(X_train[:, :2], y_train, svm_2d)
3. Explanation of the Code
1. Load Data: We use the Iris dataset for classification.
2. Train-Test Split: The dataset is split into 70% training and 30% testing.
3. Feature Scaling: Standardization is applied to improve model performance.
4. Train SVM Model: An SVM classifier with an RBF kernel is trained.
5. Predictions & Evaluation: Accuracy, confusion matrix, and classification report are
generated.
6. Decision Boundary Visualization: A separate SVM is fitted on the first two (standardized) features and its decision boundary is plotted.
Support Vector Machine
● Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression, and outlier detection.
● SVM is a non-probabilistic linear classifier: it directly indicates which group a data point belongs to, without using any probability calculation.
Support Vector Machine (SVM) is a supervised learning algorithm used for
classification and regression tasks.
1. It finds an optimal hyperplane that best separates different classes
in a dataset. SVM works by finding a decision boundary (hyperplane)
that maximizes the margin between different classes.
2. The margin is the distance between the closest points of the
classes (support vectors) and the hyperplane.
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points.
If the data is linearly separable, SVM finds a straight line (in
2D) or a hyperplane (in higher dimensions).
If the data is non-linearly separable, SVM uses the kernel trick to
transform data into a higher-dimensional space where it becomes
separable.
The best decision boundary is called a hyperplane.
● SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
● Support Vectors – Data points that are closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.
● The distance between the support vectors and the hyperplane is called the margin. The goal of SVM is to maximize this margin.
● Hyperplanes can be considered decision boundaries that classify data
points into their respective classes in a multi-dimensional space.
● The optimal hyperplane differentiates the two classes in the best possible manner.
● A hyperplane is a generalization of a plane:
– in two dimensions, it is a line.
– in three dimensions, it is a plane.
– in more dimensions, it is called a hyperplane.
The dimension of the hyperplane depends on the number of features in the dataset: with 2 features the hyperplane is a straight line, and with 3 features it is a 2-dimensional plane.
A hyperplane is an (n−1)-dimensional plane that optimally divides data of n dimensions.
● Hyperplanes close to the data points have smaller margins.
● The farther a hyperplane is from the data points, the larger its margin will be.
● The optimal hyperplane is the one with the biggest margin, because a larger margin ensures that slight deviations in the data points do not affect the outcome of the model.
● SVM is known as a large margin classifier.
● The distance between the line and the closest data points is referred to as the margin.
● The best or optimal line that can separate the two classes is the line that
has the largest margin. This is called the large-margin hyperplane.
● The margin is calculated as the perpendicular distance from the line to only the closest points.
● The objective of the SVM is to find the optimal separating hyperplane
that maximizes the margin of the training data.
Types of SVM
Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can be classified into two classes
by using a single straight line, then such data is termed as linearly separable data, and classifier is used called as Linear
SVM classifier.
Non-linear SVM
If a dataset cannot be classified by using a straight line, then such data is termed as non-linear data and classifier
used is called as Non-linear SVM classifier.
.
● If the data is linearly arranged, we can separate it with a straight line; for non-linear data, a single straight line is not enough.
● To separate such data points, we need to add one more dimension. For linear data we used two dimensions, x and y, so for non-linear data we add a third dimension, z.
● SVM then divides the dataset into classes in the following way.
● We apply a transformation and add one more dimension, which we call the z-axis.
● Let the value of points on the z plane be z = x² + y², which can be interpreted as the (squared) distance of a point from the origin. If we now plot along the z-axis, a clear separation is visible and a line can be drawn.
● When we transform this line back to the original plane, it maps to a circular boundary. These transformations are called kernels.
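A brief sketch of the kernel trick in scikit-learn: on data that is not linearly separable (here, concentric circles from a synthetic dataset chosen for illustration), an RBF kernel typically separates the classes far better than a linear one:
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not separable by any straight line in 2D
X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# A linear SVM struggles, while an RBF kernel separates the classes
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(f"{kernel} kernel: test accuracy = {clf.score(X_test, y_test):.2f}")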
Common Kernels used with SVM
● Linear Kernel
● Polynomial Kernel
● Gaussian Kernel
● RBF – Gaussian Radial Basis Function (default)
● Sigmoid Kernel
● Laplace RBF Kernel
● ANOVA Radial Basis Kernel