Supervised Learning Algorithms
Unsupervised Learning Algorithms
Machine Learning
Prof. Purvi Patel
purvipatel.it@silveroakuni.ac.in
What is Linear Regression?
• Definition:
• Linear regression is a type of supervised machine learning algorithm that computes the
linear relationship between the dependent variable and one or more independent
features by fitting a linear equation to observed data.
• Types:
o Simple Linear Regression: One independent variable
o Multiple Linear Regression: More than one independent variable
o Univariate Linear Regression: One dependent variable
o Multivariate Regression: More than one dependent variable
Types of Linear Regression
• Simple Linear Regression:
• Equation: y = β₀ + β₁X
• Variables:
o y: Dependent variable
o X: Independent variable
o β₀: Intercept
o β₁: Slope
• Multiple Linear Regression:
• Equation: y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ
• Variables:
o y: Dependent variable
o X₁, X₂, ..., Xₙ: Independent variables
o β₀: Intercept
o β₁, β₂, ..., βₙ: Slopes
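A minimal Python sketch (not part of the original slides) showing how both variants can be fit with scikit-learn; the data values below are made up purely for illustration.

# Illustrative sketch: simple and multiple linear regression with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression: one independent variable
X_simple = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # e.g. years of experience
y = np.array([30.0, 35.0, 42.0, 48.0, 55.0])               # e.g. salary in thousands

model = LinearRegression().fit(X_simple, y)
print("Intercept (beta_0):", model.intercept_)
print("Slope (beta_1):", model.coef_)

# Multiple linear regression: more than one independent variable
X_multi = np.array([[1.0, 3.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0], [5.0, 5.0]])
model_multi = LinearRegression().fit(X_multi, y)
print("Intercept:", model_multi.intercept_, "Slopes:", model_multi.coef_)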
What is the Best Fit Line?
• Objective: To locate the best-fit line that minimizes the error between predicted and actual values.
• Best Fit Line: A straight line representing the relationship between the dependent and independent variables.
• Slope: Indicates the change in the dependent variable for a unit change in the independent variable(s).
• Equation: y = β₀ + β₁X
• Assumption: A linear relationship between X (e.g., experience) and Y (e.g., salary).
Hypothesis Function in Linear Regression
• Equation: Ŷ = θ₁ + θ₂X
• For each training example: ŷᵢ = θ₁ + θ₂xᵢ
• Variables:
o yᵢ: True values (dependent variable)
o xᵢ: Input training data (independent variable)
o ŷᵢ: Predicted values
o θ₁: Intercept
o θ₂: Coefficient of x
Cost Function for Linear Regression
• Definition: The error or difference between the predicted value Ŷ and the true value Y.
• Mean Squared Error (MSE):
• Equation: J(θ) = (1/n) Σᵢ₌₁ⁿ (ŷᵢ - yᵢ)²
• Variables:
o J(θ): Cost function
o n: Number of data points
o ŷᵢ: Predicted values
o yᵢ: Actual values
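A minimal sketch of the MSE cost in Python, assuming the same single-feature hypothesis ŷᵢ = θ₁ + θ₂xᵢ from the previous slide; the data and parameter values are invented for demonstration.

# Illustrative sketch: computing the MSE cost J(theta) for a given intercept and slope.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 35.0, 42.0, 48.0, 55.0])

def mse_cost(theta1, theta2, x, y):
    """J(theta) = (1/n) * sum((y_hat_i - y_i)^2), with y_hat_i = theta1 + theta2 * x_i."""
    y_hat = theta1 + theta2 * x
    return np.mean((y_hat - y) ** 2)

print(mse_cost(theta1=25.0, theta2=6.0, x=x, y=y))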
Minimizing the Cost Function
• Objective: Update θ₁ and θ₂ to minimize the error between predicted and true values.
• Gradient Descent: An iterative process that updates θ₁ and θ₂ using gradients calculated from the MSE.
• For linear regression the MSE cost is convex, so gradient descent converges to the global minimum.
Gradient Descent for Linear Regression
• Process:
• Calculate gradients: ∂J(θ)/∂θ₁ and ∂J(θ)/∂θ₂
• Update parameters:
o θ₁ ← θ₁ - α · ∂J(θ)/∂θ₁
o θ₂ ← θ₂ - α · ∂J(θ)/∂θ₂
• Learning Rate (α): Controls the step size in gradient descent.
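A minimal gradient-descent sketch for the MSE cost above; the learning rate, iteration count, and data are arbitrary illustrative choices, not values prescribed by the slides.

# Illustrative sketch: gradient descent updates for theta1 (intercept) and theta2 (slope).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 35.0, 42.0, 48.0, 55.0])

theta1, theta2 = 0.0, 0.0      # initial parameters
alpha, n_iters = 0.05, 2000    # learning rate and number of iterations
n = len(x)

for _ in range(n_iters):
    y_hat = theta1 + theta2 * x
    error = y_hat - y
    grad_theta1 = (2.0 / n) * np.sum(error)        # dJ/d(theta1)
    grad_theta2 = (2.0 / n) * np.sum(error * x)    # dJ/d(theta2)
    theta1 -= alpha * grad_theta1
    theta2 -= alpha * grad_theta2

print("theta1 (intercept):", theta1, "theta2 (slope):", theta2)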
Data Relations
What is Polynomial Regression?
• Definition:
o Polynomial regression is a type of regression analysis used in statistics and machine learning when the
relationship between the independent variable (input) and the dependent variable (output) is not
linear.
• Non-linear Relationship:
o Allows for more flexibility by fitting a polynomial equation to the data.
Why Polynomial Regression?
• Curvilinear Relationships: Suitable for relationships that are better represented by a curve rather than a straight line.
• Captures non-linear patterns.
• Feature Engineering: Adds higher-order terms of the independent features to the feature space.
How Does Polynomial Regression Work?
• General Form:
• Equation: y = β₀ + β₁x + β₂x² + ... + βₙxⁿ + ε
• Variables:
o y: Dependent variable
o x: Independent variable
o β₀, β₁, ..., βₙ: Coefficients of the polynomial terms
o n: Degree of the polynomial
o ε: Error term
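A minimal sketch of polynomial regression via feature expansion in scikit-learn; the degree, noise level, and synthetic quadratic data are arbitrary choices made only to show the idea.

# Illustrative sketch: polynomial regression = polynomial features + linear regression.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(0, 0.3, 50)  # quadratic + noise

model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(x, y)
print("Coefficients:", model.named_steps["linearregression"].coef_)
print("Intercept:", model.named_steps["linearregression"].intercept_)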
Choosing the Polynomial Degree
• Degree (n): A crucial aspect of polynomial regression.
• Higher Degree: Allows the model to fit the training data more closely but may lead to overfitting.
• Complexity: The degree should be chosen based on the complexity of the underlying relationship in the data.
What is Support Vector Machine (SVM)?
• Definition: SVM is a supervised machine learning algorithm used for both classification and regression, though it is best suited for classification.
• Objective: Find the optimal hyperplane in an N-dimensional space that separates the data points into different classes.
How SVM Works
• Hyperplane: The decision boundary that separates data points of different classes.
• Margin: The distance between the hyperplane and the nearest data points from each class.
• Maximum Margin Hyperplane: The hyperplane with the largest margin, providing the best separation.
SVM with Outliers
• Outliers: Data points that do not fit the general pattern.
• Soft Margin: Allows some misclassifications to handle outliers.
• Hinge Loss: A penalty for misclassified points, proportional to the distance from the margin.
Non-Linearly Separable Data
• Kernel Trick: SVM uses kernel functions to map data to a higher-dimensional space where it becomes linearly separable.
• Common Kernels:
o Linear
o Polynomial
o Radial Basis Function (RBF)
o Sigmoid
Support Vector Machine Terminology
• Hyperplane: The decision boundary that separates data points of different classes in a feature space.
• Support Vectors: The data points closest to the hyperplane, playing a critical role in defining the hyperplane and margin.
• Margin: The distance between the support vectors and the hyperplane, which SVM aims to maximize.
SVM Kernel Functions
• Kernel Functions:
o Linear: K(xᵢ, xⱼ) = xᵢᵀxⱼ
o Polynomial: K(xᵢ, xⱼ) = (γ xᵢᵀxⱼ + b)ⁿ
o Gaussian RBF: K(xᵢ, xⱼ) = exp(-γ ||xᵢ - xⱼ||²)
o Sigmoid: K(xᵢ, xⱼ) = tanh(α xᵢᵀxⱼ + b)
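A minimal sketch comparing the four common kernels with scikit-learn's SVC on a toy two-moons dataset; the dataset, noise level, and hyperparameters are arbitrary illustrative choices.

# Illustrative sketch: trying the common SVM kernels on non-linearly separable toy data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale").fit(X_train, y_train)
    print(kernel, "test accuracy:", clf.score(X_test, y_test))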
Advantages of SVM
• Effective in high-dimensional spaces.
• Memory efficient, as it uses only a subset of training points (the support vectors).
• Different kernel functions can be specified for the decision function, with the option to define custom kernels.
Mathematical Intuition of SVM
• Binary Classification: Consider a binary classification problem with two classes, labeled +1 and -1.
• Hyperplane Equation: wᵀx + b = 0
Linear SVM Classifier
• Distance Calculation: dᵢ = (wᵀxᵢ + b) / ||w||
• Decision Rule: ŷ = 1 if wᵀx + b ≥ 0, else 0
• Optimization:
o Hard Margin: Minimize (1/2)||w||² subject to yᵢ(wᵀxᵢ + b) ≥ 1
o Soft Margin: Minimize (1/2)||w||² + C Σᵢ ζᵢ subject to yᵢ(wᵀxᵢ + b) ≥ 1 - ζᵢ and ζᵢ ≥ 0
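A small numeric sketch of the distance calculation and decision rule above, using hand-picked (hypothetical) values of w and b rather than parameters learned by an actual SVM solver.

# Illustrative sketch: signed distances and decision rule for a linear SVM.
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
b = 0.5                     # hypothetical bias

X = np.array([[1.0, 1.0], [-2.0, 0.5], [0.0, 3.0]])

scores = X @ w + b                          # w^T x + b for each point
distances = scores / np.linalg.norm(w)      # d_i = (w^T x_i + b) / ||w||
predictions = np.where(scores >= 0, 1, 0)   # decision rule: 1 if w^T x + b >= 0, else 0

print("distances:", distances)
print("predictions:", predictions)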
Types of Support Vector Machines
• Linear SVM: Uses a linear decision boundary to separate data points of different classes.
• Non-Linear SVM: Uses kernel functions to handle non-linearly separable data by transforming it into a higher-dimensional space.
What is a Decision Tree?
• A versatile, interpretable algorithm used for predictive modeling.
• Suitable for both classification and regression tasks.
• A visual representation of decisions and their possible consequences.
Decision Tree Structure
• Root Node: Represents the initial feature or decision.
• Internal Nodes: Test on attributes, leading to further branching.
• Leaf Nodes: Represent the final decision or prediction.
• Branches: Indicate the outcomes of decisions.
• Splitting: The process of dividing nodes based on decision criteria.
• Pruning: Removing unnecessary branches to improve accuracy.
Decision Tree Approach
Attribute Selection Measures
• Information Gain: Measures the change in entropy after a split.
• Gini Index: Measures the impurity of a node.
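A minimal sketch of both measures, computed by hand on a small made-up set of class labels; the split shown is just one candidate partition of the node.

# Illustrative sketch: entropy, Gini impurity, and the information gain of a candidate split.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = np.array([1, 1, 1, 0, 0, 0, 1, 0])
left, right = parent[:4], parent[4:]        # a candidate split of the node
print("Gini of parent:", gini(parent))
print("Information gain of split:", information_gain(parent, left, right))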
How Decision Trees Are Formed
• Recursive Partitioning: Splitting data based on attributes.
• Selecting Attributes: Use criteria such as Information Gain or the Gini Index.
• Stopping Criterion: Maximum depth or a minimum number of instances in a leaf node.
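A minimal sketch of these steps with scikit-learn's DecisionTreeClassifier; the dataset and the specific stopping values are arbitrary illustrative choices.

# Illustrative sketch: growing a decision tree with a chosen criterion and stopping criteria.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    criterion="gini",        # attribute selection measure (could also be "entropy")
    max_depth=3,             # stopping criterion: maximum depth
    min_samples_leaf=5,      # stopping criterion: minimum instances in a leaf
    random_state=0,
).fit(X, y)

print(export_text(tree))     # text view of the learned splits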
Advantages
• Interpretability: Easy to understand and visualize.
• Versatility: Handles both numerical and categorical data.
• Feature Importance: Provides insights into which features are most important.
• Handling Missing Data: Decision trees can manage missing values effectively.
Disadvantages
• Overfitting: Decision trees can be prone to overfitting, especially with small datasets.
• Data Sensitivity: Small changes in the data can lead to a completely different tree.
• Bias: Potential bias in the presence of imbalanced data.
What is Random Forest?
• A powerful ensemble learning technique.
• Combines multiple decision trees to enhance predictive accuracy.
• Introduced in 2001 by Leo Breiman.
• Widely used for both classification and regression tasks.
Fundamental Concepts
• Ensemble of Decision Trees: Multiple trees work together toward a common output.
• Randomness in Training: Random subsets of data and features reduce overfitting.
• Final Prediction: Aggregation of individual tree predictions (voting for classification, averaging for regression).
Random Forest Algorithm
• Training Phase: Builds multiple decision trees using random subsets of data and features.
• Prediction Phase: Aggregates the results from all trees for the final prediction.
• Advantages: Reduces overfitting, improves accuracy, handles complex data.
Ensemble Learning Models
• Concept: Combining multiple models to improve performance.
• Analogy: Like a team of experts collaborating on a problem.
• Examples: Random Forest, XGBoost, AdaBoost, LightGBM, Bagging
Bagging and Boosting
• Bagging: Training multiple weak models on different data subsets and averaging their results.
• Boosting: Sequential training where each model corrects the errors of the previous one, with weighted voting for the final prediction.
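A minimal sketch contrasting the two ideas with scikit-learn's bagging and AdaBoost ensembles; the dataset and hyperparameters are arbitrary illustrative choices, and both ensembles rely on their default tree-based base learners.

# Illustrative sketch: bagging vs. boosting on a synthetic classification problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)    # default base: decision tree
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)  # default base: depth-1 tree (stump)

print("Bagging CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
print("Boosting CV accuracy:", cross_val_score(boosting, X, y, cv=5).mean())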
How Random Forest Works
• Step 1: Select K random data points from the training set.
• Step 2: Build a decision tree for the selected subset.
• Step 3: Choose the number N of decision trees.
• Step 4: Repeat steps 1 and 2 to build the forest.
• Step 5: For new data, aggregate predictions from all trees (majority vote for classification, average for regression).
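A minimal sketch of these steps bundled up by scikit-learn's RandomForestClassifier, where n_estimators corresponds to N, the number of trees; the dataset and other values are arbitrary illustrative choices.

# Illustrative sketch: training and using a random forest.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
print("Feature importances:", forest.feature_importances_)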
Random Forest Approach
Key Features of Random Forest
• High Predictive Accuracy: Collaborative decision-making leads to better predictions.
• Resistance to Overfitting: Randomness in training helps the model generalize better.
• Handling Large Datasets: Efficiently manages large and complex datasets.
• Variable Importance: Identifies and ranks the most important features.
• Built-in Cross-Validation: Out-of-bag samples are used for internal validation.
• Handling Missing Values: Robust against incomplete data.
• Parallelization: Trees can be trained simultaneously, speeding up the process.
Potential Drawbacks
• Complexity: More computationally intensive than single models.
• Interpretability: Less transparent than individual decision trees.
• Memory Usage: Requires more memory to store multiple trees.
Unsupervised Learning
• Learning from unlabeled data without predefined categories.
• Focuses on discovering patterns and relationships autonomously.
How Unsupervised Learning Works
• Process Overview:
– No explicit guidance or labeled data.
– The model identifies hidden structures in the data.
• Example:
– Distinguishing between different species of animals based on their traits, without prior labeling.
Key Characteristics of Unsupervised Learning
• Pattern Discovery: Models find patterns in data without labels.
• Clustering: Grouping similar data points together.
• Feature Extraction: Capturing the essential information needed to differentiate data.
• Label Association: Assigning categories based on discovered patterns.
Example of Unsupervised Learning
• Scenario: A model is trained on unlabeled images of cows, elephants, and camels.
• The model identifies and groups the images based on similarities, even without prior knowledge of what a cow, elephant, or camel looks like.
Unsupervised Learning
Types of Unsupervised Learning
• Clustering: Grouping similar data points together.
• Association: Identifying patterns and relationships between items in a dataset.
Clustering
• Types of Clustering
– Hierarchical Clustering
– K-means Clustering
– Principal Component Analysis (PCA)
– Singular Value Decomposition (SVD)
– Independent Component Analysis (ICA)
– Gaussian Mixture Models (GMMs)
– Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
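A minimal sketch of one of these methods, K-means, applied to synthetic unlabeled data with scikit-learn; the number of clusters simply matches how the toy data was generated.

# Illustrative sketch: K-means clustering on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # unlabeled points for clustering

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster centers:\n", kmeans.cluster_centers_)
print("First ten cluster assignments:", kmeans.labels_[:10])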
Association Rule Learning
• Definition
– Identifying patterns in data using association rules.
• Algorithms
– Apriori Algorithm
– Eclat Algorithm
– FP-Growth Algorithm
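A minimal sketch of the support and confidence measures that these algorithms are built on, computed by hand over a tiny made-up transaction list (not an implementation of Apriori itself).

# Illustrative sketch: support and confidence of an association rule.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: {bread} -> {milk}
print("support({bread, milk}):", support({"bread", "milk"}, transactions))
print("confidence(bread -> milk):", confidence({"bread"}, {"milk"}, transactions))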
Evaluating Unsupervised Learning Models
• Evaluation Metrics
– Silhouette Score
– Calinski-Harabasz Score
– Adjusted Rand Index
– Davies-Bouldin Index
– F1 Score (adapted for clustering)
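A minimal sketch applying three of these internal metrics from scikit-learn to a K-means result on synthetic data; the dataset and cluster count are arbitrary illustrative choices.

# Illustrative sketch: evaluating a clustering with internal metrics.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette Score:", silhouette_score(X, labels))
print("Calinski-Harabasz Score:", calinski_harabasz_score(X, labels))
print("Davies-Bouldin Index:", davies_bouldin_score(X, labels))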
Applications of Unsupervised Learning
• Areas of Application
– Anomaly Detection
– Scientific Discovery
– Recommendation Systems
– Customer Segmentation
– Image Analysis
Advantages of Unsupervised Learning
• No need for labeled training data.
• Effective for dimensionality reduction.
• Capable of finding unknown patterns.
• Provides insights from unlabeled data.
Disadvantages of Unsupervised Learning
• Hard to measure accuracy due to the lack of predefined answers.
• Typically lower accuracy compared to supervised learning.
• Requires manual interpretation and labeling after grouping.
• Sensitive to data quality; performance evaluation is challenging.
Thank You !!