[go: up one dir, main page]

0% found this document useful (0 votes)
15 views12 pages

Machine Learning

Machine Learning (ML) is a subset of Artificial Intelligence that enables computers to learn from data and make predictions without explicit programming. The ML process involves steps such as data collection, preprocessing, model selection, training, evaluation, and deployment, with various algorithms like neural networks, linear regression, and decision trees used for different tasks. Applications of ML span across multiple fields including finance, healthcare, and retail, enhancing decision-making and automating processes.

Uploaded by

Marzia Borno
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views12 pages

Machine Learning

Machine Learning (ML) is a subset of Artificial Intelligence that enables computers to learn from data and make predictions without explicit programming. The ML process involves steps such as data collection, preprocessing, model selection, training, evaluation, and deployment, with various algorithms like neural networks, linear regression, and decision trees used for different tasks. Applications of ML span across multiple fields including finance, healthcare, and retail, enhancing decision-making and automating processes.

Uploaded by

Marzia Borno
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 12

Machine Learning (ML) is a branch Evaluating the model − When Neural Networks − It works like the

of Artificial Intelligence (AI) that module is trained, the model has to human brain with many connected
works on algorithm developments be tested on new data that they nodes. They help to find patterns
and statistical models that allow haven't been able to see during and are used in language
computers to learn from data and training. processing, image and speech
make predictions or decisions recognition, and creating images.
without being explicitly Hyperparameter Tuning and
programmed. Optimization − After evaluating the Linear Regression − It predicts
model, you may need to adjust its numbers based on past data. For
How does Machine Learning hyperparameters to make it more example, it helps estimate house
Work? efficient. prices in an area.

Machine Learning process includes Predictions and Deployment − Logistic Regression − It predicts like
Project Setup, Data Preparation, When the model has been "yes/no" answers and it is useful for
Modeling and Deployment. The programmed and optimized, it will spam detection and quality control.
following figure demonstrates the be ready to estimate new data. This
common working process of is done by adding new data to the Clustering − It is used to group
Machine Learning. It follows some model and using its output for similar data without instructions
set of steps to do the task decision-making or other analysis. and it helps to find patterns that
humans might miss.
Sequential Process flow of Types of Machine Learning
Machine Learning Decision Trees − They help to
1. Supervised Machine Learning − classify data and predict numbers
Data Collection − Data collection is It is a type of machine learning that using a tree-like structure. They are
an initial step in the process of trains the model using labeled easy to check and understand.
machine learning. In this stage, it datasets to predict outcomes.
collects data from the different Random forests − They combine
sources such as databases, text 2. Unsupervised Machine Learning multiple decision trees to improve
files, pictures, sound files, or web − It is a type of machine learning predictions.
scraping. that learns patterns and structures
within the data without human Importance of Machine Learning
Data Pre-processing − It is a key supervision.
Data Processing − Machine learning
step in the process of machine
3. Semi-supervised Learning − It is is useful to analyze large data from
learning, which involves deleting
a type of machine learning that is social media, sensors, and other
duplicate data, fixing errors,
neither fully supervised nor fully sources and help to reveal patterns
managing missing data either by
unsupervised. The semi-supervised and insights to improve decision-
eliminating or filling it in, and
learning algorithms basically fall making.
adjusting and formatting the data.
between supervised and
Data-Driven Insights − Machine
Choosing the Right Model − The unsupervised learning methods.
learning algorithms find trends and
next step is to select a machine
4. Reinforcement Machine connections in big data that
learning model; once data is
Learning − It is a type of machine humans might miss, which helps to
prepared, then we apply it to ML
learning model that is similar to take better decisions and
models like linear regression,
supervised learning but does not predictions.
decision trees, and neural networks
that may be selected to implement. use sample data to train the
Automation − Machine learning
algorithm. This model learns by trial
automates the repetitive tasks,
Training the Model − This step and error.
reducing errors and saving time.
includes training the model from
the data so it can make better Several machine learning
Personalization − Machine learning
predictions. algorithms are commonly used.
is useful to analyze the user
These include:
preferences to provide
personalized recommendations in websites and social media, handling complex structure of algorithms
e-commerce, social media, and FAQs, giving recommendations, and developed similar to the human
streaming services. assisting in e-commerce. brain.

Predictive Analytics − Machine Computer Vision − It helps The effectiveness of deep learning
learning models use past data to computers in analyzing the images models for complex problems is
predict future outcomes, which and videos to take action. more compared to machine
may help for sales forecasts, risk learning models.
management, and demand Recommendation Engines − ML
planning. recommendation engines suggest Machine Learning Vs. Generative
products, movies, or content based AI
Pattern Recognition − Machine on user behavior.
learning is useful in pattern Machine learning and Generative AI
recognition during image Robotic Process Automation (RPA) are different branches with
processing, speech recognition, and − RPA uses AI to automate different applications. While
natural language processing. repetitive tasks and reduce manual Machine Learning is used for
work. predictive analysis and decision-
Finance − Machine learning is used making, Generative AI focuses on
in credit scoring, fraud detection, Automated Stock Trading − AI- creating content, including realistic
and algorithmic trading. driven trading platforms make rapid images and videos in existing
trades to optimize stock portfolios patterns.
Retail − Machine learning helps to without human intervention.
enhance the recommendation What is Machine Learning Life
systems, supply chain How does Machine Learning work? Cycle?
management, and customer
Decision Process − Based on the The machine learning life cycle is an
service.
input data and output labels iterative process that moves from a
Fraud Detection & Cybersecurity − provided to the model, it will business problem to a machine
Machine learning detects the produce a logic about the pattern learning solution.
fraudulent transactions and identified.
security threats in real time. ML Life Cycle
Cost Function − It is the measure of
Continuous Improvement − error between expected value and Problem Definition
Machine learning models update predicted value.
The first step in the machine
regularly with new data, which
Optimization Process − Cost learning life cycle is to identify the
allows them to adapt and improve
function can be minimized by problem you want to solve. As this
over time.
adjusting the weights at the step lays the foundation for
Applications of Machine Learning training stage. building a machine learning model,
the problem definition has to be
Machine learning is used in various Machine Learning Vs. Deep clear and concise.This stage
fields. Some of the most common Learning involves understanding the
applications include: business problem, defining the
Deep learning is a sub-field of
problem statement, and identifying
Speech Recognition − Machine Machine learning. The actual
the success criteria for the machine
learning is used to convert spoken difference between these is the
learning model.
language into text using natural way the algorithm learns.
language processing (NLP). Data Preparation
In Machine learning, computers
Customer Service − There are learn from large datasets using Data preparation is a process to
several chatbots that are useful for algorithms to perform tasks like prepare data for analysis by
reducing human interaction and prediction and recommendation. performing data exploration,
providing better support on Whereas Deep learning uses a feature engineering, and feature
selection. Data exploration involves accuracy and performance of the 1. Model Selection
visualizing and understanding the machine learning model. Issues that
data, while feature engineering have to be addressed are missing Model selection is a crucial step in
involves creating new features from values, duplicate data, invalid data the machine learning workflow. The
the existing data. Data preparation and noise. decision of choosing a model
process includes collecting data, depends on basic features like
preprocessing data, and feature 3. Analyzing Data characteristics of the data,
engineering & feature selection. complexity of the problem, desired
After the data is all sorted, it is time outcomes and how well it aligns
Let's discuss each step involved in
to understand the data that is with the defined problem.
the data preparation phase of
collected. The data is visualized and
machine learning life cycle process
statistically summarized to gain 2. Model Training
1. Data Collection insights.Various tools like Power BI,
Tableau are used to visualize data In this process, the algorithm is fed
After the problem statement is which helps in understanding the with a preprocessed dataset to
analyzed, the next step would be patterns and trends in the data. identify and understand the
collecting data. This involves patterns and relationships in the
gathering data from various sources 4. Feature Engineering and specified features.Consistent
which is given as a raw material to Selection training of a model by adjusting
the machine learning model. Few parameters would improve the
A 'Feature' is an individual prediction rate and enhance
features that are considered while
measurable quantity which is accuracy.
collecting data are −
preferably observed when the
Relevant and usefulness − The data
machine learning model is being 3. Model Evaluation
collected has to be relevant to the
trained. Feature Engineering is the
problem statement, and also
process of creating new features or In model evaluation, the
should be useful enough to train
enhancing the existing ones to performance of the machine
the machine learning model
accurately understand the patterns learning model is evaluated using a
efficiently.
and trends in the data.Feature set of evaluation metrics. These
selection involves the process of metrics measure the accuracy,
Quality and Quantity − The quality
picking up features that are precision, recall, and F1 score of the
and quantity of the data collected
consistent and more relevant to the model.
would directly impact the
performance of the machine problem statement.
Model Deployment
learning model.
Model Development
In the model deployment phase, we
Variety − Make sure that the data
In the model development phase, deploy the machine learning model
collected is diverse so that the
the machine learning model is built into production. This process
model can be trained with multiple
using the prepared data. The model involves integrating the tested
scenarios to recognize the patterns.
building process involves selecting model with existing systems to
the appropriate machine learning make it available to users,
There are various sources from
algorithm, algorithm training, management or other purposes.
where the data can be collected like
tuning the hyperparameters of the This also involves testing the model
surveys, existing databases, and
algorithm, and evaluating the in a real-world scenario.Two
online platforms like Kaggle.
performance of the model using important factors that have to be
2. Data Preprocessing cross-validation techniques.This checked before deploying are
phase mainly consists of three whether the model is portable i.e,
The data collected often might be steps, model selection, model the ability to transfer the software
unstructured and messy which training, and model evaluation. from one machine to another and
causes it to negatively affect the Let's discuss these three steps in scalable i.e, the model need not be
outcomes, hence pre processing detail − redesigned to maintain
data is important to improve the performance.
3. Minimizing the Error: The Least β₁ is the slope, indicating how much
Squares Method y changes for each unit change in x.
Monitor and Maintenance
To find the best-fit line, we use a For multiple linear regression (with
Monitoring in machine learning method called Least Squares. The more than one independent
involves techniques to measure the idea behind this method is to variable), the hypothesis function
model performance metrics and to minimize the sum of squared expands to:
detect issues in the differences between the actual
models.Sometimes when the issue values (data points) and the h(x₁,x₂,...,xₖ)=β₀+β₁x₁+β₂x₂+...+βₖxₖh
detected in the designed model predicted values from the line.
cannot be solved with training it Where: x₁,x₂,...,xₖ are the
These differences are called
with new data, the issue becomes independent variables.
residuals.The formula for residuals
the problem statement. is:Residual=yᵢ−y^ᵢ
β₀ is the intercept.
Best Fit Line in Linear Regression Where:
β₁,β₂,...,βₖ are the coefficients,
In linear regression, the best-fit line representing the influence of each
yᵢ is the actual observed value
is the straight line that most respective independent variable on
accurately represents the y^ ᵢ is the predicted value from the the predicted output.
relationship between the line for that
independent variable (input) and
the dependent variable (output). It The least squares method
Types of Linear Regression
is the line that minimizes the minimizes the sum of the squared
difference between the actual data residuals: Sum of squared
1. Simple Linear Regression
points and the predicted values errors(SSE)=Σ(yᵢ−y^ᵢ)²
from the model. Simple linear regression is used
This method ensures that the line
when we want to predict a target
2. Equation of the Best-Fit Line best represents the data where the
value (dependent variable) using
sum of the squared differences
only one input feature
For simple linear regression (with between the predicted values and
(independent variable). It assumes
one independent variable), the actual values is as small as possible.
a straight-line relationship between
best-fit line is represented by the
the two.
equation y=mx+b Where: Hypothesis function in Linear
Regression
Formula: y^=θ0+θ1x
y is the predicted value (dependent
variable) In linear regression, the hypothesis
Where:
function is the equation used to
x is the input (independent make predictions about the y^ is the predicted value
variable) dependent variable based on the
independent variables.For a simple x is the input (independent
m is the slope of the line (how case with one independent variable)
much y changes when x changes) variable, the hypothesis function is:
θ 0 is the intercept (value of y^
b is the intercept (the value of y h(x)=β₀+β₁x Where: when x=0) θ1is the slope or
when x = 0) coefficient (how much y^ changes
h(x)(or y^ ) is the predicted value of with one unit of x) . Example:
The best-fit line will be the one that the dependent variable (y).x is the Predicting a person’s salary (y)
optimizes the values of m (slope) independent variable. based on their years of experience
and b (intercept) so that the
(x).
predicted y values are as close as β₀ is the intercept, representing the
possible to the actual data points. value of y when x is 0.

2. Multiple Linear Regression


Multiple linear regression involves the MSE value converges to the Instances of b: 5
more than one independent global minima, signifying the most Instances of a: 3
variable and one dependent accurate fit of the linear regression
Entropy H(X)=[(38)log⁡238+(58)log⁡
variable. The equation for multiple line to the dataset. 258]=−[0.375(−1.415)+0.625(−0.67
linear regression is: 8)]=−
Decision Tree in Machine Learning (−0.53−0.424)=0.954Entropy H(
y^=θ 0+θ 1x 1+θ 2x 2+⋯+θ nx n X)=[(83)log283+(85)log285
A decision tree is a supervised ]=−[0.375(−1.415)+0.625(−0.678)]=
where: y^ is the predicted learning algorithm used for both 2. Gini Index
Gini Index is a metric to measure
valuex1,x2,…,xn classification and regression tasks.
how often a randomly chosen
It has a hierarchical tree structure element would be incorrectly
x 1,x 2,…,x nare the independent which consists of a root node, identified. It means an attribute with
variables θ1,θ2,…,θn are the branches, internal nodes and leaf a lower Gini index should be
coefficients (weights) nodes. It It works like a flowchart preferred.
corresponding to each predictor.θ0 Formula for Gini Index is given by :
help to make decisions step by step Gini=1−∑i=1npi2Gini=1−∑i=1n
is the intercept. where: pi2
The goal of the algorithm is to find Internal nodes represent attribute
the best Fit Line equation that can tests
predict the values based on the Types of Decision Tree Algorithms
independent variables. Branches represent attribute values
There are six different decision tree
Cost function for Linear Regression Leaf nodes represent final decisions algorithms as shown in diagram are
or predictions. listed below.
As we have discussed earlier about
best fit line in linear regression, its Information Gain and Gini Index in 1. ID3 (Iterative Dichotomiser
3)
not easy to get it easily in real life Decision Tree
ID3 is a classic decision tree
cases so we need to calculate errors algorithm commonly used for
that affects it. 1. Information Gain
classification tasks. It works by
greedily choosing the feature
In Linear Regression, the Mean Information Gain tells us how that maximizes the information
Squared Error (MSE) cost function is useful a question (or feature) is for gain at each node.
employed, which calculates the splitting data into groups. It Entropy: It measures impurity in
measures how much the the dataset. Denoted by H(D) for
average of the squared errors dataset D is calculated using the
between the predicted values uncertainty decreases after the formula:
split. A good question will create H(D)=Σi=1npilog2(pi)H(D)=Σi=1
y^i and the actual values yi . The clearer groups and the feature with npilog2(pi)
purpose is to determine the the highest Information Gain is Information gain: It quantifies
the reduction in entropy after
optimal values for the intercept θ1 chosen to make the decision.
splitting the dataset on a
and the coefficient of the input feature:
Gain(S,A)=Entropy(S)−∑ vA∣S∣/∣Sv
−Σv=1V∣Dv∣∣D∣H(Dv)Informatio
feature θ2 providing the best-fit InformationGain=H(D)

nGain=H(D)−Σv=1V∣D∣∣Dv∣
line for the given data points. The |.Entropy(S v)
linear equation expressing this
Entropy: is the measure of H(Dv)
relationship is y^I =θ1+θ2 x i .
uncertainty of a random variable it
MSE function can be calculated as: characterizes the impurity of an
arbitrary collection of examples. 2. C4.5
C4.5 uses a modified version of
Cost function(J)=1n∑ni(yi^−yi)2 The higher the entropy more the
information gain called the gain
information content. ratio to reduce the bias towards
Utilizing the MSE function, the features with values.which measures
iterative process of gradient Example: the amount of data required to
descent is applied to update the For the set X = {a,a,a,b,b,b,b,b} describe an attribute values:
Total instances: 8
values of \1&θ 2 . This ensures that
GainRatio=SplitgainGaininformatio  Hinge Loss: A loss function
nGainRatio=Gaininformation Support Vector Machine (SVM) penalizing misclassified points
Splitn or margin violations and is
Algorithm
combined with regularization
Support Vector Machine (SVM) in SVM.
3. CART (Classification and is a supervised machine learning  Dual Problem: Involves
Regression Trees) algorithm used for classification solving for Lagrange
CART is a widely used decision and regression tasks. It tries to multipliers associated with
tree algorithm that is used for find the best boundary known as support vectors, facilitating
classification and regression hyperplane that separates the kernel trick and efficient
tasks.The feature that minimizes different classes in the data. It is computation.
the Gini impurity is selected for useful when you want to do Mathematical Computation of
splitting at each node. The binary classification like spam SVM
formula is: vs. not spam or cat vs. dog. Consider a binary classification
Gini(D)=1−Σi=1npi2Gini(D)=1− The main goal of SVM is to problem with two classes,
Σi=1npi2 maximize the margin between labeled as +1 and -1. We have a
where pipi is the probability of the two classes. The larger the training dataset consisting of
class ii in dataset DD. margin the better the model input feature vectors X and their
performs on new and unseen corresponding class labels Y.
4. CHAID (Chi-Square data. The equation for the linear
Automatic Interaction Key Concepts of Support hyperplane can be written as:
Vector Machine wTx+b=0wTx+b=0
Detection)
 Hyperplane: A decision Where:
CHAID uses chi-square tests to boundary separating different
determine the best splits  ww is the normal vector to the
classes in feature space and hyperplane (the direction
especially for categorical is represented by the
variables. This approach is perpendicular to it).
equation wx + b = 0 in linear  bb is the offset or bias term
particularly useful for analyzing classification.
large datasets with many representing the distance of
 Support Vectors: The closest the hyperplane from the origin
categorical features. The Chi- data points to the hyperplane,
Square Statistic formula: along the normal vector ww.
crucial for determining the Distance from a Data Point to
X2=Σ(Oi−Ei)2EiX2=ΣEi(Oi−Ei)2 hyperplane and margin in
Where: the Hyperplane
SVM.
 OiOi represents the observed  Margin: The distance The distance between a data
frequency between the hyperplane and point xixiand the decision
 EiEi represents the expected the support vectors. SVM boundary can be calculated as:
frequency in each category. aims to maximize this margin di=wTxi+b∣∣w∣∣di=∣∣w∣∣wTxi+b
for better classification where ||w|| represents the
performance. Euclidean norm of the weight
5. MARS (Multivariate  Kernel: A function that maps vector w.
Adaptive Regression Splines) data to a higher-dimensional Linear SVM Classifier
MARS is an extension of the space enabling SVM to Distance from a Data Point to
CART algorithm. It uses splines handle non-linearly separable the Hyperplane:
to model non-linear relationships data. y^={1: wTx+b≥00: wTx+b <0y^
between variables.  Hard Margin: A maximum- ={10: wTx+b≥0: wTx+b <0
Basis Functions: Each basis margin hyperplane that Where y^y^ is the predicted
function in MARS is a simple perfectly separates the data label of a data point.
linear function defined over a without misclassifications. Optimization Problem for SVM
range of the predictor variable.  Soft Margin: Allows some For a linearly separable dataset
The function is described as: misclassifications by the goal is to find the hyperplane
h(x)={x−tifx>tt−xifx≤t}h(x)={x− introducing slack variables, that maximizes the margin
tifx>tt−xifx≤t} balancing margin between the two classes while
Where maximization and ensuring that all data points are
 xx is a predictor variable misclassification penalties correctly classified. This leads to
 ttis the knot function. when data is not perfectly the following optimization
Knot Function: The knots are separable. problem:
the points where the piecewise  C: A regularization term minimizew,b12∥w∥2w,bminimize
linear functions connect. MARS balancing margin 21∥w∥2
places these knots to best maximization and Subject to the constraint:

⋯,myi(wTxi+b)≥1fori=1,2,3,
represent the data's non-linear misclassification penalties. A yi(wTxi+b)≥1fori=1,2,3,

⋯,m
structure. higher C value forces stricter
penalty for misclassifications.
Where: kernel allows SVM to handle Advantages of Support Vector
 yiyi is the class label (+1 or - non-linear classification Machine (SVM)
1) for each training instance. problems by mapping data 1. High-Dimensional
 xixi is the feature vector for into a higher-dimensional Performance: SVM excels in
the ii-th training instance. space. high-dimensional spaces,
 mm is the total number of The dual formulation optimizes making it suitable for image
training instances. the Lagrange multipliers αiαi classification and gene
The condition yi(wTxi+b)≥1yi and the support vectors are expression analysis.
(wTxi+b)≥1 ensures that each those training samples 2. Nonlinear Capability:
data point is correctly classified where αi>0αi>0. Utilizing kernel functions like
and lies outside the margin. SVM Decision Boundary RBF and polynomial SVM
Soft Margin in Linear SVM Once the dual problem is effectively handles nonlinear
solved, the decision boundary is relationships.
Classifier
given by: 3. Outlier Resilience: The soft
In the presence of outliers or margin feature allows SVM to
non-separable data the SVM w=∑i=1mαitiK(xi,x)
+bw=∑i=1mαitiK(xi,x)+b ignore outliers, enhancing
allows some misclassification by robustness in spam detection
introducing slack variables ζiζi. Where ww is the weight
vector, xx is the test data point and anomaly detection.
The optimization problem is 4. Binary and Multiclass
modified as: and bb is the bias term. Finally
the bias term bb is determined Support: SVM is effective for
minimize w,b12∥w∥2+C both binary classification and
by the support vectors, which
∥w∥2+C∑i=1mζi
∑i=1mζiw,bminimize 21 multiclass classification
satisfy:
ti(wTxi−b)=1⇒b=wTxi−titi suitable for applications in text
Subject to the constraints: classification.
yi(wTxi+b)≥1−ζiandζi≥0for i (wTxi−b)=1⇒b=wTxi−ti
5. Memory Efficiency: It
=1,2,…,myi(wTxi+b)≥1−ζiandζi Where xixi is any support
focuses on support vectors
≥0for i=1,2,…,m vector.
making it memory efficient
Where: This completes the
compared to other algorithms.
 CC is a regularization mathematical framework of the
parameter that controls the Support Vector Machine
trade-off between margin algorithm which allows for both
linear and non-linear Disadvantages of Support
maximization and penalty for Vector Machine (SVM)
misclassifications. classification using the dual
problem and kernel trick. 1. Slow Training: SVM can be
 ζiζi are slack variables that slow for large datasets,
Types of Support Vector
represent the degree of affecting performance in SVM
Machine
violation of the margin by in data mining tasks.
Based on the nature of the
each data point. 2. Parameter Tuning Difficulty:
decision boundary, Support
Dual Problem for SVM Selecting the right kernel and
Vector Machines (SVM) can be
The dual problem involves divided into two main parts: adjusting parameters like C
maximizing the Lagrange  Linear SVM: Linear SVMs requires careful tuning,
multipliers associated with the use a linear decision impacting SVM algorithms.
support vectors. This boundary to separate the data 3. Noise Sensitivity: SVM
transformation allows solving points of different classes. struggles with noisy datasets
the SVM optimization using When the data can be and overlapping classes,
kernel functions for non-linear precisely linearly separated, limiting effectiveness in real-
classification. linear SVMs are very suitable. world scenarios.
The dual objective function is This means that a single 4. Limited Interpretability: The
given by: straight line (in 2D) or a complexity of the hyperplane
maximize α12∑i=1m∑j=1mαiα hyperplane (in higher in higher dimensions makes
jtitjK(xi,xj)−∑i=1mαiαmaximize dimensions) can entirely SVM less interpretable than
21∑i=1m∑j=1mαiαjtitjK(xi,xj) divide the data points into other models.
−∑i=1mαi their respective classes. 5. Feature Scaling Sensitivity:
Where: Non-Linear SVM: Non-Linear Proper feature scaling is
 αiαi are the Lagrange
SVM can be used to classify
essential, otherwise SVM
multipliers associated with models may perform poorly.
data when it cannot be
the ithith training sample.
 titi is the class label for separated into two classes by a
the ithith-th training sample. straight line (in the case of 2D).
 K(xi,xj)K(xi,xj) is the kernel By using kernel functions,
function that computes the nonlinear SVMs can handle What is Kernel?
similarity between data nonlinearly separable data. Instead of explicitly computing
points xixi and xjxj. The the transformation the kernel
computes the dot product of compare it to fruits you already approach the Voronoi diagram
data points in the higher- know. boundaries.
dimensional space directly that  If k = 3, the algorithm looks at  Voronoi Diagram as a
helps a model find patterns in the 3 closest fruits to the new Special Case: When k = 1
complex data and transforming one. KNN’s decision boundaries
the data into a higher-  If 2 of those 3 fruits are directly correspond to the
dimensional space where it apples and 1 is a banana, the Voronoi diagram of the
becomes easier to separate algorithm says the new fruit is training points. Each region in
different classes or detect an apple because most of its the Voronoi diagram
relationships. neighbors are apples. represents the area where the
Popular kernel functions in Distance Metrics Used in KNN nearest training point is
SVM Algorithm closest.
 Radial Basis Function To identify nearest neighbor we Ensemble Learning
(RBF): Captures patterns in use below distance metrics: Ensemble learning is a method
data by measuring the 1. Euclidean Distance where we use many small
distance between points and Euclidean distance is defined as models instead of just one. Each
is ideal for circular or the straight-line distance of these models may not be very
spherical relationships. It is between two points in a plane or strong on its own, but when we
widely used as it creates space. put their results together, we get
flexible decision boundary. distance(x,Xi)=∑j=1d(xj−Xij)2]dis a better and more accurate
 Linear Kernel: Works for data tance(x,Xi)=∑j=1d(xj−Xij)2] answer. It's like asking a group
that is linearly separable 2. Manhattan Distance of people for advice instead of
problem without complex This is the total distance you just one person—each one
transformations. would travel if you could only might be a little wrong, but
 Polynomial Kernel: Models move along horizontal and together, they usually give a
more complex relationships vertical lines like a grid or city better answer.
using polynomial equations. streets. It’s also called "taxicab Types of Ensembles Learning
 Sigmoid Kernel: Mimics distance" because a taxi can in Machine Learning
neural network behavior using only drive along the grid-like There are three main types of

d(x,y)=∑i=1n∣xi−yi∣d(x,y)=∑i=1n
sigmoid function and is streets of a city. ensemble methods:

∣xi−yi∣
suitable for specific non-linear 1. Bagging (Bootstrap
problems. Aggregating): Models are
3. Minkowski Distance trained independently on
different random subsets of
Minkowski distance is like a
K-Nearest Neighbor(KNN) the training data. Their results
family of distances, which
are then combined—usually
Algorithm includes both Euclidean and
by averaging (for regression)
K-Nearest Neighbors (KNN) is a Manhattan distances as special
or voting (for classification).
supervised machine learning cases.
This helps reduce variance
algorithm generally used for d(x,y)=(∑i=1n(xi−yi)p)1pd(x,y)=(
and prevents overfitting.
classification but can also be ∑i=1n(xi−yi)p)p1
2. Boosting: Models are trained
used for regression tasks. It From the formula above, when
one after another. Each new
works by finding the "k" closest p=2, it becomes the same as the
model focuses on fixing the
data points (neighbors) to a Euclidean distance formula and
errors made by the previous
given input and makes a when p=1, it turns into the
ones. The final prediction is a
predictions based on the Manhattan distance formula.
weighted combination of all
majority class (for classification) models, which helps reduce
or the average value (for Relationship Between KNN
bias and improve accuracy.
regression). Decision Boundaries and 3. Stacking (Stacked
Voronoi Diagrams Generalization): Multiple
What is 'K' in K Nearest In two-dimensional space the different models (often of
Neighbour? decision boundaries of KNN can different types) are trained
In the k-Nearest Neighbours be visualized as Voronoi and their predictions are used
algorithm k is just a number that diagrams. Here’s how: as inputs to a final model,
tells the algorithm how many  KNN Boundaries: The called a meta-model. The
nearby points or neighbors to decision boundary for KNN is meta-model learns how to
look at when it makes a determined by regions where best combine the predictions
decision. the classification changes of the base models, aiming for
Example: Imagine you're based on the nearest better performance than any
deciding which fruit it is based neighbors. K approaches individual model.
on its shape and size. You infinity, these boundaries 1. Bagging Algorithm
Bagging classifier can be used predicted correctly. One of the reduce variance, while
for both regression and most well-known boosting boosting reduces bias leading
classification tasks. Here is an algorithms is AdaBoost to better overall performance.
overview of Bagging classifier (Adaptive Boosting). Here is an
algorithm: overview of Boosting algorithm: Clustering in Machine
 Bootstrap Sampling: Divides  Initialize Model Weights: Learning
the original training data into Begin with a single weak Clustering is an unsupervised
‘N’ subsets and randomly learner and assign equal machine learning technique that
selects a subset with weights to all training groups similar data points
replacement in some rows examples. together into clusters based on
from other subsets. This step  Train Weak Learner: Train their characteristics, without
ensures that the base models weak learners on these using any labeled data. The
are trained on diverse subsets dataset. objective is to ensure that data
of the data and there is no  Sequential Learning: points within the same cluster
class imbalance. Boosting works by training are more similar to each other
 Base Model Training: For models sequentially where than to those in different
each bootstrapped sample we each model focuses on clusters, enabling the discovery
train a base model correcting the errors of its of natural groupings and hidden
independently on that subset predecessor. Boosting patterns in complex datasets.
of data. These weak models typically uses a single type of  Goal: Discover the natural
are trained in parallel to weak learner like decision grouping or structure in
increase computational trees. unlabeled data without
efficiency and reduce time  Weight Adjustment: predefined categories.
consumption. We can use Boosting assigns weights to  How: Data points are
different base learners i.e. training datapoints. assigned to clusters based on
different ML models as base Misclassified examples similarity or distance
learners to bring variety and receive higher weights in the measures.
robustness. next iteration so that next  Similarity Measures: Can
 Prediction Aggregation: To models pay more attention to include Euclidean distance,
make a prediction on testing them. cosine similarity or other
data combine the predictions Benefits of Ensemble metrics depending on data
of all base models. For Learning in Machine Learning type and clustering method.
classification tasks it can Ensemble learning is a versatile  Output: Each group is
include majority voting or approach that can be applied to
assigned a cluster ID,
weighted majority while for machine learning model for:
representing shared
regression it involves  Reduction in Overfitting: By
characteristics within the
averaging the predictions. aggregating predictions of cluster.
 Out-of-Bag (OOB) multiple model's ensembles Types of Clustering
Evaluation: Some samples can reduce overfitting that Let's see the types of clustering,
are excluded from the training individual complex models 1. Hard Clustering: In hard
subset of particular base might exhibit. clustering, each data point
models during the  Improved Generalization: It strictly belongs to exactly one
bootstrapping method. These generalizes better to unseen cluster, no overlap is allowed.
“out-of-bag” samples can be data by minimizing variance This approach assigns a clear
used to estimate the model’s and bias. membership, making it easier to
performance without the need  Increased Accuracy: interpret and use for definitive
for cross-validation. Combining multiple models segmentation tasks.
 Final Prediction: After gives higher predictive  Example: If clustering
aggregating the predictions accuracy. customer data into 2
from all the base models,  Robustness to Noise: It segments, each customer
Bagging produces a final mitigates the effect of noisy or belongs fully to either Cluster
prediction for each instance. incorrect data points by 1 or Cluster 2 without partial
averaging out predictions memberships.
2. Boosting Algorithm from diverse models.  Use cases: Market
Boosting is an ensemble  Flexibility: It can work with segmentation, customer
technique that combines diverse models including grouping, document
multiple weak learners to create decision trees, neural clustering.
a strong learner. Weak models networks and support vector  Limitations: Cannot
are trained in series such that machines making them highly represent ambiguity or
each next model tries to correct adaptable. overlap between groups;
errors of the previous model  Bias-Variance Tradeoff: boundaries are crisp.
until the entire training dataset is Techniques like bagging
2. Soft Clustering: Soft clusters, though its effectiveness each data point to belong to
clustering assigns each data depends on chosen density multiple clusters with varying
point a probability or degree of parameters. degrees of membership. This
membership to multiple clusters Algorithms: approach captures ambiguity
simultaneously, allowing data  DBSCAN (Density-Based and soft boundaries in data and
points to partially belong to Spatial Clustering of is particularly useful when the
several groups. Applications with clusters overlap or boundaries
 Example: A data point may Noise): Groups points with are not clear-cut.
have a 70% membership in sufficient neighbors; labels Algorithm:
Cluster 1 and 30% in Cluster sparse points as noise.  Fuzzy C-Means: Similar to K-
2, reflecting uncertainty or  OPTICS (Ordering Points To means but with fuzzy
overlap in group Identify Clustering memberships updated
characteristics. Structure): Extends DBSCAN iteratively.
 Use cases: Situations with to handle varying densities. Use Cases
overlapping class boundaries, 3. Connectivity-based  Customer Segmentation:
fuzzy categories like customer Clustering (Hierarchical Grouping customers based on
personas or medical Clustering) behavior or demographics for
diagnosis. targeted marketing and
Connectivity-based (or
 Benefits: Captures ambiguity
hierarchical) clustering builds
personalized services.
in data, models gradual nested groupings of data by
 Anomaly Detection:
transitions between clusters. evaluating how data points are Identifying outliers or
Types of Clustering Methods connected to their neighbors. It fraudulent activities in finance,
Clustering methods can be creates a dendrogram—a tree- network security and sensor
classified on the basis of how like structure—that reflects data.
they for clusters, relationships at various  Image Segmentation:
1. Centroid-based Clustering granularity levels and does not Dividing images into
(Partitioning Methods) require specifying cluster meaningful parts for object
Centroid-based clustering numbers in advance, but can be detection, medical diagnostics
organizes data points around computationally intensive. or computer vision tasks.
central prototypes called Approaches:  Recommendation Systems:
centroids, where each cluster is  Agglomerative (Bottom-up): Clustering user preferences to
represented by the mean (or Start with each point as a recommend movies, products
medoid) of its members. The cluster; iteratively merge or content tailored to different
number of clusters is specified closest clusters. groups.
in advance and the algorithm  Divisive (Top-down): Start  Market Basket Analysis:
allocates points to the nearest with one cluster; iteratively Discovering products
centroid, making this technique split into smaller clusters. frequently bought together to
efficient for spherical and 4. Distribution-based optimize store layouts and
similarly sized clusters but promotions.
Clustering
sensitive to outliers and
Distribution-based clustering
initialization. REINFORCE Algorithm
assumes data is generated from
Algorithms: REINFORCE is a method used
a mixture of probability
 K-means: Iteratively assigns
distributions, such as Gaussian in reinforcement learning to
points to nearest centroid and improve how decisions are
distributions and assigns points
recalculates centroids to made. It learns by trying actions
to clusters based on statistical
minimize intra-cluster and then adjusting the chances
likelihood. This method supports
variance. of those actions based on the
clusters with flexible shapes and
 K-medoids: Similar to K-
overlaps, but usually requires total reward received afterward.
means but uses actual data specifying the number of How REINFORCE Works
points (medoids) as centers, distributions. The REINFORCE algorithm
robust to outliers. Algorithm: works in the following steps:
2. Density-based Clustering  Gaussian Mixture Model  Collect Episodes: The agent
(Model-based Methods) (GMM): Fits data as a interacts with the environment
Density-based clustering defines weighted mixture of Gaussian for a fixed number of steps or
clusters as contiguous regions distributions; assigns data until an episode is complete,
of high data density separated points based on likelihood. following the current policy.
by areas of lower density. This This generates a trajectory
approach can identify clusters of consisting of states, actions
arbitrary shapes, handles noise 5. Fuzzy Clustering and rewards.
well and does not require  Calculate Returns: For each
Fuzzy clustering extends
predefining the number of time step tt, calculate the
traditional methods by allowing
return GtGt which is the total trajectory and the return GtGt games or board games like
reward obtained from can fluctuate significantly, chess. The player learns by
time tt onwards. Typically, making the learning process playing the game many times
this is the discounted sum of noisy and slow. and figure out what moves led
rewards:  Sample Inefficiency: Since to win.
Gt=∑k=tTγk−tGt=∑k=tTγk−t REINFORCE requires  Self-driving
Where γγ is the discount complete episodes to update cars: REINFORCE can help
factor, TT is the final time step the policy, it tends to be improve how self-driving cars
of the episode and RkRk is the sample-inefficient. The agent decide to drive safely and
reward received at time step kk. may have to spend a lot of efficiently by rewarding good
 Policy Gradient Update: The time trying things out before it driving decisions.
policy parameters θθ are gets helpful feedback to learn
updated using the following from. Principal Component
formula:  Convergence Issues: Analysis(PCA)
θt+1=θt+α∇θlog⁡πθ(at∣st)Gtθt+1 Because the results can be PCA (Principal Component
=θt+α∇θlogπθ(at∣st)Gt very random and learning is Analysis) is a dimensionality
Where: slow REINFORCE needs a lot reduction technique used in
αα is the learning rate. of practice before it learns a data analysis and machine
πθ(at∣st)πθ(at∣st) is the good way to act. learning. It helps you to reduce
probability of taking action atat Variants of REINFORCE the number of features in a
at state stst, according to the Several modifications to the dataset while keeping the
policy. original REINFORCE algorithm most important information. It
GtGt is the return or have been proposed to address changes your original features
its high variance: into new features these new
cumulative reward obtained
 Baseline: By subtracting a
The gradient ∇θlog⁡πθ(at∣st)∇θ
from time step tt onwards. features don’t overlap with each
baseline value (typically the other and the first few keep most
value function V(s)V(s)) from of the important differences
logπθ(at∣st) represents how
the return GtGt, the variance found in the original data.
much the policy probability for
of the gradient estimate can Step 1: Standardize the Data
action atat at state stst should
be reduced without affecting
be adjusted based on the Different features may have
the expected gradient. This
obtained return. different units and scales like
results in a variant known as
 Repeat: This process is REINFORCE with a baseline.
salary vs. age. To compare
repeated for several them fairly PCA
The update rule becomes:
episodes, iteratively updating first standardizes the data by
θt+1=θt+α∇θlog⁡πθ(at∣st) making each feature have:
the policy in the direction of
(Gt−bt)θt+1=θt+α∇θlogπθ(at∣st  A mean of 0
higher rewards.
)(Gt−bt)  A standard deviation of 1
Advantages of REINFORCE
Where btbt is the baseline
 Easy to Z=X−μσZ=σX−μ
such as the expected reward where:
Understand: REINFORCE is
from state stst.  μμ is the mean of
simple and easy to use and a
good way to start learning  Actor-Critic: It is a method independent features
that use two parts to learn
⋯,μm}
about how to improve μ={μ1,μ2,⋯,μm}μ={μ1,μ2,
decision in reinforcement better: the actor and the critic.
learning. The actor chooses what
action to take while  σσ is the standard deviation of
 Directly Improves independent features
Decisions: It works by the critic checks how good

⋯,σm}
that action was and give σ={σ1,σ2,⋯,σm}σ={σ1,σ2,
directly improving the way
actions are chosen which is feedback. This helps to make
learning more stable and Step 2: Calculate Covariance
helpful when there are many
faster by reducing random Matrix
possible actions or choices.
 Good for Tasks with Clear mistakes. Next PCA calculates
Applications of REINFORCE the covariance matrix to see
Endings: It works well when
REINFORCE has been applied how features relate to each
tasks have a clear finish and
in several domains: other whether they increase or
the agent gets a total reward
at the end.
 Robotics: REINFORCE helps decrease together. The
Challenges of REINFORCE robots to learn how to do covariance between two
 High Variance: One of the things like picking up objects features x1x1 and x2x2 is:
or moving around. The robot cov(x1,x2)=∑i=1n(x1i−x1ˉ)
major issues with
try different actions and learn (x2i−x2ˉ)n−1cov(x1,x2)=n−1∑i=
REINFORCE is its high
from what works well or not. 1n(x1i−x1ˉ)(x2i−x2ˉ)
variance. The gradient
estimate is based on a single
 Game AI: It is used to teach Where:
game players like in video
 xˉ1andxˉ2xˉ1andxˉ2 are the 2. Noise Reduction: Eliminates
mean values of components with low variance
features x1andx2x1andx2 enhance data clarity.
 nn is the number of data 3. Data
points Compression: Represents
The value of covariance can be data with fewer components
positive, negative or zeros. reduce storage needs and
Step 3: Find the Principal speeding up processing.
Components 4. Outlier Detection: Identifies
unusual data points by
PCA identifies new axes where
showing which ones deviate
the data spreads out the most:
significantly in the reduced
 1st Principal Component
space.
(PC1): The direction of
Disadvantages of Principal
maximum variance (most
Component Analysis
spread).
1. Interpretation
 2nd Principal Component
Challenges: The new
(PC2): The next best
components are combinations
direction, perpendicular to
of original variables which can
PC1 and so on.
be hard to explain.
These directions come from
2. Data Scaling
the eigenvectors of the
Sensitivity: Requires proper
covariance matrix and their
scaling of data before
importance is measured
application or results may be
by eigenvalues. For a square
misleading.
matrix A an eigenvector X (a
3. Information Loss: Reducing
non-zero vector) and its
dimensions may lose some
corresponding eigenvalue λ
important information if too
satisfy:
few components are kept.
AX=λXAX=λX
4. Assumption of
This means:
Linearity: Works best when
 When A acts on X it only relationships between
stretches or shrinks X by the variables are linear and may
scalar λ. struggle with non-linear data.
 The direction of X remains 5. Computational
unchanged hence Complexity: Can be slow and
eigenvectors define "stable resource-intensive on very
directions" of A. large datasets.
Eigenvalues help rank these 6. Risk of Overfitting: Using
directions by importance. too many components or
Step 4: Pick the Top working with a small dataset
Directions & Transform Data might lead to models that
After calculating the eigenvalues don't generalize well.
and eigenvectors PCA ranks
them by the amount of
information they capture. We
then:
1. Select the top k
components hat capture
most of the variance like 95%.
2. Transform the original
dataset by projecting it onto
these top components.
Advantages of Principal
Component Analysis
1. Multicollinearity
Handling: Creates new,
uncorrelated variables to
address issues when original
features are highly correlated.

You might also like