Machine Learning GATE Notes
Machine Learning Study Material
by Piyush Wairale
Instructions:
• Kindly go through the lectures/videos on our website www.piyushwairale.com
• Read this study material carefully and make your own handwritten short notes (short notes should not be more than 5-6 pages).
• Revise this material at least 5 times, and once you have prepared your short notes, revise them twice a week.
• If you are not able to understand any topic, need a more detailed explanation, or find any typos or mistakes in the study material, mail me at piyushwairale100@gmail.com
Contents
1 Introduction to Machine Learning
1.1 Introduction to Data in Machine Learning
1.2 Different types of learning
2 Supervised Learning
3 Regression
3.0.1 Linear Regression
3.0.2 Multivariate Linear Regression
3.0.3 Ridge Regression
4 Logistic Regression
4.1 Logistic Function (Sigmoid Function)
4.2 Types of Logistic Regression
5 K-Nearest Neighbors
5.1 Working
5.2 K-Nearest Neighbors (KNN) Classification Example
5.3 Advantages of K-Nearest Neighbors (KNN)
5.4 Disadvantages of K-Nearest Neighbors (KNN)
6 Naive Bayes Classifier
7 Decision Trees
7.1 Terminologies
7.2 Measures of impurity
7.3 Advantages of Decision Trees
7.4 Disadvantages of Decision Trees
8 Support Vector Machine (SVM)
8.1 Kernels
8.2 Classification Margin
8.3 Types of SVM
8.4 Advantages of Support Vector Machines
9 Bias-Variance Trade-Off
9.1 Underfitting
9.2 Overfitting
9.3 Bias-variance trade-off
10 Cross-validation methods
10.1 K-Fold Cross-Validation
10.2 Leave-One-Out Cross-Validation
11 Feedforward Neural Network (FNN)
12 Multi-Layer Perceptron
12.1 Architecture
12.2 Backpropagation
12.3 Hyperparameter Tuning
13 Intro to Unsupervised Learning
13.1 Working
14 K-Means Clustering
14.1 Methods to Determine the Optimal Number of Clusters (K) in K-means Clustering
14.2 Comparison of Elbow and Silhouette Methods
14.3 K-medoids algorithm
15 Hierarchical clustering
15.1 Hierarchical Clustering Methods
15.1.1 Dendrogram
15.1.2 Agglomerative Hierarchical Clustering (Bottom-Up)
15.1.3 Divisive Hierarchical Clustering (Top-Down)
15.2 Measures of dissimilarity
15.2.1 Measures of distance between data points
15.3 Single linkage
15.4 Multiple Linkage
16 Dimensionality Reduction
16.1 Principal Component Analysis (PCA)
16.2 Linear Discriminant Analysis (LDA)
1 Introduction to Machine Learning
• Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. ML is one of the most exciting technologies one comes across. As the name suggests, it gives the computer an ability that makes it more similar to humans: the ability to learn. Machine learning is actively used today, perhaps in many more places than one would expect.
• The field of study known as machine learning is concerned with the question of how to
construct computer programs that automatically improve with experience
Definition of Learning
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks T, as measured by P, improves with experience
E.
A computer program which learns from experience is called a machine learning program or simply
a learning program. Such a program is sometimes also referred to as a learner.
1.1 Introduction to Data in Machine Learning
• Data is a crucial component in the field of Machine Learning. It refers to the set of obser-
vations or measurements that can be used to train a machine-learning model.
• The quality and quantity of data available for training and testing play a significant role in
determining the performance of a machine-learning model.
• Data can be in various forms such as numerical, categorical, or time-series data, and can
come from various sources such as databases, spreadsheets, or APIs.
• Machine learning algorithms use data to learn patterns and relationships between input
variables and target outputs, which can then be used for prediction or classification tasks.
Understanding data
Since an important component of the machine learning process is data storage, we briefly consider
in this section the different types and forms of data that are encountered in the machine learning
process.
Unit of observation
By a unit of observation we mean the smallest entity with measured properties of interest for a
study.
Examples
• A time point
• A geographic region
• A measurement
Figure 1: Example of “examples” and “features” collected in a matrix format (data relates to automobiles and their features)
Type of Data
Data is typically divided into two types:
1. Labeled data
2. Unlabeled data
Labeled data includes a label or target variable that the model is trying to predict, whereas
unlabeled data does not include a label or target variable. The data used in machine learning
is typically numerical or categorical. Numerical data includes values that can be ordered and
measured, such as age or income. Categorical data includes values that represent categories, such
as gender or type of fruit.
• The training set is used to train the model, and the testing set is used to evaluate the
performance of the model.
• It is important to ensure that the data is split in a random and representative way.
• Data preprocessing is an important step in the machine learning pipeline. This step can
include cleaning and normalizing the data, handling missing values, and feature selection or
engineering.
1. Training Data: The part of the data used to train the model. This is the data the model actually sees (both input and output) and learns from.
2. Validation Data: The part of the data used for frequent evaluation of the model as it is fit on the training dataset, and for tuning hyperparameters (parameters set before the model begins learning). This data plays its part while the model is being trained.
3. Testing Data: Once the model is completely trained, testing data provides an unbiased evaluation. We feed in the inputs of the testing data and the model predicts values without seeing the actual outputs. After prediction, we evaluate the model by comparing the predictions with the actual outputs present in the testing data. This is how we measure how much the model has learned from the experience fed in as training data.
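A minimal sketch of this kind of split, assuming scikit-learn's train_test_split is available; the toy arrays and the 80/10/10 proportions are illustrative assumptions, not taken from the notes:

# Hedged sketch: splitting a dataset into training, validation and test sets.
# The toy data and the 80/10/10 split are arbitrary choices for illustration.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # toy feature matrix (50 examples, 2 features)
y = np.arange(50) % 2               # toy binary labels

# First carve out the test set, then split the remainder into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=1/9, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 40 / 5 / 5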
1. Numeric data: A numeric variable takes values that can be measured and ordered, such as age, income, price or mileage.
2. Nominal (categorical) data: A nominal variable takes values that represent unordered categories, such as gender, colour or type of fruit.
3. Ordinal data: This denotes a nominal variable with categories falling in an ordered list. Examples include clothing sizes such as small, medium, and large, or a measurement of customer satisfaction on a scale from “not at all happy” to “very happy.”
Examples: In the data given in Fig. 1, the features “year”, “price” and “mileage” are numeric and the features “model”, “color” and “transmission” are categorical.
Properties of Data
• Volume: The scale of data. With a growing world population and widespread technology, huge amounts of data are generated every millisecond.
• Viability: The ability of data to be used and integrated into different systems and processes.
• Security: The measures taken to protect data from unauthorized access or manipulation.
• Accessibility: The ease of obtaining and utilizing data for decision-making purposes.
• Integrity: The accuracy and completeness of data over its entire lifecycle.
1.2 Different types of learning
1. Supervised learning
• Supervised learning is the machine learning task of learning a function that maps an
input to an output based on example input-output pairs.
• In supervised learning, each example in the training set is a pair consisting of an input
object (typically a vector) and an output value.
• A supervised learning algorithm analyzes the training data and produces a function,
which can be used for mapping new examples.
• In the optimal case, the function will correctly determine the class labels for unseen
instances.
• Both classification and regression problems are supervised learning problems.
• A wide range of supervised learning algorithms are available, each with its strengths
and weaknesses. There is no single learning algorithm that works best on all supervised
learning problems.
• Important Point: It is called “supervised learning” because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers (that is, the correct outputs); the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.
Example:
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients and each patient is labeled as “healthy” or “sick”.
gender age label
M 48 sick
M 67 sick
F 53 healthy
M 49 healthy
F 34 sick
M 21 healthy
Based on this data, when a new patient enters the clinic, how can one predict whether he/she is healthy or sick?
2. Unsupervised learning
Example Consider the following data regarding patients entering a clinic. The data consists
of the gender and age of the patients.
gender age
M 48
M 67
F 53
M 49
F 34
M 21
Based on this data, can we infer anything regarding the patients entering the clinic?
3. Reinforcement learning
In reinforcement learning, an agent learns by interacting with an environment: it takes actions, receives rewards or penalties as feedback, and updates its behaviour so as to maximize the cumulative reward. No labelled input-output pairs are provided; the agent discovers good actions by trial and error.
2 Supervised Learning
• Supervised learning is a machine learning paradigm where algorithms aim to optimize parameters to minimize the difference between target and computed outputs, commonly used in tasks like classification and regression.
• In supervised learning, training examples are associated with target outputs (initially la-
beled) and computed outputs (generated by the learning algorithm), and the goal is to
minimize misclassification or error.
• The learning process in supervised learning involves initializing parameters randomly, com-
puting output values for labeled examples, and updating parameters iteratively to minimize
errors or achieve convergence.
3 Regression
• The value of the output variable may be a number, such as an integer or a floating point value. These are often quantities, such as amounts and sizes. The input variables may be discrete or real-valued.
• Regression algorithms are used if there is a relationship between the input variable and the output variable.
• It is used for the prediction of continuous variables, such as weather forecasting, market trends, etc.
General Approach
Let x denote the set of input variables and y the output variable. In machine learning, the general
approach to regression is to assume a model, that is, some mathematical relation between x and
y, involving some parameters say, θ, in the following form:
y = f (x, θ)
The function f (x, θ) is called the regression function. The machine learning algorithm optimizes
the parameters in the set θ such that the approximation error is minimized; that is, the estimates
of the values of the dependent variable y are as close as possible to the correct values given in the
training set.
Example
For example, if the input variables are “Age”, “Distance” and “Weight” and the output variable is “Price”, the model may be
Price = a0 + a1 × Age + a2 × Distance + a3 × Weight
where x = (Age, Distance, Weight) denotes the set of input variables and θ = (a0, a1, a2, a3) denotes the set of parameters of the model.
Various types of regression techniques
These techniques mostly differ in three aspects, namely, the number and type of independent variables, the type of dependent variables and the shape of the regression line. Some of these are listed below.
1. Simple linear regression: There is only one continuous independent variable x and the as-
sumed relation between the independent variable and the dependent variable y is
y = a + bx
2. Multivariate linear regression: There is more than one independent variable, say x1, ..., xn, and the assumed relation between the independent variables and the dependent variable is
y = a0 + a1 x1 + ... + an xn
3. Polynomial regression: There is only one continuous independent variable x and the assumed
model is y = a0 + a1 x + .... + an xn
It is a variant of the multiple linear regression model, except that the best fit line is curved
rather than straight.
4. Ridge regression: Ridge regression is one of the types of linear regression in which a small
amount of bias is introduced so that we can get better long-term predictions.
Ridge regression is a regularization technique, which is used to reduce the complexity of the model. It is also called L2 regularization.
5. Logistic regression: The dependent variable is binary, that is, a variable which takes only
the values 0 and 1. The assumed model involves certain probability distributions.
3.0.1 Linear Regression
Simple linear regression is a basic machine learning technique used for modeling the relationship
between a single independent variable (often denoted as ”x”) and a dependent variable (often
denoted as ”y”). It assumes a linear relationship between the variables and aims to find the best-
fitting line (typically represented by the equation y = mx + b) that minimizes the sum of squared
differences between the observed data points and the values predicted by the model.
It can be shown that the values of a and b can be computed using the following formulas:
b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²,  a = ȳ − b x̄
where x̄ and ȳ denote the means of the x-values and y-values in the training set.
Advantages of Simple Linear Regression: it is simple to implement, computationally cheap, and its coefficients are easy to interpret.
Example
Obtain a linear regression for the data in the table below, assuming that y is the dependent variable.
Therefore, the linear regression model for the data is y = 0.785 + 0.425x
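A short sketch of simple linear regression with NumPy, using the least-squares formulas above; the small x and y arrays are made-up illustration data, not the table from the worked example:

# Hedged sketch: fit y = a + b*x by ordinary least squares with NumPy.
# The data below is illustrative only; it is not the table from the worked example.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 2.8, 4.1, 5.3])

x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
a = y_bar - b * x_bar                                             # intercept

print(f"model: y = {a:.3f} + {b:.3f} x")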
3.0.2 Multivariate Linear Regression
Multiple Linear Regression is a machine learning technique used to model the relationship be-
tween a dependent variable (target) and multiple independent variables (features) by fitting a
linear equation to the data.
The model can be expressed as: y = β0 + β1 x1 + .... + βN xN
Where:
y is the dependent variable (the one you want to predict).
x1 , x2 , ..., xn are the independent variables.
β0 is the y-intercept.
β1 , β2 , ...βn are the coefficients associated with each independent variable.
As in simple linear regression, here also we use the ordinary least squares method to obtain the optimal estimates of β0, β1, ..., βn. Writing the training examples as a matrix X (with a leading column of ones) and the outputs as a vector y, the method leads to the normal equations, whose solution is
β̂ = (XᵀX)⁻¹ Xᵀ y
Advantages:
• The multivariate regression method helps you find a relationship between multiple variables or features.
• It also defines the correlation between independent variables and dependent variables.
Disadvantages:
• It is complex.
• Multivariate regression yields better results when used with larger datasets rather than small
ones.
Example:
Fit a multiple linear regression model to the following data:
In this problem, there are two independent variables and four sets of values of the variables.
Thus, in the notations used above, we have n = 2 and N = 4. The multiple linear regression model
for this problem has the form y = β0 + β1 x1 + β2 x2
The required model is y = 2.0625 − 2.3750x1 + 3.2500x2
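A minimal sketch of this computation with NumPy's least-squares solver; the four rows of data below are placeholders, not the table from the example above:

# Hedged sketch: multiple linear regression via least squares.
# X1, X2 and y below are made-up values, not the data from the worked example.
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0])
y  = np.array([5.0, 4.0, 11.0, 10.0])

# Design matrix with a leading column of ones for the intercept beta0.
X = np.column_stack([np.ones_like(X1), X1, X2])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("beta0, beta1, beta2 =", beta)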
3.0.3 Ridge Regression
(Important for GATE DA)
Ridge Regression, also known as L2 regularization, is a machine learning technique used to mitigate the issues of multicollinearity and overfitting in Multiple Linear Regression. It adds a regularization term to the regression equation, modifying the loss function, and aims to minimize the sum of squared errors along with the sum of the squared coefficients, which can be expressed as:
Loss = Σ (yi − ŷi)² + λ Σ βj²
• Ridge regression is applied when the data exhibits multicollinearity, that is, when the independent variables are highly correlated. While least squares estimates are unbiased under multicollinearity, their variances are large enough to cause the estimated values to diverge far from the actual values. Ridge regression reduces these standard errors by introducing a small bias into the regression estimates.
• The lambda (λ) variable in the ridge regression equation resolves the multicollinearity prob-
lem.
• Lambda (λ) is the penalty term. So, by changing the values of (λ), we are controlling the
penalty term. The higher the values of (λ), the bigger is the penalty and therefore the
magnitude of coefficients is reduced.
• In this technique, the cost function is altered by adding the penalty term to it. The amount of bias added to the model is called the Ridge Regression penalty. We can calculate it by multiplying lambda by the squared weight of each individual feature.
• In the above equation, the penalty term regularizes the coefficients of the model, and hence
ridge regression reduces the amplitudes of the coefficients that decreases the complexity of
the model.
• As we can see from the above equation, if the values of (λ) tend to zero, the equation becomes
the cost function of the linear regression model. Hence, for the minimum value of (λ) , the
model will resemble the linear regression model.
• A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
• Linear Relationship: Ridge Regression assumes that there is a linear relationship between
the independent variables and the dependent variable.
• Homoscedasticity: Ridge Regression assumes that the variance of the errors is constant
across all levels of the independent variables.
• Independence of errors: Ridge Regression assumes that the errors are independent of
each other, i.e., the errors are not correlated.
• Normality of errors: Ridge Regression assumes that the errors follow a normal distribu-
tion.
• Regularization: Ridge regression adds a penalty term that discourages the magnitude of
the coefficients from becoming too large. This helps prevent overfitting.
• Multicollinearity Mitigation: It’s particularly effective when you have highly correlated
independent variables (multicollinearity) by shrinking the coefficients and making the model
more stable.
• Lambda Parameter: The choice of the lambda parameter (λ) is essential. A small λ is
close to standard linear regression, while a large λ results in stronger regularization.
• Balancing Act: Ridge regression performs a balancing act between fitting the data well
and preventing overfitting. It maintains all predictors but assigns smaller coefficients to less
important ones.
• Model Stability: It makes the model more stable, especially when you have a high-
dimensional dataset with many predictors. This can lead to better generalization to new,
unseen data.
• Interpretability: Ridge regression may make the model less interpretable because it shrinks
coefficients toward zero. It can be challenging to discern the individual importance of pre-
dictors.
• Tuning Lambda: Cross-validation is often used to tune the lambda parameter and find
the optimal trade-off between fitting the data and regularization.
• Ineffective for Feature Selection: Ridge regression does not perform feature selection.
It retains all predictors in the model but assigns smaller coefficients to less important ones.
• Less Effective for Sparse Data: In cases where many predictors are irrelevant or unim-
portant, Ridge may not eliminate them from the model effectively.
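As a rough illustration of the closed-form solution implied by the penalized loss above, here is a small NumPy sketch; the data and the λ value are arbitrary assumptions, and in practice λ would be tuned by cross-validation:

# Hedged sketch: ridge regression via its closed-form solution
#   beta = (X^T X + lambda * I)^(-1) X^T y
# The data and lambda below are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X = np.column_stack([X[:, 0], X[:, 0] + 0.01 * rng.normal(size=20), X[:, 2]])  # two nearly collinear columns
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=20)

lam = 1.0
I = np.eye(X.shape[1])
beta_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
beta_ols   = np.linalg.lstsq(X, y, rcond=None)[0]

print("OLS coefficients:  ", beta_ols)    # can be large/unstable under collinearity
print("Ridge coefficients:", beta_ridge)  # shrunk towards zero, more stable

The shrinkage of the coefficients towards zero as λ grows is exactly the "penalty term" behaviour described above.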
What is Regularization?
• Sometimes a machine learning model performs well with the training data but does not perform well with the test data: the model cannot predict the output for unseen data because it has also fitted the noise in the training output. Such a model is called overfitted. This problem can be dealt with using a regularization technique.
• This technique allows us to keep all the variables or features in the model while reducing the magnitude of their coefficients. Hence, it maintains accuracy as well as the generalization of the model.
• It mainly regularizes or reduces the coefficients of features toward zero. In simple words, in the regularization technique we reduce the magnitude of the features' coefficients while keeping the same number of features.
What is Shrinkage?
• Shrinkage refers to the process of shrinking the estimated regression coefficients towards
zero. This is done by adding a penalty term to the sum of squared residuals in the regression
equation, which is called the regularization term.
• The regularization term is proportional to the square of the magnitude of the regression
coefficients, and it is controlled by a tuning parameter, usually denoted as λ. The higher
the value of λ, the more the coefficients are shrunk towards zero.
• Shrinkage helps to reduce the variance of the estimates and can improve the prediction
accuracy of the model.
What is Multicollinearity?
• Multicollinearity basically happens when two or more predictor variables have substantial correlations with one another.
• Multicollinearity can be introduced by combining multiple data sources, by constraints placed on linear or demographic models, by an overly precise model, by outliers, or by choices made in model design or during data collection.
• Multicollinearity may be introduced during the data collection process if the data were gathered using an inappropriate sampling method. It can occur even when the sample size is smaller than expected.
• If the model is overspecified, that is, there are more variables than data points, multicollinearity will be visible.
4 Logistic Regression
• Logistic Regression is a machine learning algorithm used for binary classification tasks,
modeling the probability of an event occurring or not, by fitting a logistic curve to the data.
It’s expressed as the logistic function, which maps the linear combination of input features
to values between 0 and 1.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. However, instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
• Logistic Regression is similar to Linear Regression except in how it is used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic
function, which predicts two maximum values (0 or 1). The curve from the logistic function
indicates the likelihood of something such as whether the cells are cancerous or not, a mouse
is obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
• Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification.
• It is used for predicting the categorical dependent variable using a given set of independent
variables.
4.1 Logistic Function (Sigmoid Function)
• The sigmoid function is a mathematical function used to map the predicted values to prob-
abilities. It maps any real value into another value within a range of 0 and 1. o The value
of the logistic regression must be between 0 and 1, which cannot go beyond this limit, so it
forms a curve like the “S” form.
• The S-form curve is called the Sigmoid function or the logistic function.
• In logistic regression, we use the concept of the threshold value, which defines the probability
of either 0 or 1. Such as values above the threshold value tends to 1, and a value below the
threshold values tends to 0.
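A small sketch of the sigmoid and the thresholding step described above; the weights, bias and the threshold of 0.5 are illustrative assumptions:

# Hedged sketch: sigmoid function and threshold-based classification.
# The weights, bias and 0.5 threshold are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.5, -2.0])   # assumed learned coefficients
b = 0.25                    # assumed learned intercept

x_new = np.array([0.8, 0.1])
p = sigmoid(np.dot(w, x_new) + b)   # probability of class 1
label = 1 if p >= 0.5 else 0        # threshold at 0.5

print(f"p = {p:.3f}, predicted class = {label}")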
4.2 Types of Logistic Regression
• Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as “cat”, “dog” or “sheep”.
• Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as “low”, “medium”, or “high”.
• Logistic Function: Utilizes the logistic (sigmoid) function to convert a linear combination
of input features into a probability value between 0 and 1.
• Coefficient Interpretation: Coefficients represent the impact of each feature on the prob-
ability of the event, making the model interpretable.
5 K-Nearest Neighbors
• The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.
• K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.
• KNN tries to predict the correct class for the test data by calculating the distance between the test data and all the training points. It then selects the K points that are closest to the test data. The KNN algorithm calculates the probability of the test data belonging to each of the classes of the ‘K’ training points, and the class with the highest probability is selected. In the case of regression, the predicted value is the mean of the ‘K’ selected training points.
Voronoi diagram (the case K = 1)
Given a set of data points, a Voronoi diagram describes the region that is nearest to each point. Each line segment of the diagram is equidistant between two points of opposite classes.
Choosing the Value of K
• A larger k may lead to better performance, but if we set k too large we may end up looking at samples that are not true neighbours (that are far away from the query).
5.1 Working
The K-NN working can be explained on the basis of the below algorithm:
1. Select the number K of neighbors.
2. Calculate the distance (for example, the Euclidean distance) between the new data point and all the training points.
3. Take the K nearest neighbors according to the calculated distance.
4. Among these K neighbors, count the number of data points in each category.
5. Assign the new data point to the category for which the number of neighbors is maximum.
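A compact sketch of these steps with plain NumPy; the toy points, labels and the choice K = 3 are illustrative assumptions:

# Hedged sketch: K-nearest-neighbour classification following the steps above.
# The training points, labels and K=3 are illustrative assumptions.
import numpy as np
from collections import Counter

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9], [0.9, 1.1]])
y_train = np.array(["A", "A", "B", "B", "A"])
K = 3

def knn_predict(x_new):
    dists = np.linalg.norm(X_train - x_new, axis=1)   # step 2: Euclidean distances
    nearest = np.argsort(dists)[:K]                   # step 3: K nearest neighbours
    votes = Counter(y_train[nearest])                 # step 4: count per category
    return votes.most_common(1)[0][0]                 # step 5: majority category

print(knn_predict(np.array([1.1, 1.0])))   # expected "A"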
• Adaptability: KNN can be used for both classification and regression tasks, and it can
handle multi-class problems without modification.
• Sensitivity to Feature Scaling: KNN is sensitive to the scale of features, so it’s essential
to normalize or standardize your data before applying the algorithm.
• Determining the Optimal K: Selecting the right value for K is crucial, and choosing an
inappropriate K can lead to underfitting or overfitting. There’s no universally optimal value,
and it often requires experimentation.
• Imbalanced Data: KNN can be biased towards the majority class in imbalanced datasets.
It’s essential to balance the dataset or adjust the class weights when necessary.
6 Naive Bayes Classifier
The Naive Bayes algorithm is a supervised learning algorithm, based on Bayes' theorem and used for solving classification problems. It is mainly used in text classification with high-dimensional training datasets. The Naive Bayes classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object. Some popular applications of the Naive Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
Bayes' theorem:
P(A|B) = P(B|A) · P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood: the probability of the evidence given that hypothesis A is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Assumption
The fundamental Naive Bayes assumption is that each feature makes an independent and equal
contribution to the outcome.
With relation to our dataset, this concept can be understood as:
We assume that no pair of features are dependent. For example, the temperature being ‘Hot’ has
nothing to do with the humidity or the outlook being ‘Rainy’ has no effect on the winds. Hence,
the features are assumed to be independent.
Secondly, each feature is given the same weight(or importance). For example, knowing only
temperature and humidity alone can’t predict the outcome accurately. None of the attributes is
irrelevant and assumed to be contributing equally to the outcome.
Here are the key concepts and characteristics of the Naive Bayes classifier:
• Bayes’ Theorem: The classifier is based on Bayes’ theorem, which calculates the probability of a hypothesis (in this case, a class label) given the evidence (features or attributes). Mathematically, it is expressed as P(class|evidence) = [P(evidence|class) × P(class)] / P(evidence).
• Independence Assumption: The ”Naive” in Naive Bayes refers to the assumption that
all features are independent of each other, given the class label. In reality, this assumption
is often not true, but the simplification makes the algorithm computationally efficient and
easy to implement.
1. Multinomial Naive Bayes: Typically used for text classification where features represent
word counts.
2. Gaussian Naive Bayes: Suitable for continuous data and assumes a Gaussian distribu-
tion of features.
3. Bernoulli Naive Bayes: Applicable when features are binary, such as presence or ab-
sence.
• Prior Probability (P(class)): This is the probability of a class occurring before observing
any evidence. It can be calculated from the training data.
• Classification: To classify a new data point, the classifier calculates the posterior proba-
bilities for each class and selects the class with the highest probability.
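A brief sketch of how a Gaussian Naive Bayes classifier can be used via scikit-learn; the toy feature matrix and labels are assumptions made up for illustration:

# Hedged sketch: Gaussian Naive Bayes with scikit-learn.
# The toy feature matrix and labels are made up for illustration.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 20.0], [1.2, 22.0], [3.5, 60.0], [3.7, 65.0]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB()
clf.fit(X, y)                              # estimates per-class priors and feature means/variances

print(clf.predict([[1.1, 21.0]]))          # most probable class
print(clf.predict_proba([[1.1, 21.0]]))    # posterior probabilities P(class | evidence)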
6.2 Advantages of Naive Bayes:
• Simplicity: Naive Bayes is easy to implement and understand, making it a good choice for
quick classification tasks.
• Efficiency: It can handle a large number of features efficiently, particularly in text classifi-
cation.
• Works Well with Small Datasets: It can perform reasonably well even with limited training
data.
• Interpretable: The results are easy to interpret, as the classifier provides the probability of belonging to each class.
Disadvantages of Naive Bayes:
• Sensitivity to Feature Distribution: It may not perform well when features have complex, non-Gaussian distributions.
• Requires Sufficient Data: For some cases, Naive Bayes might not perform well when there
is a scarcity of data.
• Zero Probability Problem: If a feature-class combination does not exist in the training data,
the probability will be zero, causing issues. Smoothing techniques are often used to address
this.
7 Decision Trees
• A decision tree is a simple model for supervised classification. It is used for classifying a
single discrete target feature.
• Each internal node performs a Boolean test on an input feature (in general, a test may have
more than two options, but these can be converted to a series of Boolean tests). The edges
are labeled with the values of that input feature.
• Classifying an example using a decision tree is very intuitive. We traverse down the tree,
evaluating each test and following the corresponding edge. When a leaf is reached, we return
the classification on that leaf.
• Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems.
• In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are
the output of those decisions and do not contain any further branches.
• The decisions or the test are performed on the basis of features of the given dataset. It is
a graphical representation for getting all the possible solutions to a problem/decision based
on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
• A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
7.1 Terminologies
• Root Node: A decision tree’s root node, which represents the original choice or feature
from which the tree branches, is the highest node.
• Internal Nodes (Decision Nodes): Nodes in the tree whose choices are determined by
the values of particular attributes. There are branches on these nodes that go to other nodes.
• Leaf Nodes (Terminal Nodes): The terminal nodes at the ends of branches, where the final decisions or predictions are made. Leaf nodes have no further branches.
• Branches (Edges): Links between nodes that show how decisions are made in response to
particular circumstances.
• Splitting: The process of dividing a node into two or more sub-nodes based on a decision
criterion. It involves selecting a feature and a threshold to create subsets of data.
• Parent Node: A node that is split into child nodes. The original node from which a split
originates.
• Child Node: Nodes created as a result of a split from a parent node.
• Decision Criterion: The rule or condition used to determine how the data should be split
at a decision node. It involves comparing feature values against a threshold.
• Pruning: The process of removing branches or nodes from a decision tree to improve its
generalization and prevent overfitting.
1. Information Gain (Entropy)
• Entropy is a measure of the impurity of a node: Entropy = − Σ pi log2 pi, where pi is the proportion of examples of class i at the node.
• Information gain is the reduction in entropy obtained by splitting the dataset on an attribute; the attribute with the highest information gain is chosen for the split.
2. Gini Index
• Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
• An attribute with a low Gini index should be preferred as compared to one with a high Gini index.
• It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
Gini index can be calculated using the formula: Gini = 1 − Σ pj², where pj is the proportion of examples of class j at the node.
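A tiny sketch computing both impurity measures for a set of class labels; the label lists are arbitrary examples:

# Hedged sketch: entropy and Gini impurity for a list of class labels.
# The label lists are arbitrary examples.
import numpy as np

def impurities(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    entropy = -np.sum(p * np.log2(p))   # Entropy = -sum p_i log2 p_i
    gini = 1.0 - np.sum(p ** 2)         # Gini   = 1 - sum p_j^2
    return entropy, gini

print(impurities(["yes", "yes", "no", "yes"]))   # impure node
print(impurities(["yes", "yes", "yes"]))         # pure node: entropy 0, gini 0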
7.3 Advantages of Decision Trees:
• Handling Non-linearity: Decision Trees can capture non-linear relationships between features and the target variable.
• Feature Selection: They can automatically select the most important features, reducing the
need for feature engineering.
• Versatility: Decision Trees can handle both categorical and numerical data.
• Efficiency: They are relatively efficient during prediction, with time complexity logarithmic
in the number of data points.
7.4 Disadvantages of Decision Trees:
• Bias Toward Dominant Classes: In classification tasks, Decision Trees can be biased toward dominant classes, leading to imbalanced predictions.
• Instability: Small variations in the data can lead to different tree structures, making them
unstable models.
• Greedy Algorithm: Decision Trees use a greedy algorithm, making locally optimal decisions
at each node, which may not lead to the global optimal tree structure
8 Support Vector Machine (SVM)
• The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane.
• SVMs pick the best separating hyperplane according to some criterion, e.g. maximum margin.
• The training process is an optimisation problem.
• SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified
using a decision boundary or hyperplane:
Here are the key concepts and characteristics of Support Vector Machines:
• Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify
the data points. This best boundary is known as the hyperplane of SVM.
• In a binary classification problem, an SVM finds a hyperplane that best separates the data
points of different classes. This hyperplane is the decision boundary.
• The dimensions of the hyperplane depend on the number of features in the dataset: if there are 2 features, the hyperplane is a straight line, and if there are 3 features, the hyperplane is a two-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the maximum
distance between the data points.
• Support Vectors:The data points or vectors that are the closest to the hyperplane and
which affect the position of the hyperplane are termed as Support Vector. Since these vectors
support the hyperplane, hence called a Support vector. They are critical for defining the
margin and determining the location of the hyperplane.
• Margin: The margin is the distance between the support vectors and the decision bound-
ary. SVM aims to maximize this margin because a larger margin often leads to better
generalization.
• C Parameter: The regularization parameter ”C” controls the trade-off between maximizing
the margin and minimizing the classification error. A smaller ”C” value results in a larger
margin but may allow some misclassifications, while a larger ”C” value allows for fewer
misclassifications but a smaller margin.
• Multi-Class Classification: SVMs are inherently binary classifiers, but they can be
extended to handle multi-class classification using techniques like one-vs-one (OvO) or one-
vs-all (OvA) classification.
• The Scalar Product: The scalar or dot product is, in some sense, a measure of similarity:
a · b = |a| |b| cos(θ)
8.1 Kernels
We may use Kernel functions to implicitly map to a new feature space
• Kernel fn: K(x1 , x2 ) ∈ R
• Kernel must be equivalent to an inner product in some feature space
• Kernel Trick: SVM can handle non-linearly separable data by using a kernel function to
map the data into a higher-dimensional space where it becomes linearly separable. Common
kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid kernels.
Examples of kernel functions
• Linear kernel: K(x, z) = x · z
• Polynomial kernel: K(x, z) = (x · z + c)^d
• Radial basis function (RBF/Gaussian) kernel: K(x, z) = exp(−||x − z||² / (2σ²))
• Sigmoid kernel: K(x, z) = tanh(κ x · z + c)
8.2 Classification Margin
The classification margin of an example is its distance from the decision boundary; the margin of the classifier is the smallest such distance over the training examples, and SVM training chooses the hyperplane that maximizes it.
8.3 Types of SVM
1. Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and
x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green or
blue. So as it is 2-d space so by just using a straight line, we can easily separate these two
classes. But there can be multiple lines that can separate these classes.
Hence, the SVM algorithm helps to find the best line or decision boundary; this best bound-
ary or region is called as a hyperplane. SVM algorithm finds the closest point of the lines
from both the classes. These points are called support vectors. The distance between the
vectors and the hyperplane is called as margin. And the goal of SVM is to maximize this
margin. The hyperplane with the maximum margin is called the optimal hyperplane.
2. Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
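A short sketch of fitting linear and RBF-kernel SVMs with scikit-learn; the toy data, the C value and the kernel choices are illustrative assumptions:

# Hedged sketch: linear vs. RBF-kernel SVM with scikit-learn.
# The toy data, C value and kernel choices are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0.2, 0.1], [1, 1], [1.1, 0.9], [0.9, 1.2], [0.1, 0.3]])
y = np.array([0, 0, 1, 1, 1, 0])

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)            # maximum-margin linear boundary
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)   # kernel trick for non-linear data

print("support vectors (linear):", linear_svm.support_vectors_)
print("prediction for [0.5, 0.5]:", rbf_svm.predict([[0.5, 0.5]]))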
8.4 Advantages of Support Vector Machines:
• Effective in High-Dimensional Spaces: SVMs perform well even in high-dimensional feature
spaces.
• Robust to Overfitting: SVMs are less prone to overfitting, especially when the margin is maximized.
• Accurate for Non-Linear Data: The kernel trick allows SVMs to work effectively on non-linear data by transforming it into higher dimensions.
• Wide Applicability: SVMs can be applied to various tasks, including classification, regres-
sion, and outlier detection.
Disadvantages of Support Vector Machines:
• Sensitivity to Kernel Choice: The choice of the kernel function and kernel parameters can significantly impact the SVM's performance.
• Challenging for Large Datasets: SVMs may not be suitable for very large datasets because
of their computational complexity.
9 Bias-Variance Trade-Off
• A key consideration in learning the target function from the training data is the extent of generalization. This is because the input data is just a limited, specific view, and the new, unknown data in the test data set may differ quite a bit from the training data.
• The fitness of a target function approximated by a learning algorithm determines how cor-
rectly it is able to classify a set of data it has never seen.
9.1 Underfitting
• If the target function is kept too simple, it may not be able to capture the essential nuances
and represent the underlying data well.
• A typical case of underfitting may occur when trying to represent a non-linear data with a
linear model as demonstrated by both cases of underfitting shown in figure 1.1
• Many times underfitting happens due to the unavailability of sufficient training data.
• Underfitting results in both poor performance with training data as well as poor generaliza-
tion to test data. Underfitting can be avoided by
1. using more training data
2. reducing features by effective feature selection
9.2 Overfitting
• Overfitting, in many cases, occurs as a result of trying to fit an excessively complex model to closely match the training data. This is represented with a sample data set in figure 1.1. The target function, in these cases, tries to make sure all training data points are correctly partitioned by the decision boundary. However, more often than not, this exact nature is not replicated in the unknown test data set. Hence, the target function results in wrong classifications in the test data set.
• Overfitting results in good performance with training data set, but poor generalization and
hence poor performance with test data set. Overfitting can be avoided by
1. using re-sampling techniques like k-fold cross validation
2. hold back of a validation data set
3. remove the nodes which have little or no predictive power for the given machine learning
problem.
• Both underfitting and overfitting result in poor classification quality which is reflected by
low classification accuracy
9.3 Bias-variance trade-off
• Parametric models generally have high bias, making them easier to understand/interpret and faster to learn.
• These algorithms have a poor performance on data sets, which are complex in nature and
do not align with the simplifying assumptions made by the algorithm.
• Errors due to variance occur from difference in training data sets used to train the model.
• Different training data sets (randomly sampled from the input data set) are used to train
the model. Ideally the difference in the data sets should not be significant and the model
trained using different training data sets should not be too different.
• However, in case of overfitting, since the model closely matches the training data, even a
small difference in training data gets magnified in the model.
So, the problems in training a model can happen because either
(a) the model is too simple and hence fails to interpret the data even grossly, or
(b) the model is extremely complex and magnifies even small differences in the training data.
• Complex Models vs. Simple Models: Complex models (e.g., deep neural networks)
tend to have low bias but high variance, whereas simple models (e.g., linear regression) tend
to have high bias but low variance.
• Balancing Act: Machine learning practitioners aim to strike a balance between bias and
variance to achieve a model with good generalization, one that performs well on both the
training data and new, unseen data.
• Underfitting and Overfitting: The trade-off helps address the problems of underfitting
(high bias) and overfitting (high variance). Underfit models don’t capture enough of the
data’s complexity, while overfit models fit noise in the data.
• Model Complexity: Adjusting model complexity, such as the number of features, the
choice of hyperparameters, and regularization techniques, is a way to manage the bias-
variance trade-off.
• Cross-Validation: Cross-validation techniques, like k-fold cross-validation, help estimate a
model’s performance on unseen data and guide the selection of the optimal model complexity.
Important Note
Increasing the bias will decrease the variance, and Increasing the variance will decrease the bias
On one hand, parametric algorithms are generally seen to demonstrate high bias but low variance.
On the other hand, non-parametric algorithms demonstrate low bias and high variance.
As can be observed in Figure 1.3.1, the best solution is to have a model with low bias as well as
low variance. However, that may not be possible in reality. Hence, the goal of supervised machine
learning is to achieve a balance between bias and variance. The learning algorithm chosen and the
user parameters which can be configured helps in striking a tradeoff between bias and variance.
For example, in the popular supervised algorithm k-Nearest Neighbours (kNN), the user-configurable parameter ‘k’ can be used to trade off bias and variance. When the value of ‘k’ is decreased, the model becomes more flexible and complex, so variance increases (and bias decreases). On the other hand, when the value of ‘k’ is increased, the decision boundary becomes smoother, so bias increases (and variance decreases).
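A rough sketch of how this trade-off can be observed empirically by varying k and comparing training and test accuracy; the synthetic dataset and the grid of k values are assumptions for illustration:

# Hedged sketch: observing the bias-variance effect of k in kNN.
# The synthetic dataset and the grid of k values are assumptions for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in [1, 5, 25, 101]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:3d}  train acc={knn.score(X_tr, y_tr):.2f}  test acc={knn.score(X_te, y_te):.2f}")
# Small k: near-perfect training accuracy but higher variance on test data.
# Large k: smoother model with higher bias; training and test accuracy move closer together.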
10 Cross-validation methods
• When the dataset is small, a single random train/test split (the holdout method) is prone to high variance: due to the random partition, the results can be entirely different for different test sets. To deal with this issue, we use cross-validation to evaluate the performance of a machine-learning model.
• In cross-validation, we don’t divide the dataset into training and test sets only once. Instead,
we repeatedly partition the dataset into smaller groups and then average the performance
in each group. That way, we reduce the impact of partition randomness on the results.
• Many cross-validation techniques define different ways to divide the dataset at hand. We’ll
focus on the two most frequently used: the k-fold and the leave-one-out methods.
10.1 K-Fold Cross-Validation
• Data Splitting: The dataset is divided into ’k’ subsets or folds, where each fold is used as the test set exactly once, and the rest are used for training.
• K-Fold Variations: Variations include stratified K-Fold, which ensures that each fold has
a similar class distribution, and repeated K-Fold, where the process is repeated multiple
times with different random splits.
• Performance: The final model performance is typically determined by averaging the results
of all ’k’ iterations, such as mean accuracy or root mean squared error.
• Trade-Off: There’s a trade-off between computational cost and model assessment quality.
Larger ’k’ values lead to a more accurate assessment but require more computation.
• Usage: K-Fold Cross-Validation is widely used in various machine learning tasks, including
model selection, hyperparameter tuning, and performance estimation.
• Validation Set: In practice, a separate validation set might be used to validate the final
model after hyperparameter tuning, while K-Fold Cross-Validation helps in assessing the
overall performance of the model.
10.2 Leave-One-Out Cross-Validation (LOO)
• Bias and Variance: LOO tends to produce a less biased estimate of a model’s performance than other cross-validation methods such as k-fold cross-validation. However, the LOO estimate can have high variance, and its many iterations make it computationally expensive.
• Model Evaluation: LOO cross-validation allows you to assess how well the model gener-
alizes to unseen data and identify potential issues like overfitting or data leakage.
• Advantages: LOO is particularly useful when dealing with imbalanced datasets or when
each data point is scarce and valuable, as none are left out during training.
Comparison
An important factor when choosing between the k-fold and the LOO cross-validation methods is
the size of the dataset.
When the size is small, LOO is more appropriate since it will use more training samples in each
iteration. That will enable our model to learn better representations.
Conversely, we use k-fold cross-validation to train a model on a large dataset since LOO trains n
models, one per sample in the data. When our dataset contains a lot of samples, training so many
models will take too long. So, the k-fold cross-validation is more appropriate.
Also, in a large dataset, it is sufficient to use less than n folds since the test folds are large enough
for the estimates to be sufficiently precise.
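A compact sketch comparing the two methods with scikit-learn; the dataset, the model and k = 5 are illustrative assumptions:

# Hedged sketch: k-fold vs. leave-one-out cross-validation with scikit-learn.
# The dataset, the model and k=5 are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())   # trains n models, one per sample

print("5-fold mean accuracy:", kfold_scores.mean())
print("LOO mean accuracy:   ", loo_scores.mean())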
11 Feedforward Neural Network (FNN)
Neural networks are a class of machine learning models inspired by the structure and function
of the human brain. Feedforward Neural Networks (FNN) represent the simplest form of neural
networks, where information travels in one direction, from the input layer to the output layer.
• Input Layer: This layer consists of neurons that receive inputs and pass them on to the
next layer. The number of neurons in the input layer is determined by the dimensions of the
input data.
• Hidden Layers: These layers are not exposed to the input or output and can be considered
as the computational engine of the neural network. Each hidden layer’s neurons take the
weighted sum of the outputs from the previous layer, apply an activation function, and pass
the result to the next layer. The network can have zero or more hidden layers.
• Output Layer: The final layer that produces the output for the given inputs. The number
of neurons in the output layer depends on the number of possible outputs the network is
designed to produce.
• Each neuron in one layer is connected to every neuron in the next layer, making this a
fully connected network. The strength of the connection between neurons is represented by
weights, and learning in a neural network involves updating these weights based on the error
of the output.
In a simple feedforward network of this kind, the hidden layers may use sigmoid or linear activation functions, whereas the output layer may use a Heaviside (binary step) activation function at its nodes, since a step function produces a hard class prediction when that is required. All units, also known as neurons, have weights; the calculation at a hidden unit is the sigmoid of the weighted sum (dot product) of all its incoming weights and signals. Adding more hidden units and layers can increase the representational power of the network and hence the accuracy of the output.
• Activation Functions
Non-linear functions applied to the weighted sum to introduce non-linearity and enable the
network to learn complex patterns.
• Weights
Parameters that the network learns during training, determining the strength of connections
between neurons.
• Biases
Additional parameters that are added to the weighted sum before applying the activation
function, allowing the network to better fit the data.
1. The input data is fed into the neurons of the input layer.
2. Each neuron in the hidden layers processes the input using weights, biases, and activation functions.
3. The output from each hidden layer is passed to the next layer.
4. This process continues until the output layer produces the final prediction.
• Feedforward Phase: In this phase, the input data is fed into the network, and it propagates
forward through the network. At each hidden layer, the weighted sum of the inputs is
calculated and passed through an activation function, which introduces non-linearity into
the model. This process continues until the output layer is reached, and a prediction is
made.
• Backpropagation Phase: Once a prediction is made, the error (difference between the
predicted output and the actual output) is calculated. This error is then propagated back
through the network, and the weights are adjusted to minimize this error. The process of
adjusting weights is typically done using a gradient descent optimization algorithm.
• Learning Rate
A hyperparameter that determines the step size in the weight and bias updates.
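A minimal NumPy sketch of the feedforward phase described above, for one hidden layer; the layer sizes, random weights and sigmoid activation are assumptions for illustration:

# Hedged sketch: feedforward pass of a small fully connected network.
# The layer sizes, random weights and sigmoid activation are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)   # input (4 features) -> hidden (3 units)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden -> output (1 unit)

x = np.array([0.5, -1.2, 0.3, 0.8])             # one input example

h = sigmoid(x @ W1 + b1)        # hidden layer: weighted sum + non-linear activation
y_hat = sigmoid(h @ W2 + b2)    # output layer: prediction between 0 and 1

print("prediction:", y_hat)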
Overfitting
When a model performs well on the training data but poorly on new, unseen data.
Regularization Techniques
Methods like dropout and L2 regularization are employed to prevent overfitting by penalizing
overly complex models.
12 Multi-Layer Perceptron (MLP)
12.1 Architecture:
• Input Layer: The input layer is responsible for receiving data from the outside world. Each neuron in the input layer corresponds to one feature, and the values from the dataset are directly fed into these neurons.
• Hidden Layers: Between the input and output layers, there can be one or more hidden
layers. These layers contain neurons, also known as units or nodes, which are responsible for
learning complex patterns and relationships in the data. Hidden layers add the capacity to
model non-linear functions. An MLP can have a varying number of hidden layers and units,
depending on the problem’s complexity.
• Output Layer: The output layer is responsible for producing the final results or predictions.
The number of output neurons depends on the nature of the task. For instance, in binary
classification, there might be a single output neuron that outputs the probability of belonging
to one class, while in multi-class classification, there could be multiple output neurons, each
corresponding to a class.
12.2 Backpropagation
• Backpropagation is a technique used to optimize the weights of an MLP using the outputs
as inputs.
• In a conventional MLP, random weights are assigned to all the connections. These random
weights propagate values through the network to produce the actual output. Naturally, this
output would differ from the expected output. The difference between the two values is
called the error.
• Backpropagation refers to the process of sending this error back through the network, read-
justing the weights automatically so that eventually, the error between the actual and ex-
pected output is minimized.
• In this way, the error from the current iteration is fed back and affects the next iteration's output. This is repeated until the error becomes acceptably small. The weights at the end of this process are the ones with which the neural network works correctly.
• Regularization: Regularization techniques like dropout, L1, and L2 regularization are used
to prevent overfitting.
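As a rough illustration of an MLP trained by backpropagation, with the learning rate and L2 regularization exposed as hyperparameters, here is a small scikit-learn sketch; the dataset, layer sizes and hyperparameter values are assumptions:

# Hedged sketch: a multi-layer perceptron trained with backpropagation (scikit-learn).
# The dataset, hidden-layer sizes and hyperparameter values are illustrative assumptions.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

mlp = MLPClassifier(
    hidden_layer_sizes=(16, 8),   # two hidden layers
    activation="relu",            # non-linear activation
    learning_rate_init=0.01,      # step size for the weight updates
    alpha=1e-4,                   # L2 regularization strength
    max_iter=1000,
    random_state=0,
).fit(X_tr, y_tr)                 # weights updated by backpropagation + gradient descent

print("test accuracy:", mlp.score(X_te, y_te))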
13 Intro to UnSupervised Learning
• Unsupervised learning in artificial intelligence is a type of machine learning that learns from
data without human supervision.
• Unlike supervised learning, unsupervised machine learning models are given unlabeled data
and allowed to discover patterns and insights without any explicit guidance or instruction.
• Unsupervised learning is a machine learning paradigm where training examples lack labels,
and clustering prototypes are typically initialized randomly. The primary goal is to optimize
these cluster prototypes based on similarities among the training examples.
• Unsupervised learning is a machine learning paradigm that deals with unlabeled data and
aims to group similar data items into clusters. It differs from supervised learning, where
labeled data is used for classification or regression tasks. Unsupervised learning has applica-
tions in text clustering and other domains and can be adapted for supervised learning when
necessary.
• Unsupervised learning doesn’t refer to a specific algorithm but rather a general framework.
The process involves deciding on the number of clusters, initializing cluster prototypes, and
iteratively assigning data items to clusters based on similarity. These clusters are then
updated until convergence is achieved.
• Unsupervised learning algorithms have various real-world applications. For instance, they
can be used in text clustering, as seen in the application of algorithms like AHC and Kohonen
Networks to manage and cluster textual data. They can also be used for tasks like detecting
redundancy in national research projects.
13.1 Working
As the name suggests, unsupervised learning uses self-learning algorithms—they learn without
any labels or prior training. Instead, the model is given raw, unlabeled data and has to infer its
own rules and structure the information based on similarities, differences, and patterns without
explicit instructions on how to work with each piece of data.
Unsupervised learning algorithms are better suited for more complex processing tasks, such
as organizing large datasets into clusters. They are useful for identifying previously undetected
patterns in data and can help identify features useful for categorizing data.
Imagine that you have a large dataset about weather. An unsupervised learning algorithm will
go through the data and identify patterns in the data points. For instance, it might group data
by temperature or similar weather patterns.
While the algorithm itself does not understand these patterns based on any previous information
you provided, you can then go through the data groupings and attempt to classify them based on
your understanding of the dataset. For instance, you might recognize that the different temper-
ature groups represent all four seasons or that the weather patterns are separated into different
types of weather, such as rain, sleet, or snow.
1. Clustering
Clustering is a technique for exploring raw, unlabeled data and breaking it down into groups
(or clusters) based on similarities or differences. It is used in a variety of applications,
including customer segmentation, fraud detection, and image analysis. Clustering algorithms
split data into natural groups by finding similar structures or patterns in uncategorized data.
Clustering is one of the most popular unsupervised machine learning approaches. There are
several types of unsupervised learning algorithms that are used for clustering, which include
exclusive, overlapping, hierarchical, and probabilistic.
• Exclusive clustering: Data is grouped in a way where a single data point can only
exist in one cluster. This is also referred to as “hard” clustering. A common example of
exclusive clustering is the K-means clustering algorithm, which partitions data points
into a user-defined number K of clusters.
• Overlapping clustering: Data is grouped in a way where a single data point can
exist in two or more clusters with different degrees of membership. This is also referred
to as “soft” clustering.
• Hierarchical clustering: Clusters are organized into a hierarchy: similar clusters are
repeatedly merged (agglomerative) or split (divisive) according to their hierarchical relation-
ships. The agglomerative variant is also referred to as HAC (hierarchical agglomerative
clustering).
• Probabilistic clustering: Data is grouped into clusters based on the probability of
each data point belonging to each cluster. This approach differs from the other methods,
which group data points based on their similarities to others in a cluster.
2. Association
Association rule mining is a rule-based approach to reveal interesting relationships between
data points in large datasets. Unsupervised learning algorithms search for frequent if-then
associations—also called rules—to discover correlations and co-occurrences within the data
and the different connections between data objects.
It is most commonly used to analyze retail baskets or transactional datasets to represent how
often certain items are purchased together. These algorithms uncover customer purchasing
patterns and previously hidden relationships between products that help inform recommen-
dation engines or other cross-selling opportunities. You might be most familiar with these
rules from the “Frequently bought together” and “People who bought this item also bought”
sections on your favorite online retail shop.
Association rules are also often used to organize medical datasets for clinical diagnoses.
Using unsupervised machine learning and association rules can help doctors identify the
probability of a specific diagnosis by comparing relationships between symptoms from past
patient cases.
Typically, Apriori algorithms are the most widely used for association rule learning to identify
related collections of items or item sets, although other algorithms, such as Eclat and FP-Growth,
are also used; a small example of mining such rules is sketched after this list.
3. Dimensionality reduction
Dimensionality reduction is an unsupervised learning technique that reduces the number of
features, or dimensions, in a dataset. More data is generally better for machine learning,
but it can also make it more challenging to visualize the data.
Dimensionality reduction extracts the important features from the dataset, reducing the number
of irrelevant or redundant features. Common algorithms for this purpose include principal
component analysis (PCA) and singular value decomposition (SVD), which reduce the number of
data inputs without compromising the integrity of the properties in the original data.
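As a rough illustration of association rule mining, the sketch below uses the Apriori implementation
from the optional mlxtend package; the package choice and the toy basket data are assumptions
made only for illustration.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical toy retail baskets (one list of items per transaction)
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "butter", "milk"],
]

# One-hot encode the transactions into a boolean item matrix
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets with support >= 0.4, then if-then rules with confidence >= 0.7
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])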
13.3 Supervised learning vs. unsupervised learning
The main difference between supervised learning and unsupervised learning is the type of input
data that you use. Unlike unsupervised machine learning algorithms, supervised learning relies on
labeled training data to determine whether pattern recognition within a dataset is accurate.
The goals of supervised learning models are also predetermined, meaning that the type of
output of a model is already known before the algorithms are applied. In other words, the input
is mapped to the output based on the training data.
Common approaches to cluster analysis include:
• Partitioning clustering
• Agglomerative clustering
• Divisive clustering
• K-Means clustering
There are many different fields where cluster analysis is used effectively, such as
• Text data mining: this includes tasks such as text categorization, text clustering, docu-
ment summarization, concept extraction, sentiment analysis, and entity relation modelling
• Data mining: simplify the data mining task by grouping a large number of features from
an extremely large data set to make the analysis manageable
14 K-Means Clustering
K-Medoids and K-Means are two types of clustering mechanisms in Partition Clustering. Clustering
is the process of breaking down an abstract group of data points/objects into classes of similar
objects such that all the objects in one cluster have similar traits; that is, a group of n objects
is broken down into k clusters based on their similarities.
• It aims to find cluster centers (centroids) and assign data points to the nearest centroid
based on their similarity.
• K-means clustering is one of the simplest and most popular unsupervised machine learning
algorithms.
• Typically, unsupervised algorithms make inferences from datasets using only input vectors
without referring to known, or labelled, outcomes.
• A cluster refers to a collection of data points aggregated together because of certain similar-
ities.
• K-means is a clustering algorithm that is very simple to implement. It works by initially
selecting k centroids, where k is an input to the algorithm and defines the number of clusters
required. The centroids serve as the centers of the new clusters.
• We first assign each data point of the given data to the nearest centroid. Next, we calculate
the mean of each cluster and the means then serve as the new centroids. This step is repeated
until the positions of the centroids do not change anymore.
• The goal of k-means is to minimize the sum of the squared distances between each data
point and the centroid of its assigned cluster.
This objective is the within-cluster sum of squares (WCSS):

J = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \lVert x_{ij} - c_i \rVert^2

where K is the number of clusters, n_i is the number of data points assigned to cluster i, x_{ij} is
the j-th data point in cluster i, and c_i is the centroid of cluster i. The algorithm proceeds as
follows:
• Initialization: Choose the number of clusters (K) you want to create. Initialize K cluster
centroids randomly. These centroids can be selected from the data points or with random
values.
• Assignment Step: For each data point in the dataset, calculate the distance between the
data point and all K centroids. Assign the data point to the cluster associated with the
nearest centroid. This step groups data points into clusters based on similarity.
• Update Step: Recalculate the centroids for each cluster by taking the mean of all data
points assigned to that cluster. The new centroids represent the center of their respective
clusters.
• Convergence Check: Check if the algorithm has converged. Convergence occurs when the
centroids no longer change significantly between iterations. If convergence is not reached,
return to the Assignment and Update steps.
Purpose of WCSS
• Minimization of WCSS is the goal of the K-means algorithm. The algorithm iteratively tries
to adjust the positions of the centroids to minimize the WCSS.
• WCSS is also used to determine the optimal number of clusters using the Elbow Method,
which helps find a balance between the number of clusters and how well the data is clustered.
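As a rough sketch of the procedure above, the snippet below runs scikit-learn's KMeans on synthetic
data; the dataset, K = 3, and the random seeds are assumptions made only for illustration. The
fitted model's inertia_ attribute corresponds to the WCSS described here.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 3 natural groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)          # assignment + update steps repeated until convergence

print("Centroids:\n", kmeans.cluster_centers_)
print("WCSS (inertia):", kmeans.inertia_)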
14.1 Methods to Determine the Optimal Number of Clusters (K) in
K-means Clustering
In K-means clustering, the number of clusters K needs to be specified before the algorithm runs.
Choosing the right value for K is crucial because it affects how well the data is clustered. Two
popular methods to determine the optimal number of clusters are the Elbow Method and the
Silhouette Method.
1. The Elbow Method
The Elbow Method plots the WCSS against a range of values of K. As K increases, the WCSS
decreases; the point where the rate of decrease slows down sharply (the “elbow” of the curve)
suggests a suitable value of K. If the curve has no clear elbow, choosing the right number of
clusters can be subjective.
2. The Silhouette Method
The Silhouette Method measures how similar a data point is to its own cluster (cohesion) compared
to other clusters (separation). It computes the silhouette coefficient for each point, which quantifies
how well a point fits into its assigned cluster. The average silhouette score across all points can
be used to evaluate different values of K.
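A brief sketch, under the same illustrative assumptions (synthetic blobs, candidate K from 2 to 7),
comparing values of K using the WCSS (for the elbow curve) and the average silhouette score.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)          # mean silhouette over all points
    print(f"K={k}  WCSS={km.inertia_:.1f}  silhouette={sil:.3f}")
# Look for the 'elbow' in the WCSS values and/or the K with the highest average silhouette.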
14.2 Comparison of Elbow and Silhouette Methods
14.3 K-medoids algorithm
• K-medoids is an unsupervised method for clustering unlabelled data. It is an improved variant
of the K-Means algorithm, designed mainly to reduce sensitivity to outliers.
• Compared to other partitioning algorithms, the algorithm is simple, fast, and easy to imple-
ment.
• K-medoids is similar to k-means, but rather than minimizing the sum of squared distances
to a centroid, it minimizes the sum of pairwise dissimilarities between the points and their
cluster medoid.
• We find this useful since k-medoids aims to form clusters where the objects within each
cluster are more similar to each other and dissimilar to objects in the other clusters. Instead
of centroids, this approach makes use of medoids.
• Medoids are points in the dataset whose sum of distances to other points in the cluster is
minimal.
There are three main algorithms for K-Medoids clustering: PAM, CLARA, and CLARANS (no need
to go into detail on each). PAM is the most powerful of the three but has the disadvantage of
higher time complexity; the K-Medoids procedure described above follows PAM, while CLARA and
CLARANS are variants designed to scale PAM to larger datasets.
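The snippet below is a simplified, alternating k-medoids sketch in NumPy rather than the full PAM
swap procedure: points are assigned to the nearest medoid, and each medoid is then replaced by the
cluster member with the smallest total dissimilarity to the other members. The toy data and k = 2
are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])  # toy data
k = 2

D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distance matrix
medoids = rng.choice(len(X), size=k, replace=False)        # random initial medoids

for _ in range(20):
    labels = np.argmin(D[:, medoids], axis=1)              # assign each point to its nearest medoid
    new_medoids = medoids.copy()
    for c in range(k):
        members = np.where(labels == c)[0]
        # choose the member minimizing total dissimilarity within the cluster
        new_medoids[c] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
    if np.array_equal(new_medoids, medoids):
        break
    medoids = new_medoids

print("Medoid points:\n", X[medoids])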
15 Hierarchical clustering
Till now, we have discussed the various methods for partitioning the data into different clusters.
But there are situations when the data needs to be partitioned into groups at different levels such
as in a hierarchy. The hierarchical clustering methods are used to group the data into hierarchy
or tree-like structure.
For example, in a machine learning problem of organizing employees of a university in different
departments, first the employees are grouped under the different departments in the university,
and then within each department, the employees can be grouped according to their roles such
as professors, assistant professors, supervisors, lab assistants, etc. This creates a hierarchical
structure of the employee data and eases visualization and analysis. Similarly, there may be a
data set which has an underlying hierarchy structure that we want to discover and we can use the
hierarchical clustering methods to achieve that.
• The hierarchical clustering produces clusters in which the clusters at each level of the hier-
archy are created by merging clusters at the next lower level.
• At the lowest level, each cluster contains a single observation. At the highest level there is
only one cluster containing all of the data.
• The decision regarding whether two clusters are to be merged or not is taken based on the
measure of dissimilarity between the clusters. The distance between two clusters is usually
taken as the measure of dissimilarity between the clusters.
• The choice of linkage method and distance metric can significantly impact the results and
the structure of the dendrogram.
• Dendrograms are useful for visualizing the hierarchy and helping to decide how many clusters
are appropriate for a particular application.
• Hierarchical clustering can be computationally intensive, especially for large datasets, and
may not be suitable for such cases.
15.1 Hierarchical Clustering Methods
There are two main hierarchical clustering methods: agglomerative clustering and divisive
clustering.
Agglomerative clustering is a bottom-up technique which starts with individual objects as
clusters and then iteratively merges them to form larger clusters. On the other hand, the divisive
method starts with one cluster containing all the given objects and then splits it iteratively to form
smaller clusters.
In both these cases, it is important to select the split and merger points carefully, because the
subsequent splits or mergers will use the result of the previous ones and there is no option to
perform any object swapping between the clusters or rectify the decisions made in previous steps,
which may result in poor clustering quality at the end.
15.1.1 Dendrogram
• Hierarchical clustering can be represented by a rooted binary tree. The nodes of the trees
represent groups or clusters. The root node represents the entire data set. The terminal
nodes each represent one of the individual observations (singleton clusters). Each nontermi-
nal node has two daughter nodes.
• The distance between merged clusters is monotone increasing with the level of the merger.
The height of each node above the level of the terminal nodes in the tree is proportional to
the value of the distance between its two daughters (see Figure 13.9).
• A dendrogram is a tree diagram used to illustrate the arrangement of the clusters produced
by hierarchical clustering.
• The dendrogram may be drawn with the root node at the top and the branches growing
vertically downwards (see Figure 13.8(a)).
• It may also be drawn with the root node at the left and the branches growing horizontally
rightwards (see Figure 13.8(b)).
Example Figure 13.7 is a dendrogram of the dataset {a, b, c, d, e}. Note that the root node rep-
resents the entire dataset and the terminal nodes represent the individual observations. However,
the dendrograms are presented in a simplified format in which only the terminal nodes (that is, the
nodes representing the singleton clusters) are explicitly displayed. Figure 13.8 shows the simplified
format of the dendrogram in Figure 13.7. Figure 13.9 shows the distances of the clusters at the
various levels. Note that the clusters are at 4 levels. The distance between the clusters {a} and
{b} is 15, between {c} and {d} is 7.5, between {c, d} and {e} is 15 and between {a, b} and {c, d, e}
is 25.
Figure 9: Figure 13.9: A dendrogram of the dataset a, b, c, d, e showing the distances (heights)
of the clusters at different levels
15.1.2 Agglomerative Hierarchical Clustering (Bottom-Up)
• Step 1: Initialization Start with each data point as a separate cluster. If you have N data
points, you initially have N clusters.
• Step 2: Merge Clusters Calculate the pairwise distances (similarity or dissimilarity) between
all clusters. Common distance metrics include Euclidean distance, Manhattan distance, or
other similarity measures. Merge the two closest clusters based on the distance metric.
There are various linkage methods to define cluster distance:
– Single Linkage: Merge clusters based on the minimum distance between any pair of
data points from the two clusters.
– Complete Linkage: Merge clusters based on the maximum distance between any pair
of data points from the two clusters.
– Average Linkage: Merge clusters based on the average distance between data points
in the two clusters.
– Ward’s Method: Minimize the increase in variance when merging clusters. Repeat
this merging process iteratively until all data points are in a single cluster or until you
reach the desired number of clusters.
• Step 3: Dendrogram During the merging process, create a dendrogram, which is a tree-
like structure that represents the hierarchy of clusters. The dendrogram provides a visual
representation of how clusters merge and shows the relationships between clusters at different
levels of granularity.
• Step 4: Cutting the Dendrogram To determine the number of clusters, you can cut the
dendrogram at a specific level. The height at which you cut the dendrogram corresponds to
the number of clusters you obtain. The cut produces the final clusters at the chosen level of
granularity.
• Step 5: Results The resulting clusters are obtained based on the cut level. Each cluster
contains a set of data points that are similar to each other according to the chosen linkage
method.
For example, the hierarchical clustering shown in Figure 13.7 can be constructed by the ag-
glomerative method as shown in Figure 13.10. Each nonterminal node has two daughter nodes.
The daughters represent the two groups that were merged to form the parent.
Figure 10: Figure 13.10: Hierarchical clustering using agglomerative method
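Below is a minimal agglomerative clustering sketch using SciPy's hierarchy module; the toy points,
the choice of average linkage, and the cut into two clusters are assumptions for illustration only.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Small toy dataset (each row is one observation, i.e. one initial singleton cluster)
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

Z = linkage(X, method="average")                   # iteratively merge the two closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters
print("Cluster labels:", labels)

# dendrogram(Z) draws the merge hierarchy (requires matplotlib):
# import matplotlib.pyplot as plt; dendrogram(Z); plt.show()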
15.1.3 Divisive Hierarchical Clustering (Top-Down)
• Top-down hierarchical clustering, also known as divisive hierarchical clustering, is a clus-
tering algorithm that starts with all data points in a single cluster and recursively divides
clusters into smaller sub-clusters until each data point forms its own individual cluster or a
specified number of clusters is reached.
• The divisive hierarchical clustering method uses a top-down strategy. The starting point is
the largest cluster with all the objects in it, and then, it is split recursively to form smaller
and smaller clusters, thus forming the hierarchy. The end of iterations is achieved when the
objects in the final clusters are sufficiently homogeneous to each other or the final clusters
contain only one object or the user-defined clustering condition is achieved.
• The divisive method starts at the top and at each level recursively split one of the existing
clusters at that level into two new clusters.
• If there are N observations in the dataset, then the divisive method will also produce N − 1
levels in the hierarchy. The split at each level is chosen to produce the two new groups with
the largest “between-group dissimilarity”.
– Step 1: Initialization
Start with all data points as members of a single cluster. If you have N data points,
you initially have one cluster with N members.
– Step 2: Split Clusters
∗ Calculate the within-cluster variance or a similar measure for the current cluster.
This measure represents the compactness of the cluster.
∗ Divide the cluster into two sub-clusters in a way that minimizes the within-cluster
variance. There are various methods to achieve this, such as k-means or hierarchical
clustering on the sub-cluster.
∗ Repeat the splitting process recursively for each sub-cluster until you reach the
desired number of clusters.
– Step 3: Dendrogram (Optional)
While divisive hierarchical clustering often doesn’t produce a dendrogram like agglom-
erative clustering, you can still record the hierarchy of cluster splits for analysis if
needed.
– Step 4: Results
The resulting clusters are obtained based on the recursive splits. Each cluster contains
a set of data points that are similar to each other according to the splitting criteria.
For example, the hierarchical clustering shown in Figure 13.7 can be constructed by the divisive
method as shown in Figure 13.11. Each nonterminal node has two daughter nodes. The two
daughters represent the two groups resulting from the split of the parent.
Figure 11: Figure 13.11: Hierarchical clustering using divisive method
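As a rough analogue of this top-down strategy, the sketch below uses scikit-learn's BisectingKMeans,
which starts from a single all-inclusive cluster and repeatedly bisects an existing cluster with
k-means until the requested number of clusters is reached; this assumes a recent scikit-learn
version (1.1 or later) and uses illustrative synthetic data.

from sklearn.cluster import BisectingKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Start from one all-inclusive cluster and recursively bisect it
model = BisectingKMeans(n_clusters=4, random_state=0).fit(X)
print("Cluster sizes:", [int((model.labels_ == c).sum()) for c in range(4)])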
Often the distance measure is used to decide when to terminate the clustering algorithm. For
example, in an agglomerative clustering, the merging iterations may be stopped once the MIN
distance between two neighboring clusters becomes less than the user-defined threshold.
So, when an algorithm uses the minimum distance Dmin to measure the distance between the
clusters, then it is referred to as nearest neighbour clustering algorithm, and if the decision to
stop the algorithm is based on a user-defined limit on Dmin , then it is called a single linkage
algorithm.
On the other hand, when an algorithm uses the maximum distance Dmax to measure the
distance between the clusters, it is referred to as the furthest neighbour clustering algorithm,
and if the decision to stop the algorithm is based on a user-defined limit on Dmax, then it is called
a complete linkage algorithm.
Since the minimum and maximum measures represent two extreme ways of measuring the distance
between clusters, they are sensitive to outliers and noisy data. Using the mean or average distance
instead helps avoid this problem and provides more consistent results.
Single linkage clustering proceeds as follows:
1. Initialization: Start with each data point as an individual cluster. If you have N data
points, you initially have N clusters.
2. Cluster Distance: Calculate the pairwise distances between all clusters. The distance
between two clusters is defined as the minimum distance between any two data points, one
from each cluster.
3. Merge Clusters: Merge the two clusters with the shortest distance, as defined by single
linkage. This creates a new, larger cluster.
4. Repeat: Continue steps 2 and 3 iteratively until all data points are part of a single cluster,
or you reach a specified number of clusters.
5. Dendrogram: During the process, create a dendrogram, which is a tree-like structure that
represents the hierarchy of clusters. It records the sequence of cluster mergers.
6. Cutting the Dendrogram: To determine the number of clusters, you can cut the dendro-
gram at a specific level. The height at which you cut the dendrogram corresponds to the
number of clusters you obtain.
Properties of single linkage:
• It is sensitive to outliers and noise because a single close pair of points from different clusters
can cause a merger.
• It tends to create elongated clusters, as it connects clusters based on single, nearest neighbors.
• Single linkage is just one of several linkage methods used in hierarchical clustering, each with
its own strengths and weaknesses. The choice of linkage method depends on the nature of
the data and the desired clustering outcome.
Complete linkage clustering follows a similar procedure:
1. Initialization: Start with each data point as an individual cluster. If you have N data
points, you initially have N clusters.
2. Cluster Distance: Calculate the pairwise distances between all clusters. The distance
between two clusters is defined as the maximum distance between any two data points, one
from each cluster.
3. Merge Clusters: Merge the two clusters with the smallest complete-linkage distance (that is,
the smallest of these maximum pairwise distances). This creates a new, larger cluster.
4. Repeat: Continue steps 2 and 3 iteratively until all data points are part of a single cluster
or you reach a specified number of clusters.
5. Dendrogram: During the process, create a dendrogram that records the sequence of cluster
mergers.
6. Cutting the Dendrogram: To determine the number of clusters, you can cut the dendro-
gram at a specific level. The height at which you cut the dendrogram corresponds to the
number of clusters you obtain.
Properties of complete linkage:
• It tends to produce compact, spherical clusters since it minimizes the maximum distance
within clusters.
• It is less sensitive to outliers than single linkage because it focuses on the maximum distance.
The choice of linkage method (single, complete, average, etc.) depends on the nature of the
data and the desired clustering outcome. Different linkage methods can produce different cluster
structures based on how distance is defined between clusters.
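To illustrate how the linkage choice can change the result, the sketch below runs scikit-learn's
AgglomerativeClustering with single, complete, and average linkage on the same toy data; the data,
the added outlier-like point, and the number of clusters are illustrative assumptions.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)
X = np.vstack([X, [[0.0, 20.0]]])        # add one outlier-like point

for link in ("single", "complete", "average"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
    print(link, "cluster sizes:", np.bincount(labels))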
16 Dimensionality Reduction
Dimensionality reduction is a technique used in machine learning and data analysis to reduce
the number of features (dimensions) in a dataset. It involves transforming a high-dimensional
dataset into a lower-dimensional representation while retaining the most relevant information.
Dimensionality reduction is useful for several reasons:
• Curse of Dimensionality: High-dimensional data can suffer from the curse of dimension-
ality, where the data becomes sparse and noisy, making it challenging to analyze and model
effectively. Reducing dimensionality can help mitigate this issue.
Popular algorithms used for dimensionality reduction include principal component analysis (PCA)
and singular value decomposition (SVD). These algorithms seek to transform data from high-dimensional
spaces to low-dimensional spaces without compromising meaningful properties in the original data. These techniques are
typically deployed during exploratory data analysis (EDA) or data processing to prepare the data
for modeling.
It’s helpful to reduce the dimensionality of a dataset during EDA to help visualize data: this
is because visualizing data in more than three dimensions is difficult. From a data processing
perspective, reducing the dimensionality of the data simplifies the modeling problem.
16.1 Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a widely used dimensionality reduction technique in the
fields of statistics, machine learning, and data analysis. It aims to transform high-dimensional
data into a lower-dimensional representation while retaining the most important information and
patterns.
PCA achieves this by finding a set of orthogonal axes, known as principal components, in the
high-dimensional data space. Each principal component is a linear combination of the original
features.
Here’s a step-by-step explanation of how PCA works:
1. Data Standardization: Start with a dataset of high-dimensional data, where each column
represents a feature or variable. Standardize the data by subtracting the mean and dividing
by the standard deviation for each feature. This step ensures that all features have the same
scale.
2. Covariance Matrix Calculation: Calculate the covariance matrix of the standardized
data. The covariance matrix is a square matrix where each element represents the covariance
between pairs of features.
3. Eigenvalue Decomposition: Perform eigenvalue decomposition on the covariance matrix
to extract eigenvalues and eigenvectors. The eigenvalues represent the amount of variance
explained by each principal component. The eigenvectors represent the directions in the
high-dimensional space that maximize the variance.
4. Sort Eigenvalues: Order the eigenvalues in descending order. The first eigenvalue corre-
sponds to the most significant variance, the second eigenvalue to the second-most significant
variance, and so on.
5. Select Principal Components: Decide how many principal components you want to
retain in the lower-dimensional representation. This choice can be based on the proportion
of explained variance or specific requirements for dimensionality reduction.
6. Projection: Use the selected principal components (eigenvectors) to transform the data.
Each data point is projected onto the subspace defined by these principal components. This
transformation results in a lower-dimensional representation of the data.
7. Variance Explained: Calculate the proportion of total variance explained by the retained
principal components. This information can help you assess the quality of the dimensionality
reduction.
8. Visualization and Analysis: Visualize and analyze the lower-dimensional data to gain
insights, identify patterns, or facilitate further data analysis. Principal components can be
interpreted to understand the relationships between features in the original data.
9. Inverse Transformation (Optional): If necessary, you can perform an inverse transfor-
mation to map the reduced-dimensional data back to the original high-dimensional space.
However, some information may be lost in this process.
10. Application: Use the lower-dimensional data for various tasks, such as visualization, clus-
tering, classification, or regression, with reduced computational complexity and noise.
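A compact NumPy sketch of the steps above (standardize, covariance matrix, eigendecomposition,
sorting, projection, explained variance); the random data and the choice of retaining two components
are assumptions, and in practice scikit-learn's PCA class performs an equivalent computation.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # toy data with 5 features
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]            # introduce some correlation

# 1. Standardize each feature
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data
C = np.cov(Xs, rowvar=False)

# 3-4. Eigendecomposition, then sort eigenvalues in descending order
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5-6. Keep the top 2 principal components and project the data onto them
k = 2
X_reduced = Xs @ eigvecs[:, :k]

# 7. Proportion of total variance explained by the retained components
print("Explained variance ratio:", eigvals[:k] / eigvals.sum())
print("Reduced shape:", X_reduced.shape)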
PCA provides several benefits:
• Noise Reduction: PCA can help filter out noise in the data, leading to cleaner and more
interpretable patterns.
• Feature Engineering: PCA can serve as a form of feature engineering, identifying the
most important features for a given task.
16.2 Linear Discriminant Analysis (LDA)
• Linear Discriminant Analysis (LDA) is a supervised learning algorithm primarily used for
classification tasks and dimensionality reduction. Unlike unsupervised methods such as
Principal Component Analysis (PCA), LDA leverages class label information to maximize
the separability between multiple classes.
• LDA was introduced by Ronald A. Fisher in 1936, primarily for distinguishing between
species of Iris flowers. Fisher’s work laid the groundwork for subsequent developments in
statistical pattern recognition and machine learning.
Theoretical Framework
At its core, LDA seeks to find a linear combination of features that best separates two or more
classes of objects or events. The key idea is to project high-dimensional data onto a lower-
dimensional space while preserving class discriminatory information.
• Maximize Between-Class Variance: Ensures that different classes are as far apart as
possible.
• Minimize Within-Class Variance: Ensures that data points within the same class are
as close to each other as possible.
Assumptions
LDA operates under several key assumptions:
• Multivariate Normality: The features within each class are assumed to be approximately
normally (Gaussian) distributed.
• Equal Covariance Matrices: All classes share the same covariance matrix, implying that they
have similar shapes in feature space.
• Independence: The observations are assumed to be independent of one another.
Violations of these assumptions can affect the performance and reliability of LDA.
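A short illustrative example using scikit-learn's LinearDiscriminantAnalysis on the Iris dataset
(the dataset choice and the reduction to two components are assumptions); note that the class
labels y are exactly what distinguishes LDA from PCA here.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA uses the class labels y to find directions that separate the classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print("Reduced shape:", X_lda.shape)
print("Explained variance ratio:", lda.explained_variance_ratio_)
print("Training accuracy as a classifier:", lda.score(X, y))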
Please watch the lecture for better understanding
17 Performance Metrics
Performance metrics are used to evaluate how well a model performs, typically by comparing its
predictions against actual outcomes. The choice of metric depends on the problem type (e.g.,
classification, regression) and the goals of the model.
• For imbalanced classes, accuracy can be misleading, and metrics like precision, recall, and
F1-score are often more informative.
• In regression tasks, MAE, MSE, and R² are commonly used, but the choice depends on how
sensitive the model should be to large errors.
• For clustering, a combination of internal and external validation metrics helps to evaluate
the quality of the clustering.
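A brief sketch computing a few of these metrics with scikit-learn; the toy ground-truth values and
predictions below are made up purely for illustration.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_absolute_error, mean_squared_error, r2_score)

# Classification: hypothetical true labels vs. model predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))

# Regression: hypothetical true values vs. model predictions
r_true = [3.0, 5.0, 2.5, 7.0]
r_pred = [2.8, 5.4, 2.9, 6.1]
print("MAE:", mean_absolute_error(r_true, r_pred))
print("MSE:", mean_squared_error(r_true, r_pred))
print("R^2:", r2_score(r_true, r_pred))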
References