UNIT VI PARAMETRIC MACHINE LEARNING
Logistic Regression: Classification and representation – Cost function – Gradient descent –
Advanced optimization – Regularization - Solving the problems on overfitting. Perceptron –
Neural Networks – Multi – class Classification - Backpropagation – Non-linearity with
activation functions (Tanh, Sigmoid, Relu, PRelu) - Dropout as regularization
Logistic Regression
What are the differences between supervised learning, unsupervised learning & reinforcement
learning?
Machine learning algorithms are broadly classified into three categories – supervised learning,
unsupervised learning, and reinforcement learning.
1. Supervised Learning - Learning where data is labeled and the motivation is to classify
something or predict a value. Example: Detecting fraudulent transactions from a list of credit
card transactions.
2. Unsupervised Learning - Learning where data is not labeled and the motivation is to find
patterns in given data. In this case, you are asking the machine learning model to process the data
from which you can then draw conclusions. Example: Customer segmentation based on spend
data.
3. Reinforcement Learning - Learning by trial and error. This is the closest to how humans
learn. The motivation is to find an optimal policy for how to act in a given environment. The
machine learning model examines all possible actions, forms a policy that maximizes benefit,
and implements the policy (trial). If the initial policy produces errors, reinforcements are fed
back into the algorithm, and this continues until the optimal policy is reached. Example:
Personalized recommendations on streaming platforms like YouTube.
What are the two types of supervised learning?
As supervised learning is used to classify something or predict a value, naturally there are two
types of algorithms for supervised learning - classification models and regression models.
1. Classification model - In simple terms, a classification model predicts which of a set of
possible outcomes occurs. Example: Predicting if a transaction is fraud or not.
2. Regression model - Used to predict a numerical value. Example: Predicting the sale price
of a house.
What is logistic regression?
MC4301 MACHINE LEARINIG
Logistic regression is an example of supervised learning. It is used to calculate or predict the
probability of a binary (yes/no) event occurring. An example of logistic regression could be
applying machine learning to determine if a person is likely to be infected with COVID-19 or
not. Since we have two possible outcomes to this question - yes they are infected, or no they are
not infected - this is called binary classification.
In this imaginary example, the probability of a person being infected with COVID-19 could be
based on the viral load and the symptoms and the presence of antibodies, etc. Viral load,
symptoms, and antibodies would be our factors (Independent Variables), which would influence
our outcome (Dependent Variable).
How is logistic regression different from linear regression?
In linear regression, the outcome is continuous and can be any possible value. However, in the
case of logistic regression, the predicted outcome is discrete and restricted to a limited number of
values.
For example, say we are trying to apply machine learning to the sale of a house. If we are trying
to predict the sale price based on the size, year built, and number of stories we would use linear
regression, as linear regression can predict a sale price of any possible value. If we are using
those same factors to predict if the house sells or not, we would use logistic regression, as the
possible outcomes here are restricted to yes or no.
Hence, linear regression is an example of a regression model and logistic regression is an
example of a classification model.
Where to use logistic regression?
Logistic regression is used to solve classification problems, and the most common use case is
binary logistic regression, where the outcome is binary (yes or no). In the real world, you can see
logistic regression applied across multiple areas and fields.
In health care, logistic regression can be used to predict if a tumor is likely to be benign or
malignant.
In the financial industry, logistic regression can be used to predict if a transaction is fraudulent or
not.
In marketing, logistic regression can be used to predict if a targeted audience will respond or not.
The three types of logistic regression
Binary logistic regression - When we have two possible outcomes, like our original example of
whether a person is likely to be infected with COVID-19 or not.
Multinomial logistic regression - When we have multiple outcomes, say if we build out our
original example to predict whether someone may have the flu, an allergy, a cold, or COVID-19.
MC4301 MACHINE LEARINIG
Ordinal logistic regression - When the outcome is ordered, like if we build out our original
example to also help determine the severity of a COVID-19 infection, sorting it into mild,
moderate, and severe cases.
Mathematics behind logistic regression
Probability always ranges between 0 (does not happen) and 1 (happens). Using our COVID-19
example, in the case of binary classification, the probability of testing positive and not testing
positive will sum up to 1.
We use logistic function or sigmoid function to calculate probability in logistic regression. The
logistic function is a simple S-shaped curve used to convert data into a value between 0 and 1.
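As a minimal sketch of the S-shaped curve described above, the logistic function sigmoid(z) = 1 / (1 + e^(-z)) can be written in a few lines of Python. The weights, bias, and the helper name `predict_proba` are hypothetical values chosen only for illustration; they are not taken from any fitted model.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function: maps any real number into the range 0 to 1."""
    # Two algebraically equal forms are used so exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical weights and bias for two input factors (illustrative only).
weights = [0.8, -0.4]
bias = -0.2

def predict_proba(x):
    """Probability of the 'yes' outcome: sigmoid of a linear combination of the inputs."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

print(sigmoid(0.0))               # the midpoint of the curve
print(predict_proba([1.0, 1.0]))  # a probability strictly between 0 and 1
```

A probability above a chosen cut-off (0.5 is a common assumption) would then be read as "yes" and anything below it as "no".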
Classification and representation
What is Supervised Learning?
In Supervised Learning, the model learns by example. Along with our input variable, we also
give our model the corresponding correct labels. While training, the model gets to look at which
label corresponds to our data and hence can find patterns between our data and those labels.
Some examples of Supervised Learning include:
1. Spam detection, by teaching a model which mail is spam and which is not.
2. Speech recognition, where you teach a machine to recognize your voice.
3. Object recognition, by showing a machine what an object looks like and having it pick that
object from among other objects.
We can further divide Supervised Learning into the following:
Figure 1: Supervised Learning Subdivisions
What is Classification?
Classification is defined as the process of recognition, understanding, and grouping of objects
and ideas into preset categories a.k.a “sub-populations.” With the help of these pre-categorized
training datasets, classification in machine learning programs leverage a wide range of
algorithms to classify future datasets into respective and relevant categories.
Classification algorithms used in machine learning utilize input training data for the purpose of
predicting the likelihood or probability that the data that follows will fall into one of the
predetermined categories. One of the most common applications of classification is for filtering
emails into “spam” or “non-spam”, as used by today’s top email service providers.
In short, classification is a form of “pattern recognition.” Here, classification algorithms applied
to the training data find the same pattern (similar number sequences, words or sentiments, and
the like) in future data sets.
We will explore classification algorithms in detail, and discover how a text analysis software can
perform actions like sentiment analysis - used for categorizing unstructured text by opinion
polarity (positive, negative, neutral, and the like).
What is Classification Algorithm?
The Classification algorithm is a Supervised Learning technique used to categorize new
observations based on training data. In classification, a program uses the dataset or observations
provided to learn how to categorize new observations into various classes or groups, for
instance 0 or 1, red or blue, yes or no, spam or not spam. Classes can also be called targets,
labels, or categories. Because it is a supervised learning technique, the Classification algorithm
uses labeled data that comprises both input and output information. In the classification
process, an input variable (x) is mapped to a discrete output (y).
In simple words, classification is a type of pattern recognition in which classification algorithms
are performed on training data to discover the same pattern in new data sets.
Learners in Classification Problems
There are two types of learners.
Lazy Learners
A lazy learner first stores the training dataset and waits for the test dataset to arrive.
Classification is then carried out using the most closely related data in the stored training
dataset. Less time is spent on training, but more time is spent on predictions.
Some examples are case-based reasoning and the KNN algorithm.
Eager Learners
Before obtaining a test dataset, eager learners build a classification model using a training
dataset. They spend more time on training and less time on predicting. Some examples are
artificial neural networks (ANN), naive Bayes, and decision trees.
Types Of Classification Tasks In Machine Learning
Before diving into the four types of Classification Tasks in Machine Learning, let us first discuss
Classification Predictive Modeling.
Classification Predictive Modeling
A classification problem in machine learning is one in which a class label is anticipated for a
specific example of input data.
Examples of classification problems include the following:
Given an example, indicate whether it is spam or not.
Identify a handwritten character as one of the recognized characters.
Determine whether to label the current user behavior as churn.
A training dataset with numerous examples of inputs and outputs is necessary for classification
from a modeling standpoint.
A model will determine the optimal way to map samples of input data to certain class labels
using the training dataset. The training dataset must therefore contain a large number of samples
of each class label and be suitably representative of the problem.
When providing class labels to a modeling algorithm, string values like "spam" or "not spam"
must first be converted to numeric values. Label encoding, which is frequently used, assigns a
distinct integer to every class label, such as "spam" = 0, "not spam" = 1.
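Label encoding can be sketched in a few lines of plain Python. The function name `fit_label_encoder` and the sample labels are assumptions made for illustration. Which integer each label receives is arbitrary; this sketch assigns integers in sorted order, so "not spam" happens to get 0 and "spam" gets 1.

```python
def fit_label_encoder(labels):
    """Assign a distinct integer to every class label (sorted order, an arbitrary choice)."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

# Hypothetical string labels for five emails.
labels = ["spam", "not spam", "spam", "not spam", "not spam"]

mapping = fit_label_encoder(labels)
encoded = [mapping[label] for label in labels]

print(mapping)
print(encoded)
```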
There are many different types of classification algorithms for classification predictive
modeling problems.
Because there is no strong theory on how to map algorithms onto problem types, it is typically
advised that a practitioner run controlled experiments to determine which algorithm
and algorithm configuration produces the best performance for a given classification task.
Classification predictive modeling algorithms are assessed based on their results. A common
metric for assessing a model's performance based on predicted class labels is classification
accuracy. Although not perfect, classification accuracy is a reasonable place to start for many
classification tasks.
Some tasks may call for a class membership probability prediction for each example rather than
class labels. This conveys the uncertainty in the prediction, which a user or application can
subsequently interpret. The ROC curve is a popular diagnostic for assessing predicted
probabilities.
There are four different types of Classification Tasks in Machine Learning, and they are as
follows -
Binary Classification
Multi-Class Classification
Multi-Label Classification
Imbalanced Classification
Binary Classification
Classification tasks with only two class labels are referred to as binary classification.
Examples include -
Prediction of conversion (buy or not).
Churn forecast (churn or not).
Detection of spam email (spam or not).
Binary classification problems typically involve two classes, one representing the normal state and
the other representing the abnormal state.
For instance, the normal state is "not spam," while the abnormal state is "spam." Another
illustration is when a task involving a medical test has a normal state of "cancer not
detected" and an abnormal state of "cancer detected."
Class label 0 is given to the class in the normal state, whereas class label 1 is given to the class in
the abnormal condition.
A model that forecasts a Bernoulli probability distribution for each case is frequently used to
represent a binary classification task.
The discrete probability distribution known as the Bernoulli distribution deals with the situation
where an event has a binary result of either 0 or 1. In terms of classification, this indicates that
the model forecasts the likelihood that an example would fall within class 1, or the abnormal
state.
The following are well-known binary classification algorithms:
Logistic Regression
Support Vector Machines
Naive Bayes
Decision Trees
Some algorithms, such as Support Vector Machines and Logistic Regression, were created
expressly for binary classification and do not natively support more than two classes.
Multi-Class Classification
Multi-class classification refers to classification tasks that have more than two class labels.
Examples include –
Face classification.
Classifying plant species.
Optical character recognition.
Multi-class classification does not have the idea of normal and abnormal outcomes, in
contrast to binary classification. Instead, examples are assigned to one of several known
classes.
In some cases, the number of class labels could be rather high. In a facial recognition system, for
instance, a model might predict that a photo belongs to one of thousands or tens of thousands of
faces.
Text translation models and other problems involving word prediction could be categorized as a
particular case of multi-class classification. Each word in the sequence of words to be predicted
requires a multi-class classification, where the vocabulary size determines the number of
possible classes that may be predicted and may range from tens of thousands to hundreds of
thousands of words.
Multi-class classification tasks are frequently modeled using a model that
forecasts a Multinoulli probability distribution for each example.
The Multinoulli distribution is a discrete probability distribution that covers an event with a
categorical outcome, such as k in {1, 2, 3, ..., K}. In terms of classification, this implies
that the model predicts the probability that a given example belongs to each class label.
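One common way to obtain such a per-class probability distribution is the softmax function. The text above does not name it, so treat it here as an assumed technique rather than the method these notes prescribe. The class names and raw scores below are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that are positive and sum to 1."""
    m = max(scores)  # subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical class names and raw model scores for one example.
class_names = ["flu", "allergy", "cold", "covid"]
scores = [2.0, 0.5, 1.0, 0.1]

probs = softmax(scores)
predicted = class_names[probs.index(max(probs))]

print(probs)      # one probability per class
print(predicted)  # the class with the highest probability
```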
For multi-class classification, many binary classification techniques are applicable.
The following well-known algorithms can be used for multi-class classification:
Gradient Boosting
Decision Trees
k-Nearest Neighbors
Random Forest
Naive Bayes
Multi-class problems can be solved using algorithms created for binary classification. To do
this, multiple binary classification models are fitted using one of two strategies: one model for
each class versus all other classes (called one-vs-rest), or one model for each pair of classes
(called one-vs-one).
One-vs-Rest: Fit a single binary classification model for each class versus all other classes.
One-vs-One: For each pair of classes, fit a single binary classification model.
The following binary classification algorithms can apply these multi-class classification
techniques:
Support vector Machine
Logistic Regression
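To make the difference between the two strategies concrete, the sketch below only enumerates the binary problems each strategy would create; it does not train any model. The function names and the four class names are assumptions for illustration. With K classes, one-vs-rest needs K binary models and one-vs-one needs K(K-1)/2.

```python
from itertools import combinations

def one_vs_rest_tasks(classes):
    """One binary problem per class: that class versus all remaining classes."""
    return [(c, [other for other in classes if other != c]) for c in classes]

def one_vs_one_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

# Hypothetical class labels.
classes = ["flu", "allergy", "cold", "covid"]

ovr = one_vs_rest_tasks(classes)
ovo = one_vs_one_tasks(classes)

print(len(ovr), "one-vs-rest models")
print(len(ovo), "one-vs-one models")
```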
Multi-Label Classification
Multi-label classification problems are those that feature two or more class labels and allow for
the prediction of one or more class labels for each example.
Think about the photo classification example. Here a model can predict the existence of many
known things in a photo, such as “person”, “apple”, "bicycle," etc. A particular photo may have
multiple objects in the scene.
This greatly contrasts with multi-class classification and binary classification, which anticipate a
single class label for each occurrence.
Multi-label classification problems are frequently modeled using a model that forecasts many
outcomes, with each outcome being forecast as a Bernoulli probability distribution. In essence,
this approach predicts several binary classifications for each example.
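The idea of several independent binary decisions per example can be sketched as follows. The per-label probabilities, the 0.5 threshold, and the function name `predict_labels` are assumptions for illustration; in practice the probabilities would come from a trained model.

```python
def predict_labels(probabilities, threshold=0.5):
    """Treat each label as its own yes/no decision and keep those at or above the threshold."""
    return [label for label, p in probabilities.items() if p >= threshold]

# Hypothetical per-label probabilities for one photo.
# Note that they do not need to sum to 1, unlike multi-class probabilities.
photo_probs = {"person": 0.92, "apple": 0.08, "bicycle": 0.71}

present = predict_labels(photo_probs)
print(present)
```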
Classification algorithms used for multi-class or binary classification cannot be applied directly
to multi-label classification. Specialized versions of the conventional classification algorithms,
the so-called multi-label versions, can be used instead, including:
Multi-label Gradient Boosting
Multi-label Random Forests
Multi-label Decision Trees
Another strategy is to use a separate classification algorithm to predict the labels for each class.
Imbalanced Classification
The term "imbalanced classification" describes classification tasks where the distribution of
examples within each class is not equal.
Imbalanced classification tasks are typically binary classification tasks in which a majority of
the training dataset's examples belong to the normal class, while a minority belong to the
abnormal class.
Examples include –
Medical diagnostic tests
Detection of outliers
Fraud detection
Types of classification algorithms
There are many different types of classification algorithms. While they have overlapping
use cases, some are more suited to particular applications than others. Some of the most popular
classification algorithms include:
Logistic regression
Decision tree
Random forest
Support vector machine (SVM)
K-nearest neighbors
Naive Bayes
Many of these algorithms can be readily implemented in Python with the use of the
scikit-learn library. Meanwhile, ensemble methods and transformer models are newer developments
being applied to classification problems.
Logistic regression
Logistic regression algorithms are often used to perform classification tasks. Logistic
regression is a probability classifier derived from linear regression models. Linear regression
uses one or more independent variables to predict the value of a dependent variable. This
value can be any continuous real number.
Logistic regression is a modification of linear regression such that the output value (or
dependent variable) is limited to a value between 0 and 1. It does this by applying a logit
(log odds) transformation to the standard linear regression formula.4
Logistic regression models are used for binary classification in multivariate regression
problems: when considering multiple variables, does the data point belong to one category or the
other? Common applications are fraud detection and biomedical predictions. For instance,
logistic regression has been implemented to help predict patient mortality induced by trauma and
coronary heart disease.
Decision tree
Used for both classification and regression, decision trees split datasets into progressively
smaller groups in a series of binary classification judgments. The resulting structure resembles a
tree, branching outward from an initial judgment into subsequent leaves or nodes.
The flowchart-like nature of decision trees makes them one of the more intuitive models
for business users to understand. Easy to visualize, decision trees bring transparency to the
classification process by clearly representing the decision processes and criteria used to
categorize data.
Random forest
The random forest is an ensemble technique combining the output of multiple decision
trees into a single result. The resulting “forest” improves prediction accuracy over that of a single
tree while countering overfitting. Like decision trees, random forests can handle both
classification and regression tasks.
Random forest algorithms create multiple decision trees for each task, aggregate the
prediction of all the trees, then choose the most popular answer as the definitive result. Each tree
considers a random subset of data features, helping ensure low correlation between trees.
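The aggregation step described above (choosing the most popular answer across trees) can be sketched as a simple majority vote. The individual tree predictions below are hypothetical; building the trees themselves is outside the scope of this sketch.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from five decision trees for a single email.
tree_predictions = ["spam", "not spam", "spam", "spam", "not spam"]

forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)
```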
Support vector machine (SVM)
Support vector machine (SVM) algorithms plot data points into a multidimensional
space, with the number of dimensions corresponding to the number of features in the data. The
algorithm’s goal is to discover the optimal line—also known as a hyperplane or decision
boundary—that best divides the data points into categories.
The optimal hyperplane is the one with the widest margin, which is the distance between
the hyperplane and the nearest data points in each class. These nearby data points are known as
support vectors. Models that separate data with a hyperplane are linear models, but SVM
algorithms can also handle nonlinear classification tasks with more complex datasets.
Logistic regression, decision trees, random forests and SVM algorithms are all examples
of eager learners: algorithms that construct models from training data and then apply those
models to future predictions. Training takes longer, but after the algorithm builds a good model,
predictions are quicker.
K-nearest neighbors (KNN)
K-nearest neighbors (KNN) algorithms map data points onto a multidimensional space. They
then group data points with similar feature values into separate groups, or classes. To
classify new data samples, the classifier looks at the k points nearest to the new data,
counts the members of each class in that neighboring subset, and returns the proportion
of each class as the class estimate for the new data point.
In other words, the model assigns a new data point to whichever class comprises the
majority of that point’s neighbors. KNN models are lazy learners: algorithms that don’t
immediately build a model from training data, but instead refer to training data and compare new
data to it. It typically takes longer for these models to make predictions than eager learners.
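KNN models typically compare distance between data points with Euclidean distance:6
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
where x and y are two data points with n features each.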
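A minimal KNN classifier using Euclidean distance and a majority vote among the k nearest neighbors might look like the sketch below. The training points, labels, function names, and the choice k = 3 are all assumptions for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points with the same number of features."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Assign the query to the majority class among its k nearest training points."""
    # Lazy learning: nothing is fitted in advance; all work happens at prediction time.
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training data forming two well-separated groups.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (2.0, 2.0)))
print(knn_predict(train_points, train_labels, (6.0, 7.0)))
```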
Approximate nearest neighbor (ANN) is a variant of KNN. In high-dimensional data
spaces, it is computationally expensive to find a data point’s exact neighbors. Dimensionality
reduction and ANN are two solutions to this issue.
Rather than find a data point’s exact nearest neighbor, ANN finds an approximate nearest
neighbor within a given distance. Recent research has shown promising results for ANN in the
context of multilabel classification.7
Naive Bayes
Based on Bayes’ theorem, Naive Bayes classifiers calculate posterior probability for class
predictions. Naive Bayes updates initial class predictions, or prior probabilities, with each new
piece of data.
With a diabetes predictor, the patient’s medical data—blood pressure, age, blood sugar
levels, and more—are the independent variables. A Bayesian classifier combines the current
prevalence of diabetes across a population (prior probability) with the conditional probability of
the patient’s medical data values appearing in someone with diabetes.
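Naive Bayes classifiers follow the Bayes’ Rule equation:
$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$$
where P(Y) is the prior probability of class Y, P(X | Y) is the conditional probability of
observing the data X in that class, and P(Y | X) is the resulting posterior probability.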
Naive Bayes is known as a generative classifier. By using an observation’s variable
values, the Bayesian classifier calculates which class is most likely to have generated the
observation.
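As a worked example of the diabetes predictor described above, the sketch below applies Bayes' rule to a single binary observation (for example, "blood sugar is high"). All three input numbers are made up for illustration and are not real medical statistics; the function name `posterior` is likewise an assumption.

```python
def posterior(prior, likelihood_given_class, likelihood_given_other):
    """Bayes' rule for a two-class problem: P(class | observation)."""
    # Total probability of the observation across both classes (the denominator).
    evidence = likelihood_given_class * prior + likelihood_given_other * (1.0 - prior)
    return likelihood_given_class * prior / evidence

# Hypothetical numbers, for illustration only:
prior_diabetes = 0.10             # assumed prevalence in the population
p_high_sugar_if_diabetes = 0.80   # assumed P(observation | diabetes)
p_high_sugar_if_healthy = 0.20    # assumed P(observation | no diabetes)

p = posterior(prior_diabetes, p_high_sugar_if_diabetes, p_high_sugar_if_healthy)
print(round(p, 4))
```

Under these assumed numbers the observation raises the probability of diabetes from the prior of 0.10 to roughly 0.31, which is the "updating" of the prior that the text refers to.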
Natural language processing (NLP) researchers have widely applied Naïve Bayes for text
classification tasks, such as sentiment analysis. Using a bag of words model, in which each word
constitutes a variable, the Naive Bayes classifier predicts whether a positive or negative class
produced the text in question.
Types of ML Classification Algorithms
1. Supervised Learning Approach
The supervised learning approach explicitly trains algorithms under close human
supervision. Both the input and the output data are first provided to the algorithm. The
algorithm then develops rules that map the input to the output. The training procedure
is repeated until the highest level of performance is attained.
The two types of supervised learning approaches are:
Regression
Classification
2. Unsupervised Learning
This approach is applied to examine data's inherent structure and derive insightful
information from it. This technique looks for insights that can produce better results
by looking for patterns and insights in unlabeled data.
There are two types of unsupervised learning:
Clustering
Dimensionality reduction
3. Semi-supervised Learning
Semi-supervised learning lies on the spectrum between unsupervised and supervised
learning. It combines the most significant aspects of both worlds to provide a unique
set of algorithms.
4. Reinforcement Learning
The goal of reinforcement learning is to create autonomous, self-improving
algorithms. The algorithm's goal is to improve itself through a continual cycle of trials
and errors based on the interactions and combinations between the incoming and
labeled data.
Classification Models
Naive Bayes: Naive Bayes is a classification algorithm that assumes that predictors in a dataset
are independent. This means that it assumes the features are unrelated to each other. For
example, if given a banana, the classifier will see that the fruit is of yellow color, oblong-shaped
and long and tapered. All of these features will contribute independently to the probability of it
being a banana and are not dependent on each other.
Decision Trees: A Decision Tree is an algorithm that is used to visually represent decision-
making. A Decision Tree can be made by asking a yes/no question and splitting the answer to
lead to another decision. The question is at the node and it places the resulting decisions below at
the leaves. The tree depicted below is used to decide if we can play tennis.
Figure 4: Decision Tree
In the above figure, depending on the weather conditions and the humidity and wind,
we can systematically decide if we should play tennis or not. In decision trees, all the
False statements lie on the left of the tree and the True statements branch off to the
right. Knowing this, we can make a tree which has the features at the nodes and the
resulting classes at the leaves.
K-Nearest Neighbors: K-Nearest Neighbor is a classification and prediction algorithm that is
used to divide data into classes based on the distance between the data points. K-Nearest
Neighbor assumes that data points which are close to one another must be similar and hence, the
data point to be classified will be grouped with the closest cluster.
Figure 5: Data to be classified
Figure 6: Classification using K-Nearest
Neighbours
Evaluating a Classification Model
After our model is trained, we must assess its performance, whether it is
a regression or a classification model. We have the following options for assessing a
classification model:
1. Confusion Matrix
The confusion matrix describes the model performance and gives us a matrix or table as an
output.
The error matrix is another name for it.
The matrix is made up of the results of the forecasts in a condensed manner, together with the
total number of right and wrong guesses.
The matrix appears in the following table:

                      Actual Positive    Actual Negative
Predicted Positive    True Positive      False Positive
Predicted Negative    False Negative     True Negative

Accuracy = (TP + TN) / Total Population
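The four cells of the confusion matrix and the accuracy formula can be computed directly from lists of actual and predicted labels. The labels below are hypothetical, with 1 standing for the positive class; the function name is an assumption.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, and true negatives."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive and actual != positive:
            fp += 1
        elif predicted != positive and actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical actual and predicted labels for eight examples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(tp, fp, fn, tn)
print(accuracy)
```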
2. Log Loss or Cross-Entropy Loss
It is used to assess the performance of a classifier whose output is a probability value between 0
and 1.
A successful binary classification model should have a log loss value that is close to 0.
The value of log loss rises as the predicted probability diverges from the actual label.
A lower log loss indicates a more accurate model.
Cross-entropy for binary classification can be calculated as:
$$-\left(y \log(p) + (1 - y) \log(1 - p)\right)$$
where p = predicted output (the probability of class 1) and y = actual output (0 or 1).
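The cross-entropy formula can be checked numerically with a short sketch. Averaging over all examples and clipping probabilities away from exactly 0 and 1 (to avoid log(0)) are common conventions assumed here, not details stated in the text; the sample labels and probabilities are hypothetical.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-15):
    """Average log loss over all examples; lower values are better."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # assumed clipping so log() never sees 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [1, 0]

confident_and_right = binary_cross_entropy(y_true, [0.9, 0.1])
confident_and_wrong = binary_cross_entropy(y_true, [0.1, 0.9])

print(confident_and_right)  # close to 0
print(confident_and_wrong)  # much larger
```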
3. AUC-ROC Curve
AUC is for Area Under the Curve, and ROC refers to Receiver Operating Characteristics Curve.
It is a graph that displays the classification model's performance at various thresholds.
The AUC-ROC Curve is used to show how well the multi-class classification model performs.
The TPR and FPR are used to draw the ROC curve, with the True Positive
Rate (TPR) on the Y-axis and the FPR (False Positive Rate) on the X-axis.
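Each point on the ROC curve comes from one classification threshold. The sketch below computes the (FPR, TPR) pair for a few thresholds on hypothetical scores. It assumes that both classes appear in the data (otherwise a division by zero would occur) and that a score at or above the threshold counts as a positive prediction.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when scores at or above the threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)  # true positive rate (Y-axis)
    fpr = fp / (fp + tn)  # false positive rate (X-axis)
    return fpr, tpr

# Hypothetical labels and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

roc_points = [roc_point(y_true, scores, t) for t in (1.0, 0.5, 0.0)]
for point in roc_points:
    print(point)
```

Lowering the threshold moves the point from the bottom-left corner (nothing predicted positive) toward the top-right corner (everything predicted positive).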
Use Cases Of Classification Algorithms
There are many applications for classification algorithms. Here are a few of them
Speech Recognition
Detecting Spam Emails
Categorization of Drugs
Cancer Tumor Cell Identification
Biometric Authentication, etc.
Representation
A machine learning model can't directly see, hear, or sense input examples. Instead, you must
create a representation of the data to provide the model with a useful vantage point into the data's
key qualities. That is, in order to train a model, you must choose the set of features that best
represent the data. The choice of representation has an enormous effect on the performance of
machine learning algorithms. In the context of neural networks, Chollet says that layers extract
representations.
The core building block of neural networks is the layer, a data-processing module that you can
think of as a filter for data. Some data goes in, and it comes out in a more useful form.
Specifically, layers extract representations out of the data fed into them--hopefully,
representations that are more meaningful for the problem at hand.
Most of deep learning consists of chaining together simple layers that will implement a form of
progressive data distillation. A deep-learning model is like a sieve for data processing, made of a
succession of increasingly refined data filters--the layers.
That makes me think that representations are the form that the training/test data takes as it is
progressively transformed. e.g. words could initially be represented as dense or sparse (one-hot
encoded) vectors. And then their representation changes one or more times as they are fed into a
model.
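The sparse (one-hot) word representation mentioned above can be illustrated with a tiny vocabulary. The vocabulary and the function name `one_hot` are hypothetical; real vocabularies are far larger, which is why one-hot vectors are called sparse.

```python
def one_hot(word, vocabulary):
    """Represent a word as a vector with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

# Hypothetical three-word vocabulary.
vocabulary = ["cat", "dog", "fish"]

encoded_sentence = [one_hot(w, vocabulary) for w in ["dog", "cat", "dog"]]
print(encoded_sentence)
```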
Mitchell says that we need to choose a representation for the target function.
Now that we have specified the ideal target function V, we must choose a representation
that the learning program will use to describe the function V^ that it will learn.
This makes me think that the 'representation' could be described as the architecture of the model,
or maybe a mathematical description of the model. With this definition, we don't know the true
representation (equation) of the target function (if we did we would have nothing to learn). So it
is our task to decide what equation we want to use to best approximate the target function.
Cost Function in Machine Learning
A Machine Learning model should have a very high level of accuracy in order to perform
well with real-world applications. But how do we calculate the accuracy of the model, i.e.,
how well or poorly will our model perform in the real world? This is where the cost
function comes in. It is an important measure for correctly evaluating the model.
The cost function also plays a crucial role in understanding how well your model estimates
the relationship between the input and output parameters.
What is Cost Function?
A cost function is an important measure of how well a machine learning
model performs for a given dataset. It calculates the difference between the expected value
and predicted value and represents it as a single real number.
In simple terms, “Cost function is a measure of how wrong the model is in estimating the
relationship between X (input) and Y (output).” A cost function is sometimes also
referred to as a loss function, and it can be estimated by iteratively running the model to
compare estimated predictions against the known values of Y.
The main aim of each ML model is to determine parameters or weights that can minimize the
cost function.
Why use Cost Function?
There are various accuracy measures, so why do we need a cost function for a machine
learning model? We can understand this with an example of classifying data. Suppose we
have a dataset that contains the heights and weights of cats and dogs, and we need to
classify them accordingly. If we plot the records using these two features, we get a
scatter plot as below:
In the above image, the green dots are cats, and the yellow dots are dogs. Below
are the three possible solutions for this classification problem.
In the above solutions, all three classifiers have high accuracy, but the third solution is the
best because it correctly classifies each data point. The reason is that its decision boundary
lies midway between the two classes, neither too close to nor too far from either of them.
To get such results, we need a cost function. It calculates the difference between the actual
values and the predicted values and so measures how wrong our model was in its prediction.
By minimizing the value of the cost function, we can get the optimal solution.
What Is Cost Function in Machine Learning?
After training your model, you need to see how well it is performing. While accuracy
functions tell you how well the model is performing, they do not give you any insight into
how to improve it. Hence, you need a correctional function that can help you compute
when the model is the most accurate, as you need to hit that small spot between an undertrained
model and an overtrained model.
A Cost Function is used to measure just how wrong the model is in finding a relation between the
input and output. It tells you how badly your model is behaving or predicting. Consider a robot
trained to stack boxes in a factory. The robot might have to consider certain changeable
parameters, called Variables, which influence how it performs. Let’s say the robot comes across
an obstacle, like a rock. The robot might bump into the rock and realize that it is not the correct
action.
It will learn from this, and next time it will avoid rocks. Hence, your machine uses
variables to better fit the data. The outcome of each such encounter further optimizes the robot
and helps it perform better. It will generalize and learn to avoid obstacles in general, such as a
fire that might have broken out. The outcome acts as a cost function, which helps you optimize the
variables to get the best fit for the model.
Figure 1: Robot learning to avoid obstacles
What Is Gradient Descent?
Gradient Descent is an algorithm that is used to minimize the cost function, i.e., the error of
the model. It searches for the parameter values that give the lowest possible error. The gradient
tells you the direction in which the error changes fastest, so moving against it is the quickest
way to reduce the error without wasting resources.
Gradient Descent can be visualized as a ball rolling down a hill. Here, the ball will roll to the
lowest point on the hill. This point corresponds to the point where the error is least: for a
simple bowl-shaped (convex) cost function, the error is minimum at one point and increases again
beyond it.
In gradient descent, you compute the error of your model for the current values of its variables
and then adjust them. This is repeated, and the error values keep getting smaller and smaller.
Eventually you arrive at the values of the variables for which the error is least and the cost
function is optimized.
Figure 2: Gradient Descent
What Is the Cost Function For Linear Regression?
A Linear Regression model uses a straight line to fit the data. This is done using the equation
for a straight line, y = a + bx, as shown :
Figure 3: Linear regression function
In the equation, you can see that two entities can have changeable values (variables): a, which is
the point at which the line intercepts the y-axis, and b, which is how steep the line will be, or
the slope.
At first, if the variables are not properly optimized, you get a line that might not properly fit the
data. As you optimize the values of the variables, you will eventually get the best fit.
The best fit will be a straight line running through most of the data points while ignoring the
noise and outliers. A properly fit Linear Regression model looks as shown below :
Figure 4: Linear regression graph
For the Linear regression model, the cost function is the Root Mean Squared Error of the model,
computed from the differences between the actual values and the predicted values. Training
looks for the values of the variables that minimize this error.
Figure 5: Linear regression cost function
By the definition of gradient descent, you have to find the direction in which the error decreases
fastest. This is done by differentiating the cost function with respect to each variable. The
derivative, scaled by the learning rate, is then subtracted from the current value of the
variable to move down the slope.
Figure 6: Linear regression gradient descent function
In the above equations, α (alpha) is known as the learning rate. It decides how fast you move down
the slope. If alpha is large, you take big steps, and if it is small, you take small steps. If alpha
is too large, you can entirely miss the point of least error and the results will not be accurate.
If it is too small, it will take too long to optimize the model and you will also waste
computational power. Hence you need to choose an optimal value of alpha.
Figure 8: (a) Large learning rate, (b) Small learning rate, (c) Optimum learning rate
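The update rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the line is written as y = a + b*x, the cost is the plain mean squared error (the square root is left out because it does not change where the minimum lies), and the function name, the data, and the values of alpha and steps are all made up for the example.

```python
def fit_line(xs, ys, alpha=0.05, steps=2000):
    """Fit y = a + b*x by gradient descent on the mean squared error."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Error of the current line on every data point.
        errors = [(a + b * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/n) * sum(errors**2) with respect to a and b.
        grad_a = (2.0 / n) * sum(errors)
        grad_b = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient, scaled by the learning rate alpha.
        a -= alpha * grad_a
        b -= alpha * grad_b
    return a, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 1 + 2x, no noise
a, b = fit_line(xs, ys)
print(a, b)  # close to intercept 1 and slope 2
```

With this data a much larger alpha makes the updates diverge, while a much smaller one needs far more steps, which is the trade-off shown in Figure 8.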
What Is the Cost Function for Neural Networks?
A neural network is a machine learning model that takes in multiple inputs, passes them
through layers of units that each compute a weighted sum of their inputs followed by an
activation function, and combines the results to get the final output.
The cost function of a neural network is computed from the error between the network's final
output and the target values, summed over the output units and the training examples. During
training, this error is propagated backwards so that the share of each layer can be found. In
the end, a neural network with cost function optimization can be represented as :
Figure 9: Neural network with the error function
For neural networks, the cost function is generally not convex, so it can have several
local minima, each with its own minimum error value. Depending on where you start, gradient
descent can arrive at a different minimum. Ideally, you want the lowest of all the local
minima. This value is called the global minimum.
Figure 10: Cost function graph for Neural Networks
The cost function for neural networks is given as :
Figure 11: Cost function for Neural Networks
Gradient descent uses the derivative of the cost function to update the weights. It is given as :
Figure 12: Gradient descent for Neural Networks
Types of the cost function
There are many cost functions in machine learning and each has its use cases depending
on whether it is a regression problem or classification problem.
1. Regression cost Function
2. Binary Classification cost Functions
3. Multi-class Classification cost Functions
1. Regression cost Function:
Regression models deal with predicting a continuous value, for example the salary of an employee,
the price of a car, loan prediction, etc. A cost function used in a regression problem is called a
“Regression Cost Function”. These are calculated from the distance-based error as follows:
Error = y - y’
Where,
y – Actual output
y’ – Predicted output
The most used Regression cost functions are below,
1.1 Mean Error (ME)
In this cost function, the error for each training example is calculated and then the mean value
of all these errors is derived.
Calculating the mean of the errors is the simplest and most intuitive way possible.
The errors can be both negative and positive, so they can cancel each other out during
summation, giving zero mean error even for a poor model.
Thus this is not a recommended cost function, but it does lay the foundation for other cost
functions of regression models.
1.2 Mean Squared Error (MSE)
This addresses the drawback we encountered in Mean Error above. Here the square of the
difference between the actual and predicted value is calculated, so that no error can be
negative.
It is measured as the average of the squared differences between predictions and actual
observations.
MSE = (sum of squared errors)/n
It is also known as L2 loss.
In MSE, since each error is squared, large deviations in prediction are penalized much more
heavily than in MAE. But if our dataset has outliers that contribute larger prediction errors,
then squaring magnifies those errors many times over and leads to a higher MSE.
Hence we can say that it is less robust to outliers.
1.3 Mean Absolute Error (MAE)
This cost function also addresses the shortcoming of mean error, but in a different way. Here the
absolute difference between the actual and predicted value is calculated, so that no error can be
negative. So in this cost function, MAE is measured as the average of the absolute
differences between predictions and actual observations.
MAE = (sum of absolute errors)/n
It is also known as L1 Loss.
It is robust to outliers, so it gives better results even when our dataset has noise or outliers.
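The three regression cost functions above can be compared on a small made-up example. The sketch below uses hypothetical data and function names; it shows the errors cancelling in ME and a single outlier inflating MSE far more than MAE.

```python
def mean_error(actual, predicted):
    return sum(y - p for y, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [4.0, 4.0, 8.0, 8.0]       # errors: -1, +1, -1, +1
with_outlier = [4.0, 4.0, 8.0, 19.0]   # last prediction is off by 10

print(mean_error(actual, predicted))            # 0.0, the errors cancel out
print(mean_squared_error(actual, predicted))    # 1.0
print(mean_absolute_error(actual, predicted))   # 1.0
print(mean_squared_error(actual, with_outlier))   # 25.75
print(mean_absolute_error(actual, with_outlier))  # 3.25
```

Every prediction in the first set is wrong by 1, yet the mean error is zero, which is why ME is not used on its own.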
2. Cost functions for Classification problems
Cost functions used in classification problems are different than what we use in the regression
problem. A commonly used loss function for classification is the cross-entropy loss. Let us
understand cross-entropy with a small example. Consider that we have a classification problem
of 3 classes as follows.
Class(Orange,Apple,Tomato)
The machine learning model will give a probability distribution of these 3 classes as output for a
given input data. The class with the highest probability is considered as a winner class for
prediction.
Output = [P(Orange),P(Apple),P(Tomato)]
The actual probability distribution for each class is shown below.
Orange = [1,0,0]
Apple = [0,1,0]
Tomato = [0,0,1]
If, during the training phase, the input class is Tomato, the predicted probability distribution
should tend towards the actual probability distribution of Tomato. If the predicted probability
distribution is not close to the actual one, the model has to adjust its weights. This is where
cross-entropy becomes a tool to calculate how far the predicted probability distribution is from
the actual one. In other words, cross-entropy can be considered a way to measure the distance
between two probability distributions. The following image illustrates the intuition behind
cross-entropy:
Fig 3: Intuition behind cross-entropy (credit – machinelearningknowledge.ai)
This was just an intuition behind cross-entropy. It has its origin in information theory. Now with
this understanding of cross-entropy, let us now see the classification cost functions.
2.1 Multi-class Classification cost Functions
This cost function is used in classification problems where there are multiple classes and each
input belongs to only one class. Let us now understand how cross-entropy is calculated. Let
us assume that the model gives the probability distribution below for ‘n’ classes and for a
particular input data D.
p = [p1, p2, ..., pn]
And the actual or target probability distribution of the data D is
y = [y1, y2, ..., yn]
Then cross-entropy for that particular data D is calculated as
Cross-entropy loss(y, p) = -y^T log(p)
= -(y1 log(p1) + y2 log(p2) + ... + yn log(pn))
Let us now compute the cost using the above example (refer to the cross-entropy
image, Fig 3),
p(Tomato) = [0.1, 0.3, 0.6]
y(Tomato) = [0, 0, 1]
Cross-Entropy(y, p) = -(0*log(0.1) + 0*log(0.3) + 1*log(0.6)) = 0.51 (using the natural logarithm)
The above formula just measures the cross-entropy for a single observation or input data. The
error in classification for the complete model is given by categorical cross-entropy which is
nothing but the mean of cross-entropy for all N training data.
Categorical Cross-Entropy = (Sum of Cross-Entropy for N data)/N
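The Tomato calculation above can be reproduced with a short Python sketch. The function names and the second training example are assumptions made for illustration, and the natural logarithm is used, which is what gives the value 0.51.

```python
import math


def cross_entropy(y, p):
    """Cross-entropy of one example: -sum(y_i * log(p_i))."""
    # Terms with y_i = 0 contribute nothing, so they are skipped
    # (this also avoids evaluating log(0) for classes that are not the target).
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)


def categorical_cross_entropy(targets, predictions):
    """Mean cross-entropy over all N training examples."""
    losses = [cross_entropy(y, p) for y, p in zip(targets, predictions)]
    return sum(losses) / len(losses)


tomato_loss = cross_entropy([0, 0, 1], [0.1, 0.3, 0.6])
print(round(tomato_loss, 2))  # 0.51

# Hypothetical second example: an Orange predicted with probability 0.8.
mean_loss = categorical_cross_entropy(
    [[0, 0, 1], [1, 0, 0]],
    [[0.1, 0.3, 0.6], [0.8, 0.1, 0.1]],
)
```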
2.2 Binary Cross Entropy Cost Function
Binary cross-entropy is a special case of categorical cross-entropy in which there is only one
output, which takes a binary value of 0 or 1 to denote the negative and positive class
respectively. For example, classification between cat and dog.
Let us assume that the actual output is denoted by a single variable y and the predicted
probability of the positive class by p. Then the cross-entropy for a particular data D can be
simplified as follows:
Cross-entropy(D) = -y*log(p) when y = 1
Cross-entropy(D) = -(1-y)*log(1-p) when y = 0
Both cases are covered by the single expression -(y*log(p) + (1-y)*log(1-p)).
The error in binary classification for the complete model is given by binary cross-entropy which
is nothing but the mean of cross-entropy for all N training data.
Binary Cross-Entropy = (Sum of Cross-Entropy for N data)/N
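As a minimal sketch, binary cross-entropy can be written as follows. The function name and the sample values are hypothetical, and the clamping constant eps is an added assumption (it is not part of the definition above) that only keeps log() away from 0.

```python
import math


def binary_cross_entropy(ys, ps, eps=1e-12):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    total = 0.0
    for y, p in zip(ys, ps):
        # Assumption: clamp p into (0, 1) so log() is never called on 0.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(ys)


single = binary_cross_entropy([1], [0.6])            # same as -log(0.6)
confident = binary_cross_entropy([1, 0], [0.9, 0.1])  # both predictions good
wrong = binary_cross_entropy([1, 0], [0.1, 0.9])      # both predictions bad
```

Confident correct predictions give a small loss, while confident wrong predictions are penalized heavily.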
Gradient descent
What is gradient descent?
Gradient descent is an optimization algorithm which is commonly used to train machine learning
models and neural networks. Training data helps these models learn over time, and the cost
function within gradient descent specifically acts as a barometer, gauging its accuracy with each
iteration of parameter updates. Until the cost function stops decreasing, i.e., it is at or close
to its minimum (ideally zero), the model will continue to adjust its parameters to yield the
smallest possible error. Once machine learning models are optimized for accuracy, they can be
powerful tools for artificial intelligence (AI) and computer science applications.
How does gradient descent work?
Before we dive into gradient descent, it may help to review some concepts from linear
regression. You may recall the following equation of a line, which is y = mx + b,
where m represents the slope and b is the intercept on the y-axis. You may also recall plotting a
scatterplot in statistics and finding the line of best fit, which required calculating the error
between the actual output and the predicted output (y-hat) using the mean squared error formula.
The gradient descent algorithm behaves similarly, but it is based on a convex function, such as
the one below:
The starting point is just an arbitrary point for us to evaluate the performance. From that starting
point, we will find the derivative (or slope), and from there, we can use a tangent line to observe
the steepness of the slope. The slope will inform the updates to the parameters—i.e. the weights
and bias. The slope at the starting point will be steeper, but as new parameters are generated, the
steepness should gradually reduce until it reaches the lowest point on the curve, known as the
point of convergence.
Similar to finding the line of best fit in linear regression, the goal of gradient descent is to
minimize the cost function, or the error between predicted and actual y. In order to do this, it
requires two pieces of information: a direction and a learning rate. These factors determine the
partial derivative calculations of future iterations, allowing it to gradually arrive at the local
or global minimum (i.e. point of convergence).
More detail on these components can be found below:
Learning rate (also referred to as step size or the alpha) is the size of the steps that are taken to
reach the minimum. This is typically a small value, and it is evaluated and updated based on the
behavior of the cost function. High learning rates result in larger steps but risk overshooting the
minimum. Conversely, a low learning rate has small step sizes. While it has the advantage of
more precision, the number of iterations compromises overall efficiency as this takes more time
and computations to reach the minimum.
The cost (or loss) function measures the difference, or error, between actual y and predicted y at
its current position. This improves the machine learning model's efficacy by providing feedback
to the model so that it can adjust the parameters to minimize the error and find the local or global
minimum. It continuously iterates, moving along the direction of steepest descent (or the
negative gradient) until the slope of the cost function is close to or at zero. At this point, the
model will stop learning. Additionally, while the terms cost function and loss function are often
used interchangeably, there is a slight difference between them. A loss function
refers to the error of one training example, while a cost function calculates the average error
across an entire training set.
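The effect of the learning rate can be seen on a toy cost function. The sketch below assumes J(w) = w**2 (so the derivative is 2*w and the minimum is at w = 0); the function name and the three alpha values are chosen only for illustration.

```python
def descend(alpha, w=1.0, steps=50):
    """Run gradient descent on J(w) = w**2, whose derivative is 2*w."""
    for _ in range(steps):
        w -= alpha * 2.0 * w  # step against the slope, scaled by alpha
    return w


well_chosen = descend(alpha=0.1)   # shrinks steadily towards the minimum at 0
too_small = descend(alpha=0.001)   # barely moved from the start after 50 steps
too_large = descend(alpha=1.1)     # each step overshoots, so |w| keeps growing

print(well_chosen, too_small, too_large)
```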
Types of Gradient Descent
There are three types of gradient descent learning algorithms: batch gradient descent, stochastic
gradient descent and mini-batch gradient descent.
Batch gradient descent
Batch gradient descent sums the error for each point in a training set, updating the model only
after all training examples have been evaluated. This process is referred to as a training epoch.
While this batching provides computational efficiency, it can still have a long processing time for
large training datasets, as it needs to hold all of the data in memory. Batch gradient descent
also usually produces a stable error gradient and convergence, but sometimes that convergence
point isn’t ideal, settling in a local minimum rather than the global one.
Stochastic gradient descent
Stochastic gradient descent (SGD) processes the training examples one at a time within each epoch
and updates the model's parameters after every example. Since you only need to hold one
training example at a time, it is easy on memory. While these frequent updates can offer
more detail and speed, they can result in losses in computational efficiency when compared to
batch gradient descent. The frequent updates also produce noisy gradients, but this can be helpful
in escaping a local minimum and finding the global one.
Mini-batch gradient descent
Mini-batch gradient descent combines concepts from both batch gradient descent and stochastic
gradient descent. It splits the training dataset into small batch sizes and performs updates on each
of those batches. This approach strikes a balance between the computational efficiency of batch
gradient descent and the speed of stochastic gradient descent.
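The three variants differ only in how many examples are used for each parameter update, so they can be sketched with a single function and a batch_size argument. Everything here is a simplifying assumption for illustration: a one-parameter model y = w*x, noise-free made-up data, and hypothetical names and hyperparameter values.

```python
import random


def train(xs, ys, batch_size, alpha=0.05, epochs=100, seed=0):
    """Fit y = w*x; batch_size decides which gradient descent variant runs."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of the squared error over the current batch.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= alpha * grad
    return w


xs = [0.25 * i for i in range(1, 9)]
ys = [3.0 * x for x in xs]  # generated from y = 3x

w_batch = train(xs, ys, batch_size=len(xs))  # one update per epoch
w_sgd = train(xs, ys, batch_size=1)          # one update per example
w_mini = train(xs, ys, batch_size=4)         # one update per mini-batch
```

Because the data has no noise, all three settings end up near w = 3; on real data the SGD path would be visibly noisier than the batch path.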
Challenges with gradient descent
While gradient descent is the most common approach for optimization problems, it does come
with its own set of challenges. Some of them include:
Local minima and saddle points
For convex problems, gradient descent can find the global minimum with ease, but as nonconvex
problems emerge, gradient descent can struggle to find the global minimum, where the model
achieves the best results.
Recall that when the slope of the cost function is at or close to zero, the model stops learning. A
few scenarios beyond the global minimum can also yield this slope, namely local minima and
saddle points. Local minima mimic the shape of a global minimum, where the slope of the cost
function increases on either side of the current point. At a saddle point, however, the cost
function curves upward along one direction and downward along another, so the point is a local
minimum along one axis and a local maximum along the other.
Its name is inspired by that of a horse’s saddle. Noisy gradients can help gradient descent escape
local minima and saddle points.
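A saddle point can be illustrated with the function f(x, y) = x**2 - y**2, which is a standard textbook example and not taken from these notes. Its gradient (2x, -2y) is zero at the origin, which is a minimum along x and a maximum along y. In the sketch below (names and values are hypothetical) a start exactly on the x-axis gets stuck at the saddle, while a tiny perturbation in y, standing in for gradient noise, lets the iterate escape.

```python
def descend(x, y, alpha=0.1, steps=100):
    """Gradient descent on f(x, y) = x**2 - y**2 with gradient (2x, -2y)."""
    for _ in range(steps):
        x, y = x - alpha * 2.0 * x, y - alpha * (-2.0 * y)
    return x, y


stuck = descend(1.0, 0.0)      # y stays exactly 0, so we slide into the saddle
escaped = descend(1.0, 1e-8)   # the small y component grows at every step

print(stuck)    # x is nearly 0 and y is 0: the slope is zero, learning stops
print(escaped)  # x is nearly 0 but y has moved well away from the saddle
```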
Vanishing and Exploding Gradients
In deeper neural networks, particularly recurrent neural networks, we can also
encounter two other problems when the model is trained with gradient descent and
backpropagation.
Vanishing gradients: This occurs when the gradient is too small. As we move backwards during
backpropagation, the gradient continues to become smaller, causing the earlier layers in the
network to learn more slowly than later layers.
When this happens, the weight updates shrink until they become insignificant, i.e. close to 0,
resulting in an algorithm that is no longer learning.
Exploding gradients: This happens when the gradient is too large, creating an unstable model. In
this case, the model weights will grow too large, and they will eventually be represented as NaN.
One solution to this issue is to leverage a dimensionality reduction technique, which can help to
minimize complexity within the model.
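Both problems come from the chain rule: the gradient that reaches an early layer is a product of one factor per layer. A rough sketch, assuming a chain of single sigmoid units that all share the same weight and all have pre-activation z = 0 (a deliberate simplification, with hypothetical names), shows how quickly that product shrinks or grows with depth.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25, reached at z = 0


def gradient_factor(weight, depth, z=0.0):
    """Product of the per-layer factors weight * sigmoid'(z) over `depth` layers."""
    factor = 1.0
    for _ in range(depth):
        factor *= weight * sigmoid_derivative(z)
    return factor


vanishing = gradient_factor(weight=1.0, depth=20)  # 0.25 ** 20
exploding = gradient_factor(weight=8.0, depth=20)  # 2.0 ** 20
print(vanishing, exploding)
```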
Advanced optimization
Gradient Descent is one of the most popular and widely used optimization algorithms. Most of
you must have implemented it to find the values of parameters that minimize the cost
function. In this article, I’ll tell you about some advanced optimization algorithms through
which you can run logistic regression (or even linear regression) much more quickly than with
gradient descent. These algorithms also scale much better to very large machine
learning problems, i.e. where we have a large number of features.
Suppose we have a cost function, say J, and we want to minimize it. We write code that takes the
parameters, say Θ (theta), as input and computes J(Θ) and its partial derivatives. Given code that
does these two things, gradient descent repeatedly performs an update to minimize the
function for us. Similarly, if we provide a way to compute these two things, some advanced
algorithms can minimize the cost function with their own, more sophisticated strategies.
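A function of this kind, returning both J(Θ) and its partial derivatives, might look as follows for logistic regression. This is a sketch under assumptions: the cost is the mean binary cross-entropy, the first column of X is a constant 1 for the bias, and the names and data are made up. The finite-difference check at the end is only there to confirm that the derivatives match the cost.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost_and_gradient(theta, X, y):
    """Return J(theta) and the list of partial derivatives dJ/dtheta_j."""
    m = len(y)
    cost = 0.0
    grad = [0.0] * len(theta)
    for row, label in zip(X, y):
        h = sigmoid(sum(t * x for t, x in zip(theta, row)))
        cost += -(label * math.log(h) + (1 - label) * math.log(1.0 - h))
        for j, x in enumerate(row):
            grad[j] += (h - label) * x
    return cost / m, [g / m for g in grad]


X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]  # first column is the bias input
y = [0, 0, 1, 1]
theta = [0.1, -0.2]
cost, grad = cost_and_gradient(theta, X, y)

# Numerical check: nudge each parameter and compare the change in cost.
eps = 1e-6
numeric = []
for j in range(len(theta)):
    bumped = list(theta)
    bumped[j] += eps
    numeric.append((cost_and_gradient(bumped, X, y)[0] - cost) / eps)
```

Any optimizer that only needs these two quantities, whether gradient descent or one of the algorithms below, can be driven by such a function.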
Types of Advanced Algorithms
We’ll discuss the three main types of such algorithms which are very useful where a large
number of features are involved.
1. Conjugate Gradient:
It is an iterative algorithm for solving large sparse systems of linear equations. Mainly, it’s used
for optimization, neural net training, and image restoration. Theoretically, it is defined as a
method that produces an exact solution after a finite number of iterations. However, in practice
we can’t get the exact solution, as it is unstable w.r.t. small perturbations. This is a good option
for high-dimensional models.
2. BFGS:
It stands for Broyden Fletcher Goldfarb Shanno. It is also an iterative algorithm that is
used for solving unconstrained non-linear optimization problems. It basically
determines the descent direction by preconditioning the gradient with curvature
information. This is done by gradually improving an approximation to the Hessian matrix
of the loss function.
3. L-BFGS:
This is basically a limited-memory version of BFGS, mostly suited to problems with
many variables (more than 1000). It keeps only a few recent updates instead of the
full approximation to the Hessian matrix, which saves memory. We can get a better
solution with a smaller number of iterations. The L-BFGS line search method uses
log-linear convergence rates, which reduces the number of line search iterations.
This is a good option for high-dimensional models.
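The claim that conjugate gradient reaches the exact solution in a finite number of iterations can be checked on a tiny system. The sketch below is a plain textbook version for a symmetric positive-definite matrix, written with Python lists; the matrix, the right-hand side, and the helper names are made up for illustration. For an n x n system, n iterations are enough (up to rounding).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, iterations):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the starting guess x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(iterations):
        if rs_old == 0.0:  # already solved exactly
            break
        Ap = matvec(A, p)
        step = rs_old / dot(p, Ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # The new direction is conjugate to the previous ones.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b, iterations=2)
print(x)  # close to [1/11, 7/11]
```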
Advantages and disadvantages over Gradient Descent:
These algorithms have a number of advantages –
1. You don’t have to choose the learning rate manually. They have a clever inner loop called a line
search algorithm that automatically chooses a good learning rate, and even a different learning
rate for every iteration.
2. They end up converging much faster than gradient descent.
The only disadvantage is that they are a bit more complex, which makes them less
preferred.
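The idea of a line search can be sketched with plain gradient descent plus backtracking: start from a large trial step and halve it until the cost decreases enough (the Armijo condition). This is not BFGS or L-BFGS themselves, only an illustration of how a step size can be chosen automatically at every iteration; the function names, the test function, and the constants shrink and c are assumptions.

```python
def backtracking_descent(f, grad, x, steps=500, initial_step=1.0, shrink=0.5, c=0.3):
    """Gradient descent where each step size is found by backtracking line search."""
    for _ in range(steps):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq == 0.0:
            break  # slope is zero, nothing left to do
        step = initial_step
        # Shrink the trial step until the cost drops by at least c * step * |g|^2.
        while f([xi - step * gi for xi, gi in zip(x, g)]) > f(x) - c * step * g_norm_sq:
            step *= shrink
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def f(x):
    return (x[0] - 3.0) ** 2 + 10.0 * (x[1] + 1.0) ** 2


def grad_f(x):
    return [2.0 * (x[0] - 3.0), 20.0 * (x[1] + 1.0)]


minimum = backtracking_descent(f, grad_f, [0.0, 0.0])
print(minimum)  # close to [3, -1]
```

No learning rate is passed in by hand: the inner while loop picks one for every iteration.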
Regularization
What Are Overfitting and Underfitting?
To train our machine learning model, we give it some data to learn from. The process of plotting
a series of data points and drawing the best fit line to understand the relationship between the
variables is called Data Fitting. Our model is the best fit when it can find all necessary patterns in
our data and avoid the random data points and unnecessary patterns called Noise.
If we allow our machine learning model to look at the data too many times, it will find a lot of
patterns in our data, including the ones which are unnecessary. It will learn really well on the
training dataset and fit very well to it. It will learn important patterns, but it will also learn
from the noise in our data and will not be able to predict on other datasets.
A scenario where the machine learning model tries to learn from the details along with the noise
in the data and tries to fit each data point on the curve is called Overfitting.
Conversely, in a scenario where the model has not been allowed to look at our data a sufficient
number of times, the model won’t be able to find patterns in our training dataset. It will not fit
properly to our training dataset and will fail to perform on new data too.
A scenario where a machine learning model can neither learn the relationship between variables
in the training data nor predict or classify a new data point is called Underfitting.
What are Bias and Variance?
Bias occurs when an algorithm has limited flexibility to learn from data. Such models pay very
little attention to the training data and oversimplify the problem, so the validation (or
prediction) error and the training error follow similar trends. Such models always lead to a high
error on training and test data. High Bias causes underfitting in our model.
Variance defines the algorithm’s sensitivity to specific sets of data. A model with a high variance
pays a lot of attention to training data and does not generalize, so the validation (or
prediction) error and the training error are far apart from each other. Such models usually
perform very well on training data but have high error rates on test data. High Variance causes
overfitting in our model.
Overfitting is a phenomenon that occurs when a Machine Learning model is constrained to the
training set and is not able to perform well on unseen data.
Regularization is a technique used to reduce the errors by fitting the function appropriately on
the given training set and avoid overfitting.
The commonly used regularization techniques are :
1. L1 regularization
2. L2 regularization
3. Dropout regularization
A regression model which uses L1 Regularization technique is called LASSO(Least Absolute
Shrinkage and Selection Operator) regression.
A regression model that uses L2 regularization technique is called Ridge regression.
Lasso Regression adds “absolute value of magnitude” of coefficient as penalty term
to the loss function(L).
Ridge regression adds “squared magnitude” of coefficient as penalty term to the loss function(L).
NOTE that during Regularization the output function(y_hat) does not change. The change is only
in the loss function.
The output function:
The loss function before regularization:
The loss function after regularization
We define the loss function in Logistic Regression as :
L(y_hat, y) = -[y log(y_hat) + (1 - y) log(1 - y_hat)], where y_hat = sigmoid(wx + b)
Loss function with no regularization :
L = -[y log(sigmoid(wx + b)) + (1 - y) log(1 - sigmoid(wx + b))]
Let's say the model overfits the data with the above function.
Loss function with L1 regularization :
L = -[y log(sigmoid(wx + b)) + (1 - y) log(1 - sigmoid(wx + b))] + lambda*||w||_1
Loss function with L2 regularization :
L = -[y log(sigmoid(wx + b)) + (1 - y) log(1 - sigmoid(wx + b))] + lambda*||w||_2^2
lambda is a hyperparameter known as the regularization constant, and it is greater than zero.
lambda > 0
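The penalty terms can be made concrete with a small sketch. It assumes the usual conventions for logistic regression, which are assumptions of this example: the prediction is y_hat = sigmoid(w·x + b) and the log-loss carries a leading minus sign so that smaller is better. The function names and numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss(w, b, x, y):
    """Unregularized loss for one example; the prediction itself is unchanged."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def l1_penalty(w):
    return sum(abs(wi) for wi in w)   # ||w||_1


def l2_penalty(w):
    return sum(wi * wi for wi in w)   # ||w||_2 squared


def regularized_loss(w, b, x, y, lam, kind):
    penalty = l1_penalty(w) if kind == "l1" else l2_penalty(w)
    return logistic_loss(w, b, x, y) + lam * penalty


w, b = [3.0, -4.0], 0.0
x, y = [0.0, 0.0], 1
plain = logistic_loss(w, b, x, y)                       # log(2), since y_hat = 0.5
lasso = regularized_loss(w, b, x, y, lam=0.1, kind="l1")  # plain + 0.1 * 7
ridge = regularized_loss(w, b, x, y, lam=0.1, kind="l2")  # plain + 0.1 * 25
```

Only the loss changes; large weights now cost something, which pushes the optimizer towards smaller coefficients.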
Dropout Regularization
Dropout is a regularization technique for reducing overfitting in neural networks by preventing
complex co-adaptations on training data. It is a very efficient way of performing model
averaging with neural networks. The term "dropout" refers to dropping out units (both hidden
and visible) in a neural network.
A simple and powerful regularization technique for neural networks and deep learning models is
dropout.
How the dropout regularization technique works.
Dropout is a technique where randomly selected neurons are ignored during training. They are
“dropped-out” randomly. This means that their contribution to the activation of downstream
neurons is temporarily removed on the forward pass and any weight updates are not applied to the
neuron on the backward pass.
As a neural network learns, neuron weights settle into their context within the network. Weights
of neurons are tuned for specific features, providing some specialization.
Neighboring neurons come to rely on this specialization, which if taken too far can result in a
fragile model too specialized to the training data. This reliance on context for a neuron during
training is referred to as complex co-adaptation.
You can imagine that if neurons are randomly dropped out of the network during training,
other neurons will have to step in and handle the representation required to make predictions for
the missing neurons. This is believed to result in multiple independent internal representations
being learned by the network.
The effect is that the network becomes less sensitive to the specific weights of neurons. This in
turn results in a network that is capable of better generalization and is less likely to overfit the
training data.
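The forward pass of dropout can be sketched as below. The sketch uses the common "inverted dropout" variant, in which the surviving activations are scaled up by 1/keep_prob during training so that nothing needs to change at prediction time; that scaling, like the function name and arguments, is an assumption of this example and is not described in the text above.

```python
import random


def dropout(activations, drop_prob, training, rng):
    """Randomly zero out units during training; pass them through otherwise."""
    if not training or drop_prob == 0.0:
        return list(activations)  # at prediction time every unit is used
    keep_prob = 1.0 - drop_prob
    # Assumption: inverted dropout, kept units are scaled by 1 / keep_prob.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
layer_output = [1.0] * 1000
during_training = dropout(layer_output, drop_prob=0.5, training=True, rng=rng)
during_inference = dropout(layer_output, drop_prob=0.5, training=False, rng=rng)

dropped = sum(1 for a in during_training if a == 0.0)
print(dropped)  # roughly half of the 1000 units
```

A different random subset is dropped on every call, so no unit can rely on a particular neighbor always being present.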
Solving the problems on overfitting
Overfitting occurs when the model performs well on training data but generalizes poorly to
unseen data. Overfitting is a very common problem in Machine Learning and there has been an
extensive range of literature dedicated to studying methods for preventing overfitting.
Below are eight simple approaches to alleviate overfitting, each introducing only one change to
the data, model, or learning algorithm.
1. Hold-out (data)
Rather than using all of our data for training, we can simply split our dataset into two sets:
training and testing. A common split ratio is 80% for training and 20% for testing. We train our
model until it performs well not only on the training set but also for the testing set. This indicates
good generalization capability since the testing set represents unseen data that were not used for
training. However, this approach would require a sufficiently large dataset to train on even after
splitting.
2. Cross-validation (data)
We can split our dataset into k groups (k-fold cross-validation). We let one of the groups be the
testing set (please see the hold-out explanation) and the others the training set, and repeat this
process until each individual group has been used as the testing set (i.e., k repeats). Unlike
hold-out, cross-validation allows all data to be eventually used for training, but it is also more
computationally expensive than hold-out.
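Both ways of splitting the data can be sketched with the standard library. The helper names, the seed, and the way the folds are formed (every k-th item) are choices made for this example; libraries such as scikit-learn provide ready-made equivalents.

```python
import random


def hold_out_split(data, test_fraction=0.2, seed=0):
    """Shuffle once and cut the data into a training and a testing part."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_splits(data, k):
    """Yield (train, test) pairs so that each group is the testing set once."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, folds[i]


data = list(range(10))
train, test = hold_out_split(data)       # 8 training items, 2 testing items
splits = list(k_fold_splits(data, k=5))  # 5 different (train, test) pairs
```

With hold-out the model is trained once; with k-fold it is trained k times, which is where the extra computational cost comes from.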
3. Data augmentation (data)
A larger dataset would reduce overfitting. If we cannot gather more data and are constrained to
the data we have in our current dataset, we can apply data augmentation to artificially increase
the size of our dataset. For example, if we are training for an image classification task, we can
perform various image transformations to our image dataset (e.g., flipping, rotating, rescaling,
shifting).
4. Feature selection (data)
If we have only a limited number of training samples, each with a large number of features, we
should select only the most important features for training, so that our model doesn’t need to
learn so many features and eventually overfit. We can simply test out different features, train
individual models for these features, and evaluate generalization capabilities, or use one of the
various widely used feature selection methods.
5. L1 / L2 regularization (learning algorithm)
Regularization is a technique to constrain our network from learning a model that is too complex,
which may therefore overfit. In L1 or L2 regularization, we can add a penalty term on the cost
function to push the estimated coefficients towards zero (and not take more extreme values). L2
regularization allows weights to decay towards zero but not to zero, while L1 regularization
allows weights to decay to zero.
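As a rough sketch of the penalty term, the L1 penalty sums the absolute values of the weights and the L2 penalty sums their squares, each scaled by a regularization strength. The names `lam`, `l1_penalty`, `l2_penalty` and `regularized_cost`, as well as the numbers used, are our own illustrative assumptions.

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam * sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lam * sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def regularized_cost(data_loss, weights, lam, kind="l2"):
    """Add the chosen penalty to the unregularized loss."""
    if kind == "l1":
        penalty = l1_penalty(weights, lam)
    elif kind == "l2":
        penalty = l2_penalty(weights, lam)
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return data_loss + penalty


weights = [1.0, -2.0, 0.5]          # hypothetical model coefficients
print(regularized_cost(0.40, weights, lam=0.1, kind="l1"))
print(regularized_cost(0.40, weights, lam=0.1, kind="l2"))
```

Because the L2 penalty grows with the square of a weight, it punishes large weights much more strongly than small ones, while the L1 penalty applies the same pressure to every non-zero weight.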
6. Remove layers / number of units per layer (model)
As mentioned in L1 or L2 regularization, an over-complex model is more likely to overfit.
Therefore, we can directly reduce the model’s complexity by removing layers and reducing the size
of our model. We may further reduce complexity by decreasing the number of neurons in the
fully-connected layers. We should have a model with a complexity that sufficiently balances
between underfitting and overfitting for our task.
7. Dropout (model)
By applying dropout, which is a form of regularization, to our layers, we randomly ignore a subset
of the units of our network during training, each with a set probability. Using dropout, we can
reduce interdependent learning among units, which may have led to overfitting. However, with
dropout, we would need more epochs for our model to converge.
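A minimal sketch of dropout on one layer's activations is given below. It assumes the common "inverted dropout" variant, in which the surviving activations are scaled by 1 / (1 - drop probability) so that their expected value is unchanged; this scaling, the function name `dropout` and its parameters are our assumptions and are not specified in the notes above.

```python
import random


def dropout(activations, drop_prob, training=True, rng=random):
    """Zero each activation with probability drop_prob (training only)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)            # at test time every unit is kept
    keep_prob = 1.0 - drop_prob
    # Keep a unit with probability keep_prob and rescale it; otherwise output 0.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


layer_output = [1.0, 2.0, 3.0, 4.0]         # hypothetical activations
print(dropout(layer_output, drop_prob=0.5, rng=random.Random(0)))
print(dropout(layer_output, drop_prob=0.5, training=False))
```

A different random subset of units is dropped on every training step, so no unit can rely on a specific other unit always being present.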
8. Early stopping (model)
We can first train our model for an arbitrarily large number of epochs and plot the validation loss
graph (e.g., using hold-out). Once the validation loss begins to degrade (i.e., stops decreasing
and begins increasing instead), we stop the training and save the current model. We can implement
this either by monitoring the loss graph or by setting an early stopping trigger. The saved model
would be the optimal model for generalization among the different training epoch values.
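An early stopping trigger can be sketched as follows. The `patience` parameter (how many epochs without improvement we tolerate before stopping) is a common convention that we assume here; the function name and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best model: this is the one we would save.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch    # trigger fires, training stops here
    return best_epoch, len(val_losses) - 1  # trigger never fired


# Hypothetical validation losses, one per epoch (epochs are counted from 0).
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(losses, patience=3)
print("best epoch:", best_epoch, "stopped at epoch:", stop_epoch)
```

Here the loss is lowest at epoch 3, and training stops at epoch 6 after three epochs in a row without improvement; the model saved at epoch 3 is the one we keep.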
Perceptron
A Perceptron is an Artificial Neuron
It is the simplest possible Neural Network
Neural Networks are the building blocks of Machine Learning.
Frank Rosenblatt
Frank Rosenblatt (1928 – 1971) was an American psychologist notable in the field of Artificial
Intelligence
In 1957 he started something really big. He "invented" a Perceptron program, on an IBM 704
computer at Cornell Aeronautical Laboratory.
Scientists had discovered that brain cells (Neurons) receive input from our senses by electrical
signals.
The Neurons, then again, use electrical signals to store information, and to make decisions based
on previous input. Frank had the idea that Perceptrons could simulate brain principles, with the
ability to learn and make decisions.
The Perceptron
The original Perceptron was designed to take a number of binary inputs, and produce one binary
output (0 or 1).
The idea was to use different weights to represent the importance of each input, and that the sum
of the values should be greater than a threshold value before making a decision like true or false
(0 or 1).
Perceptron Example
Imagine a perceptron (in your brain).
The perceptron tries to decide if you should go to a concert.
Is the artist good? Is the weather good?
What weights should these facts have?
Criteria             Input          Weight
Artist is Good       x1 = 0 or 1    w1 = 0.7
Weather is Good      x2 = 0 or 1    w2 = 0.6
Friend will Come     x3 = 0 or 1    w3 = 0.5
Food is Served       x4 = 0 or 1    w4 = 0.3
Alcohol is Served    x5 = 0 or 1    w5 = 0.4
The Perceptron Algorithm
Frank Rosenblatt suggested this algorithm:
1. Set a threshold value
2. Multiply each input by its weight
3. Sum all the results
4. Activate the output
1. Set a threshold value:
Threshold = 1.5
2. Multiply each input by its weight:
x1 * w1 = 1 * 0.7 = 0.7
x2 * w2 = 0 * 0.6 = 0
x3 * w3 = 1 * 0.5 = 0.5
x4 * w4 = 0 * 0.3 = 0
x5 * w5 = 1 * 0.4 = 0.4
3. Sum all the results:
0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 (The Weighted Sum)
4. Activate the Output:
Return true if the sum > 1.5 ("Yes I will go to the Concert")
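The four steps can be written as a short Python function. This is a sketch of the concert example only; the function name `perceptron` is our own.

```python
def perceptron(inputs, weights, threshold):
    """Return (decision, weighted_sum) for binary inputs and their weights."""
    # Steps 2 and 3: multiply each input by its weight and sum the results.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # Step 4: activate the output if the sum exceeds the threshold.
    return weighted_sum > threshold, weighted_sum


inputs = [1, 0, 1, 0, 1]                 # artist, weather, friend, food, alcohol
weights = [0.7, 0.6, 0.5, 0.3, 0.4]
decision, total = perceptron(inputs, weights, threshold=1.5)   # Step 1: threshold
print("weighted sum:", round(total, 2))
print("go to the concert:", decision)
```

With the same inputs but a threshold of 1.7, the weighted sum of 1.6 would no longer be large enough and the perceptron would answer "no".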
Perceptron Terminology
Perceptron Inputs
Node values
Node Weights
Activation Function
Perceptron Inputs
Perceptron inputs are called nodes.
The nodes have both a value and a weight.
Node Values
In the example above, the node values are: 1, 0, 1, 0, 1
The binary input values (0 or 1) can be interpreted as (no or yes) or (false or true).
Node Weights
Weights show the strength of each node.
In the example above, the node weights are: 0.7, 0.6, 0.5, 0.3, 0.4
The Activation Function
The activation function maps the result (the weighted sum) into a required value like 0 or 1.
In the example above, the activation function is simple: (sum > 1.5)
The binary output (1 or 0) can be interpreted as (yes or no) or (true or false)
Neural Networks
Neural Networks are one of the most significant developments in the history of computing.
Neural Networks can solve problems that are very hard to solve with hand-written rules:
Medical Diagnosis
Face Detection
Voice Recognition
Neural Networks are the essence of Deep Learning.
The Deep Learning Revolution
The deep learning revolution is here!
The deep learning revolution started around 2010. Since then, Deep Learning has solved many
"unsolvable" problems.
The deep learning revolution was not started by a single discovery. It more or less happened
when several needed factors were ready:
Computers were fast enough
Computer storage was big enough
Better training methods were invented
Better tuning methods were invented
Neurons
Scientists agree that our brain has around 100 billion neurons.
These neurons have on the order of 100 trillion connections (synapses) between them.
Image credit: University of Basel, Biozentrum.
Neurons (aka Nerve Cells) are the fundamental units of our brain and nervous system. The
neurons are responsible for receiving input from the external world, for sending output
(commands to our muscles), and for transforming the electrical signals in between.
Neural Networks
Artificial Neural Networks are normally called Neural Networks (NN). Neural networks are in
fact multi-layer Perceptrons. The perceptron defines the first step into multi-layered neural
networks.
The Neural Network Model
Input data (Yellow) are processed against a hidden layer (Blue) and modified against another
hidden layer (Green) to produce the final output (Red).
Tom Mitchell
Tom Michael Mitchell (born 1951) is an American computer scientist and University Professor at
the Carnegie Mellon University (CMU). He is a former Chair of the Machine Learning
Department at CMU.
"A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E."
Tom Mitchell (1997)
E: Experience (the number of times).
T: The Task (driving a car).
P: The Performance (good or bad).
The Giraffe Story
In 2015, Matthew Lai, a student at Imperial College in London created a neural network called
Giraffe.
Giraffe could be trained in 72 hours to play chess at the same level as an international master.
Computers playing chess are not new, but the way this program was created was new. Smart
chess playing programs take years to build, while Giraffe was built in 72 hours with a neural
network.
Multiclass classification
Multiclass classification in Machine Learning classifies data into more than 2 classes or outputs
using a set of features that belong to specific classes. Classification here means categorizing data
and forming groups based on similarities or features.
The independent variables or features play a vital role in classifying our data in a dataset.
Regarding multiclass classification, we have more than two classes in our dependent variable or
output, as seen in Fig.1.
In Gmail, you will find that all incoming emails are segregated based on their content and
importance. Some emails are of the utmost importance and go to the Primary tab, some go to
Social for marketing content, and some go to the Spam folder if they are dangerous or clickbaity.
So, this classification of emails based on their content or their flagging based on specific words
is an example of multiclass classification in machine learning.
The above picture is taken from the Iris dataset which depicts that the target variable has three
categories i.e., Virginica, setosa, and Versicolor, which are three species of Iris plant. We might
use this dataset later, as an example of a conceptual understanding of multiclass classification.
Which classifiers do we use in multiclass classification? When do we use them?
We use many algorithms such as Naïve Bayes, Decision trees, SVM, Random forest classifier,
KNN, and logistic regression for classification. But we might learn about only a few of them
here because our motive is to understand multiclass classification. So, using a few algorithms we
will try to cover almost all the relevant concepts related to multiclass classification.
1. Naive Bayes
Naive Bayes is a parametric algorithm that requires a fixed set of parameters or assumptions to
simplify the machine’s learning process. In parametric algorithms, the number of parameters
used is independent of the size of the training data.
Naïve Bayes Assumption:
It assumes that the features of a dataset are entirely independent of each other. But it is
generally not true. That is why we also call it a ‘naïve’ algorithm.
How does it work?
It is a classification model based on conditional probability that uses Bayes' theorem to
predict the class of unseen data points. This model is often used for large datasets as it is easy to
build and fast for both training and prediction. Moreover, even without hyperparameter tuning, it
can give results that are competitive with more complex algorithms.
Naïve Bayes can also be an extremely good text classifier; for example, it performs well on
spam/ham datasets.
Bayes' theorem is stated as:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
By P(A|B), we are trying to find the probability of event A given that event B is true. It is
also known as the posterior probability.
Event B is known as the evidence.
P(A) is called the prior of A, which means it is the probability of the event before the evidence is seen.
P(B|A) is known as the conditional probability or likelihood.
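As a small numeric illustration of Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B), suppose A is "the email is spam" and B is "the email contains a certain word". All probabilities below are made-up numbers, and the function name `posterior` is our own.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) using Bayes' theorem."""
    # Evidence P(B) by the law of total probability.
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / evidence


# Hypothetical values: P(spam) = 0.2, P(word | spam) = 0.6, P(word | not spam) = 0.05
p_spam_given_word = posterior(0.2, 0.6, 0.05)
print(round(p_spam_given_word, 4))
```

Here the evidence is 0.6 × 0.2 + 0.05 × 0.8 = 0.16, so the posterior is 0.12 / 0.16 = 0.75: seeing the word raises the probability of spam from the prior of 0.2 to 0.75.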
Note: Naïve Bayes is a linear classifier, which might not be suitable for classes that are not linearly
separable in a dataset. Let us look at the figure below:
As can be seen in Fig.2b, classifiers such as KNN can be used for non-linear classification
instead of the Naïve Bayes classifier.
Advantages
It is beneficial in cases involving large datasets and many dimensions
One of the most efficient algorithms in terms of training when you have limited data, and
very fast when testing
Works well for multiclass classification which involves categorical variables
Disadvantages
It is naive in terms of assuming every feature to be independent of one another
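Independence of every feature is rarely possible in real life, hence some dependent features
influence the output
Might not generalize well on unseen data, as a feature value that was never seen together with a
class during training is assigned zero probability (unless smoothing is applied)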
2. KNN (K-nearest neighbours)
KNN is a supervised machine learning algorithm that can be used to solve both classification and
regression problems. It is one of the simplest yet powerful algorithms. It does not learn a
discriminative function from the training data but memorizes it instead. For this reason, it is also
known as a lazy algorithm.
How does it work?
The K-nearest neighbor algorithm takes a majority vote among the K most similar instances,
and it uses a distance metric between two data points to measure how similar they are. The most
popular choice is Euclidean distance, which is written as:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
K in KNN is the hyperparameter we can choose to get the best possible fit for the dataset.
Suppose we keep the smallest value for K, i.e., K=1. In that case, the model will show low bias
but high variance because our model will be overfitted.
A larger value for K, e.g. K=10, will smooth our decision boundary, meaning low
variance but high bias. So, we always go for a trade-off between the bias and variance, known as
the bias-variance trade-off.
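The majority vote with Euclidean distance, d(x, y) = sqrt(sum of (x_i - y_i)^2), can be sketched as follows. The toy points, labels and function names are our own illustrative assumptions; real implementations use faster neighbour searches.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest points."""
    neighbours = sorted(zip(train_points, train_labels),
                        key=lambda pair: euclidean(pair[0], query))
    top_labels = [label for _, label in neighbours[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical two-class toy data.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, query=(2, 2), k=3))
print(knn_predict(points, labels, query=(6, 5), k=3))
```

Note that nothing is learned in advance: every prediction scans all stored training points, which is why KNN is called a lazy algorithm.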
Let us understand more about it by looking at its advantages and disadvantages:
Advantages-
KNN makes no assumptions about the distribution of classes i.e. it is a non-parametric
classifier
It is one of the methods that can be widely used in multiclass classification
With a sufficiently large K, it is relatively robust to outliers
This classifier is easy to use and implement
Disadvantages-
K value is difficult to find as it must work well with test data also, not only with the
training data
It is a lazy algorithm as it does not build a model
It is computationally expensive because it measures the distance to each data point
3. Decision Trees
As the name suggests, the decision tree is a tree-like structure of decisions made based on some
conditional statements. This is one of the most used supervised learning methods in classification
problems because of their high accuracy, stability, and easy interpretation. They can map linear
as well as non-linear relationships in a good way.
Let us look at the figure below, Fig.3, where we have used adult census income dataset with two
independent variables and one dependent variable. Our target or dependent variable is income,
which has binary classes i.e, <=50K or >50K.
Fig 3: Decision Tree- Binary Classifier
We can see that the algorithm works based on some conditions, such as Age <50 and Hours>=40,
to further split into two buckets for reaching towards homogeneity. Similarly, we can move
ahead for multiclass classification problem datasets, such as Iris data.
Now a question arises in our mind. How should we decide which column to take first and what is
the threshold for splitting? For splitting a node and deciding threshold for splitting, we use
entropy or Gini index as measures of impurity of a node. We aim to maximize the purity or
homogeneity on each split, as we saw in Fig.2.
What is Entropy?
Entropy or Shannon entropy is the measure of uncertainty, which has a similar sense as in
thermodynamics. By entropy, we talk about a lack of information. To understand better, let us
suppose we have a bag full of red and green balls.
Scenario1: 5 red balls and 5 green balls.
If you are asked to take one ball out of it, what is the probability that the ball will be
green?
Here we all know there is a 50% chance that the ball we pick will be green.
Scenario2: 1 red and 9 green balls
Here the chance of a red ball is small, and we are fairly certain that the ball we pick will be
green because of its 9/10 probability.
Scenario3: 0 red and 10 green balls
In this case, we are completely certain that the ball we pick is green.
In the second and third scenarios, there is high certainty of a green ball in our first pick, or we can
say there is less entropy. But in the first scenario there is high uncertainty or high entropy.
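Entropy ∝ Uncertainty
Formula for entropy:
$$H = -\sum_{i} p(i) \log_2 p(i)$$
Where p(i) is the probability of an element/class ‘i’ in the data.
After finding the entropy we find the information gain, which is written as below:
$$\text{Information Gain} = H(\text{parent}) - \sum_{k} \frac{N_k}{N}\, H(\text{child}_k)$$
where N is the number of samples at the parent node and N_k is the number of samples in the k-th child node.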
What is Gini Index?
Gini is another useful metric to decide splitting in decision trees.
Gini Index formula:
$$\text{Gini} = 1 - \sum_{i} p(i)^2$$
Where p(i) is the probability of an element/class ‘i’ in the data.
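The three ball scenarios can be checked numerically. The sketch below assumes the standard definitions: entropy is the negative sum of p(i) log2 p(i), the Gini index is 1 minus the sum of p(i) squared, and information gain is the parent entropy minus the size-weighted entropy of the child nodes. The function names are our own.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a node with the given class counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]   # 0 * log(0) is treated as 0
    return sum(-p * math.log2(p) for p in probs)


def gini(counts):
    """Gini index of a node with the given class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted


# [red, green] counts for the three scenarios.
for name, counts in [("Scenario1", [5, 5]), ("Scenario2", [1, 9]), ("Scenario3", [0, 10])]:
    print(name, "entropy =", round(entropy(counts), 3), "gini =", round(gini(counts), 3))

# A split that separates the 5 red and 5 green balls perfectly.
print("information gain:", information_gain([5, 5], [[5, 0], [0, 5]]))
```

Both measures are largest for the 5/5 bag and drop to zero for the all-green bag, which matches the intuition that entropy grows with uncertainty.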
Logistic regression is a supervised classification algorithm that we have so far seen used in
binary classification problems. But here, we will learn how we can extend this algorithm for
classifying multiclass data.
In binary, we have 0 or 1 as our classes, and the threshold for a balanced binary classification
dataset is generally 0.5.
Whereas, in multiclass, there can be 3 balanced classes for which we require 2 threshold values,
which can be 0.33 and 0.66.
But a question arises: by what method do we calculate the threshold and approach multiclass
classification?
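So let’s first see a general formula that we use for the logistic regression curve:
$$P = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$$
Where P is the probability of the event occurring, b_0 is the intercept and b_1 is the coefficient of the feature x, and the above equation derives from here:
$$\ln\left(\frac{P}{1 - P}\right) = b_0 + b_1 x$$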
There are two ways to approach this kind of a problem. They are explained as below:
One vs. Rest (OvR) – Here, one class is considered as positive and all the rest are taken as
negatives, and then we generate n classifiers. Let us suppose there are 3 classes in a dataset;
in this approach, we train 3 classifiers by taking one class at a time as positive and the other
two classes as negative. Now, each classifier predicts the probability of a particular class, and the
class with the highest probability is the answer.
One vs. One (OvO) – In this approach, n ∗ (n − 1)⁄2 binary classifier models are generated, one
for each pair of classes. Here each classifier predicts one class label. Once we input test data to
the classifiers, the class which has been predicted the most is chosen as the answer.
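The two decision rules can be sketched without training any model. In the sketch below the per-class probabilities and the pairwise winners are made-up values standing in for the outputs of already trained binary classifiers; the function names are our own.

```python
from collections import Counter
from itertools import combinations


def ovr_predict(class_probabilities):
    """One vs. Rest: pick the class whose classifier gives the highest probability."""
    return max(class_probabilities, key=class_probabilities.get)


def ovo_pairs(classes):
    """One vs. One: one binary classifier per unordered pair of classes."""
    return list(combinations(classes, 2))


def ovo_predict(pairwise_winners):
    """One vs. One: the class predicted most often across all pairs wins."""
    return Counter(pairwise_winners).most_common(1)[0][0]


classes = ["setosa", "versicolor", "virginica"]

# Hypothetical outputs of the three one-vs-rest classifiers for one test sample.
print(ovr_predict({"setosa": 0.10, "versicolor": 0.70, "virginica": 0.20}))

# n * (n - 1) / 2 = 3 pairs for n = 3 classes.
print(ovo_pairs(classes))

# Hypothetical winner of each pairwise classifier for the same sample.
print(ovo_predict(["versicolor", "setosa", "versicolor"]))
```

For 3 classes both approaches need 3 classifiers, but the OvO count grows quadratically: 10 classes already need 45 pairwise models against 10 for OvR.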
Confusion Matrix in Multi-class Classification
A confusion matrix is a table used in every classification problem to describe the performance of
a model on test data.
Just as we use the confusion matrix in binary classification, we can also use it in multiclass
classification.
Let’s take an example to understand how we can find precision and recall using a
confusion matrix in multiclass classification.
Finding precision and recall from above Table 1:
Precision for the Virginica class is the number of correctly predicted virginica species out of
all the predicted virginica species, which is 4/7 = 57.1%. This means that only 4/7 of the
species our predictor classifies as Virginica are virginica. Similarly, we can find for other species,
i.e., for Setosa and Versicolor, precision is 20% and 62.5%, respectively.
Recall for the Virginica class is the number of correctly predicted virginica species out of
actual virginica species, which is 50%. This means that our classifier classified half of the
virginica species as virginica. Similarly, we can find for other species, i.e., for Setosa and
Versicolor, recall is 20% and 71.4%, respectively.
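Per-class precision and recall can be computed directly from a confusion matrix. Table 1 itself is not reproduced in these notes, so the matrix below is a hypothetical one: its row and column totals are chosen so that the resulting percentages agree with those quoted above, but its off-diagonal entries are made up. Rows are assumed to be actual classes and columns predicted classes.

```python
def precision_recall(matrix, labels):
    """Return {label: (precision, recall)}; rows = actual, columns = predicted."""
    results = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted_total = sum(row[i] for row in matrix)   # column sum
        actual_total = sum(matrix[i])                     # row sum
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        results[label] = (precision, recall)
    return results


labels = ["Setosa", "Versicolor", "Virginica"]
# Hypothetical confusion matrix (not the original Table 1).
matrix = [
    [1, 1, 3],   # actual Setosa
    [2, 5, 0],   # actual Versicolor
    [2, 2, 4],   # actual Virginica
]
results = precision_recall(matrix, labels)
for label, (precision, recall) in results.items():
    print(f"{label}: precision={precision:.3f} recall={recall:.3f}")
```

Precision divides the diagonal entry by its column total (everything predicted as that class), while recall divides it by its row total (everything that actually belongs to that class).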
Multiclass Classification vs Multi-label Classification
Multiclass classification is a machine learning task where the goal is to assign instances to one
of multiple predefined classes or categories, where each instance belongs to exactly one class.
Whereas multilabel classification is a machine learning task where each instance can be
associated with multiple labels simultaneously, allowing for the assignment of multiple binary
labels to the instance.
Backpropagation
What are Artificial Neural Networks? A neural network is a group of connected input/output
units in which each connection has an associated weight. It helps you to
build predictive models from large databases. This model is inspired by the human nervous
system. It helps you to conduct image understanding, human learning, computer speech, etc.
What is Backpropagation? Backpropagation is the essence of neural network training. It is the
method of fine-tuning the weights of a neural network based on the error rate obtained in the
previous epoch (i.e., iteration). Proper tuning of the weights allows you to reduce error rates
and make the model reliable by increasing its generalization.
Backpropagation in neural network is a short form for “backward propagation of errors.” It is a
standard method of training artificial neural networks. This method helps calculate the
gradient of a loss function with respect to all the weights in the network.
How Backpropagation Algorithm Works
The backpropagation algorithm in a neural network computes the gradient of the loss
function with respect to each weight by the chain rule. It efficiently computes the gradient one
layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define
how the gradient is used. It generalizes the computation in the delta rule. Consider the following
backpropagation neural network example diagram to understand:
1. Inputs X, arrive through the preconnected path
2. Input is modeled using real weights W. The weights are usually randomly selected.
3. Calculate the output for every neuron from the input layer, to the hidden layers, to
the output layer.
4. Calculate the error in the outputs:
Error = Actual Output - Desired Output
5. Travel back from the output layer to the hidden layer to adjust the weights
such that the error is decreased.
6. Keep repeating the process until the desired output is achieved.
Why We Need Backpropagation?
Most prominent advantages of Backpropagation are:
• Backpropagation is fast, simple and easy to program
• It has no parameters to tune apart from the numbers of input
• It is a flexible method as it does not require prior knowledge about the
network
• It is a standard method that generally works well
• It does not need any special mention of the features of the function
to be learned.
What is a Feed Forward Network?
A feedforward neural network is an artificial neural network where the nodes never
form a cycle. This kind of neural network has an input layer, hidden layers, and an
output layer. It is the first and simplest type of artificial neural network.
Types of Backpropagation Networks
Two Types of Backpropagation Networks are:
• Static Back-propagation
• Recurrent Backpropagation
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input
for static output. It is useful to solve static classification issues like optical character
recognition.
Recurrent Backpropagation:
In recurrent backpropagation, activations are fed forward until a fixed value is achieved. After
that, the error is computed and propagated backward. The main difference between these two
methods is that the mapping is rapid in static back-propagation while it is nonstatic in recurrent
backpropagation.
Disadvantages of using Backpropagation
• The actual performance of backpropagation on a specific problem is dependent on the input data.
• Back propagation algorithm in data mining can be quite sensitive to noisy data
• You need to use the matrix-based approach for backpropagation instead of mini-batch.
Back Propagation plays a critical role in how neural networks improve over time. Here's why:
1. Efficient Weight Update: It computes the gradient of the loss function with respect to
each weight using the chain rule making it possible to update weights efficiently.
2. Scalability: The Back Propagation algorithm scales well to networks with multiple layers
and complex architectures making deep learning feasible.
3. Automated Learning: With Back Propagation the learning process becomes automated
and the model can adjust itself to optimize its performance.
Working of Back Propagation Algorithm
The Back Propagation algorithm involves two main steps: the Forward Pass and the Backward
Pass.
1. Forward Pass
In forward pass the input data is fed into the input layer. These inputs combined with their
respective weights are passed to hidden layers. For example in a network with two hidden layers
(h1 and h2) the output from h1 serves as the input to h2. Before applying an activation function,
a bias is added to the weighted inputs.
Each hidden layer computes the weighted sum (`a`) of the inputs then applies an activation
function like ReLU (Rectified Linear Unit) to obtain the output (`o`). The output is passed to
the next layer where an activation function such as softmax converts the weighted outputs into
probabilities for classification.
The forward pass using weights and biases
2. Backward Pass
In the backward pass the error (the difference between the predicted and actual output) is
propagated back through the network to adjust the weights and biases. One common method for
error calculation is the Mean Squared Error (MSE) given by:
$$\text{MSE} = (\text{Predicted Output} - \text{Actual Output})^2$$
Once the error is calculated the network adjusts weights using gradients which are computed
with the chain rule. These gradients indicate how much each weight and bias should be adjusted
to minimize the error in the next iteration. The backward pass continues layer by layer ensuring
that the network learns and improves its performance. The activation function through its
derivative plays a crucial role in computing these gradients during Back Propagation.
Example of Back Propagation in Machine Learning
Let’s walk through an example of Back Propagation in machine learning. Assume the neurons
use the sigmoid activation function for the forward and backward pass. The target output is 0.5
and the learning rate is 1.
Example (1) of backpropagation sum
Forward Propagation
The computations below use the inputs x1 = 0.35 and x2 = 0.7, the input-to-hidden weights
w1,1 = w2,1 = 0.2 (into h1) and w1,2 = w2,2 = 0.3 (into h2), and the hidden-to-output weights
w1,3 = 0.3 and w2,3 = 0.9. The output of h1 is y3, the output of h2 is y4, and the output of O3 is y5.
1. Initial Calculation
The weighted sum at each node is calculated using:
$$a_j = \sum_i w_{i,j}\, x_i$$
Where,
a_j is the weighted sum of all the inputs and weights at each node
w_{i,j} represents the weight between the i-th input and the j-th neuron
x_i represents the value of the i-th input
o (output): After applying the activation function to a_j, we get the output of the neuron:
$$o_j = \text{activation function}(a_j)$$
2. Sigmoid Function
The sigmoid function returns a value between 0 and 1, introducing non-linearity into the model.
$$y_j = \frac{1}{1 + e^{-a_j}}$$
We use it to find the outputs y3, y4 and y5.
3. Computing Outputs
At h1 node
$$a_1 = (w_{1,1} x_1) + (w_{2,1} x_2) = (0.2 \times 0.35) + (0.2 \times 0.7) = 0.21$$
Once we have calculated the a1 value, we can proceed to find the y3 value:
$$y_3 = F(a_1) = \frac{1}{1 + e^{-0.21}} \approx 0.5523$$
Similarly find the values of y4 at h2 and y5 at O3
$$a_2 = (w_{1,2} x_1) + (w_{2,2} x_2) = (0.3 \times 0.35) + (0.3 \times 0.7) = 0.315$$
$$y_4 = F(0.315) = \frac{1}{1 + e^{-0.315}} \approx 0.5781$$
$$a_3 = (w_{1,3} y_3) + (w_{2,3} y_4) = (0.3 \times 0.5523) + (0.9 \times 0.5781) \approx 0.6860$$
$$y_5 = F(0.6860) = \frac{1}{1 + e^{-0.6860}} \approx 0.6651$$
Values of y3, y4 and y5
4. Error Calculation
Our target output is 0.5 but we obtained approximately 0.67. To calculate the error we can use the
formula below:
$$\text{Error} = y_{target} - y_5 = 0.5 - 0.6651 = -0.1651$$
Using this error value we will be backpropagating.
Back Propagation
1. Calculating Gradients
The change in each weight is calculated as:
$$\Delta w_{i,j} = \eta \times \delta_j \times O_i$$
Where:
δ_j is the error term of the unit that the weight leads into,
O_i is the output of the unit (or the input value) that the weight comes from,
η is the learning rate.
2. Output Unit Error
For O3:
$$\delta_5 = y_5 (1 - y_5)(y_{target} - y_5) = 0.6651\,(1 - 0.6651)(-0.1651) \approx -0.03677$$
3. Hidden Unit Error
For h1:
$$\delta_3 = y_3 (1 - y_3)(w_{1,3} \times \delta_5) = 0.5523\,(1 - 0.5523)(0.3 \times -0.03677) \approx -0.002728$$
For h2:
$$\delta_4 = y_4 (1 - y_4)(w_{2,3} \times \delta_5) = 0.5781\,(1 - 0.5781)(0.9 \times -0.03677) \approx -0.008071$$
4. Weight Updates
For the weights from hidden to output layer:
$$\Delta w_{2,3} = \eta \times \delta_5 \times y_4 = 1 \times (-0.03677) \times 0.5781 \approx -0.02126$$
New weight:
$$w_{2,3}(new) = 0.9 + (-0.02126) \approx 0.8787$$
For weights from input to hidden layer:
$$\Delta w_{1,1} = \eta \times \delta_3 \times x_1 = 1 \times (-0.002728) \times 0.35 \approx -0.00095$$
New weight:
$$w_{1,1}(new) = 0.2 + (-0.00095) \approx 0.1990$$
Similarly the other weights are updated:
$$w_{1,2}(new) \approx 0.297$$
$$w_{1,3}(new) \approx 0.280$$
$$w_{2,1}(new) \approx 0.198$$
$$w_{2,2}(new) \approx 0.294$$
The updated weights are illustrated below
Through backward pass the weights are updated
After updating the weights the forward pass is repeated, yielding:
$$y_3 \approx 0.552$$
$$y_4 \approx 0.577$$
$$y_5 \approx 0.660$$
The new error is:
$$\text{Error} = y_{target} - y_5 = 0.5 - 0.660 = -0.160$$
The error has become smaller in magnitude (from -0.1651 to about -0.160), but y5 is still not the
target output, so the process of calculating the error and backpropagating continues until the
desired output is reached.
This process demonstrates how Back Propagation iteratively updates weights by minimizing
errors until the network accurately predicts the output.
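The worked example can be reproduced with a short script. This is a sketch under stated assumptions: the network has two inputs, two sigmoid hidden units and one sigmoid output with no biases, the inputs and initial weights are the ones that appear in the computations above, and the function names are our own. Values are evaluated without intermediate rounding, so the last digits can differ from hand-rounded figures.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w_hidden, w_out):
    """Return (hidden outputs [y3, y4], network output y5)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, out


def backprop_step(x, target, w_hidden, w_out, lr):
    """One forward pass, one backward pass, and one weight update."""
    hidden, out = forward(x, w_hidden, w_out)
    # Output error term: y5 * (1 - y5) * (target - y5)
    delta_out = out * (1.0 - out) * (target - out)
    # Hidden error terms: y * (1 - y) * (weight to output * delta_out)
    delta_hidden = [h * (1.0 - h) * w * delta_out for h, w in zip(hidden, w_out)]
    # Weight change = learning rate * error term of receiving unit * output of sending unit
    new_w_out = [w + lr * delta_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = [[w + lr * d * xi for w, xi in zip(ws, x)]
                    for ws, d in zip(w_hidden, delta_hidden)]
    return new_w_hidden, new_w_out, out


x = [0.35, 0.7]
w_hidden = [[0.2, 0.2],    # weights into h1: w1,1 and w2,1
            [0.3, 0.3]]    # weights into h2: w1,2 and w2,2
w_out = [0.3, 0.9]         # weights into O3: w1,3 and w2,3

new_w_hidden, new_w_out, y5_before = backprop_step(x, 0.5, w_hidden, w_out, lr=1.0)
_, y5_after = forward(x, new_w_hidden, new_w_out)
print("y5 before update:", round(y5_before, 4))
print("y5 after update: ", round(y5_after, 4))
print("updated weights: ", new_w_hidden, new_w_out)
```

Calling `backprop_step` repeatedly in a loop moves y5 closer to the target of 0.5 a little on every iteration.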
Activation Functions in neural Networks
An activation function decides whether, and how strongly, a neuron should be activated: the
neuron calculates the weighted sum of its inputs, adds a bias term, and passes the result through
the activation function. This helps the model make complex decisions and
predictions by introducing non-linearities to the output of each neuron.
Before diving into the activation function, you should have prior knowledge of the following
topics: Neural Networks, Backpropagation
Introducing Non-Linearity in Neural Network
Non-linearity means that the relationship between input and output is not a straight line. In
simple terms the output does not change proportionally with the input. A common choice is the
ReLU function defined as $\sigma(x) = \max(0, x)$.
Imagine you want to classify apples and bananas based on their shape and color.
If we use a linear function it can only separate them using a straight line.
But real-world data is often more complex like overlapping colors, different lighting, etc.
By adding a non-linear activation function like ReLU, Sigmoid or Tanh the network can
create curved decision boundaries to separate them correctly.
Effect of Non-Linearity
The inclusion of the ReLU activation function $\sigma$ allows $h_1$ to introduce a non-linear decision
boundary in the input space. This non-linearity enables the network to learn more complex
patterns that are not possible with a purely linear model such as:
Modeling functions that are not linearly separable.
Increasing the capacity of the network to form multiple decision boundaries based on the
combination of weights and biases.
Why is Non-Linearity Important in Neural Networks?
Neural networks consist of neurons that operate using weights, biases and activation functions.
In the learning process these weights and biases are updated based on the error produced at the
output—a process known as backpropagation. Activation functions enable backpropagation by
providing gradients that are essential for updating the weights and biases.
Without non-linearity even deep networks would be limited to solving only simple, linearly
separable problems. Activation functions help neural networks to model highly complex data
distributions and solve advanced deep learning tasks. Adding non-linear activation functions
introduce flexibility and enable the network to learn more complex and abstract patterns from
data.
Mathematical Proof of Need of Non-Linearity in Neural Networks
To illustrate the need for non-linearity in neural networks with a specific example, let's consider a
network with two input nodes ($i_1$ and $i_2$), a single hidden layer containing
neurons $h_1$ and $h_2$, and an output neuron (out).
We will use $w_1, w_2, w_3, w_4$ as the weights connecting the inputs to the hidden neurons and
$w_5, w_6$ as the weights connecting the hidden neurons to the output. We'll also include biases
($b_1$ and $b_2$ for the hidden neurons and bias for the output neuron) to complete the model.
1. Input Layer: Two inputs $i_1$ and $i_2$.
2. Hidden Layer: Two neurons $h_1$ and $h_2$.
3. Output Layer: One output neuron.
The input to each hidden neuron is calculated as a weighted sum of the inputs plus a bias:
$$h_1 = i_1 w_1 + i_2 w_3 + b_1$$
$$h_2 = i_1 w_2 + i_2 w_4 + b_2$$
The output neuron is then a weighted sum of the hidden neurons' outputs plus a bias:
$$\text{output} = h_1 w_5 + h_2 w_6 + \text{bias}$$
Here, $h_1$, $h_2$ and output are linear expressions. Substituting $h_1$ and $h_2$ into the last
equation shows that output is itself just a linear function of $i_1$ and $i_2$, so without an
activation function the hidden layer adds no expressive power.
In order to add non-linearity, we will be using the sigmoid activation function in the output layer:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
$$\text{final output} = \sigma(h_1 w_5 + h_2 w_6 + \text{bias})$$
$$\text{final output} = \frac{1}{1 + e^{-(h_1 w_5 + h_2 w_6 + \text{bias})}}$$
This gives the final output of the network after applying the sigmoid activation function in the output
layer, introducing the desired non-linearity.
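The fact that the purely linear network collapses into a single linear expression can be checked numerically. In the sketch below the parameter values are arbitrary made-up numbers and the function names are our own; the collapsed coefficients are obtained by substituting h1 and h2 into the output equation.

```python
def two_layer_linear(i1, i2, p):
    """The network above with no activation function anywhere."""
    h1 = i1 * p["w1"] + i2 * p["w3"] + p["b1"]
    h2 = i1 * p["w2"] + i2 * p["w4"] + p["b2"]
    return h1 * p["w5"] + h2 * p["w6"] + p["bias"]


def collapsed_linear(i1, i2, p):
    """A single linear expression a*i1 + b*i2 + c with the same output."""
    a = p["w1"] * p["w5"] + p["w2"] * p["w6"]
    b = p["w3"] * p["w5"] + p["w4"] * p["w6"]
    c = p["b1"] * p["w5"] + p["b2"] * p["w6"] + p["bias"]
    return a * i1 + b * i2 + c


# Arbitrary hypothetical parameters.
params = {"w1": 0.5, "w2": -1.0, "w3": 2.0, "w4": 0.25,
          "w5": 1.5, "w6": -0.5, "b1": 0.1, "b2": -0.2, "bias": 0.3}

for i1, i2 in [(0.0, 0.0), (1.0, 2.0), (-3.0, 0.5)]:
    print(two_layer_linear(i1, i2, params), collapsed_linear(i1, i2, params))
```

Both functions print the same value for every input (up to floating-point rounding), which shows that the hidden layer adds nothing until a non-linear activation is placed between the layers.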
Types of Activation Functions in Deep Learning
1. Linear Activation Function
A linear activation function is a straight line defined by y = x. No matter how many layers the
neural network contains, if they all use linear activation functions, the output is a linear
combination of the input.
The range of the output spans from −∞ to +∞.
A linear activation function is typically used in only one place: the output layer.
Using linear activation across all layers limits the network's ability to learn complex
patterns.
Linear activation functions are useful for specific tasks but must be combined with non-linear
functions to enhance the neural network's learning and predictive capabilities.
Figure: Linear activation function (identity function), which returns the input as the output.
2. Non-Linear Activation Functions
1. Sigmoid Function
The sigmoid activation function is characterized by its 'S' shape. It is mathematically defined
as $A = \frac{1}{1 + e^{-x}}$. This formula ensures a smooth and continuous output that is essential
for gradient-based optimization methods.
It allows neural networks to handle and model complex patterns that linear equations
cannot.
The output ranges between 0 and 1, hence useful for binary classification.
The function exhibits a steep gradient when x values are between -2 and 2. This
sensitivity means that small changes in input x can cause significant changes in output y
which is critical during the training process.
Figure: Sigmoid (logistic) activation function graph.
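The sigmoid and its gradient can be sketched in a few lines. This is a minimal illustration rather than a reference implementation; the function names are assumptions, and the branch on the sign of x is only there to avoid overflow for large negative inputs.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow in exp for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The gradient is largest at x = 0 and shrinks quickly outside the range of roughly -2 to 2, which matches the steep region described above.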
2. Tanh Activation Function
The tanh function (hyperbolic tangent) is a scaled and shifted version of the sigmoid, so its
output extends below zero as well as above it. It is defined as:
$$f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1$$
Alternatively, it can be expressed using the sigmoid function:
$$\tanh(x) = 2 \times \text{sigmoid}(2x) - 1$$
Value Range: Outputs values from -1 to +1.
Non-linear: Enables modeling of complex data patterns.
Use in Hidden Layers: Commonly used in hidden layers due to its zero-centered output,
facilitating easier learning for subsequent layers.
Figure: Tanh activation function.
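The relationship between tanh and the sigmoid can be verified numerically. The sketch below assumes a helper named tanh_via_sigmoid (a name introduced here for illustration) and compares it with the standard library's math.tanh.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x):
    """tanh(x) written as 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, tanh_via_sigmoid(x), math.tanh(x))
```

Both columns agree up to floating-point rounding, and the outputs are centered on zero, unlike the sigmoid's.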
3. ReLU (Rectified Linear Unit) Function
The ReLU activation is defined by $A(x) = \max(0, x)$. This means that if the input x is
positive, ReLU returns x; if the input is negative, it returns 0.
Value Range: $[0, \infty)$, meaning the function only outputs non-negative values.
Nature: It is a non-linear activation function, allowing neural networks to learn complex
patterns and making backpropagation more efficient.
Advantage over other Activation: ReLU is less computationally expensive than tanh
and sigmoid because it involves simpler mathematical operations. At any given time only a
subset of the neurons is active, which makes the network sparse and cheaper to
compute.
Figure: ReLU activation function.
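A small sketch of ReLU and the sparsity it produces is shown below. The pre-activation values are made up for illustration, and the gradient at exactly x = 0 is set to 0 by convention, which is an assumption of this sketch.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU; the value at x == 0 is taken to be 0 here."""
    return 1.0 if x > 0 else 0.0


# Hypothetical pre-activation values for five neurons in one layer.
pre_activations = [-1.5, 0.2, -0.3, 2.0, -0.7]
outputs = [relu(x) for x in pre_activations]
inactive = sum(1 for y in outputs if y == 0.0)

print(outputs)
print(inactive, "of", len(outputs), "neurons output zero")
```

Neurons with negative pre-activations output exactly zero, so they contribute nothing to the next layer for that input.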
3. Exponential Linear Units
1. Softmax Function
Softmax function is designed to handle multi-class classification problems. It transforms raw
output scores from a neural network into probabilities. It works by squashing the output values of
each class into the range of 0 to 1 while ensuring that the sum of all probabilities equals 1.
Softmax is a non-linear activation function.
The Softmax function ensures that each class is assigned a probability, helping to identify
which class the input belongs to.
Figure: Softmax activation function.
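The softmax computation can be sketched as follows. The scores are hypothetical, and subtracting the maximum score before exponentiating is a common numerical-stability step assumed here; it does not change the resulting probabilities.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))

print(probs)
print("predicted class:", predicted_class)
```

The class with the highest raw score keeps the highest probability, so the predicted class is simply the index of the largest output.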
2. SoftPlus Function
The Softplus function is defined mathematically as $A(x) = \log(1 + e^{x})$.
This equation ensures that the output is always positive and differentiable at all points, which is
an advantage over the traditional ReLU function.
Nature: The Softplus function is non-linear.
Range: The function outputs values in the range $(0, \infty)$, similar to ReLU, but without
the hard zero threshold that ReLU has.
Smoothness: Softplus is a smooth, continuous function, meaning it avoids the kink at zero,
where ReLU's derivative is discontinuous, which can sometimes lead to problems during optimization.
Figure: Softplus activation function.
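The sketch below compares Softplus with ReLU on a few inputs. The rewritten form max(x, 0) + log(1 + e^(-|x|)) is algebraically equal to log(1 + e^x) and is assumed here only to avoid overflow for large x.

```python
import math


def softplus(x):
    """log(1 + e^x), computed in an overflow-safe form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def relu(x):
    """Rectified linear unit, for comparison."""
    return max(0.0, x)


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, softplus(x), relu(x))
```

Softplus stays slightly above ReLU everywhere and approaches it for large positive or negative inputs, without the corner at zero.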
Parametric ReLU (PReLU)
Figure: Parametric ReLU function.
The parametric ReLU is a variant of ReLU that incorporates a learnable parameter to control the
slope for negative inputs. PReLU allows the slope to be adjusted during the training process,
so the network can learn a suitable slope instead of relying on a fixed one. This addresses
the main drawback of Leaky ReLU, whose negative slope is a fixed constant.
Mathematically, it is given as:
$$f(z) = \begin{cases} z, & z > 0 \\ a z, & z \le 0 \end{cases}$$
Here, 'z' represents the input, and 'a' represents the learnable parameter that controls the
slope.
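A minimal sketch of PReLU follows, assuming the standard definition in which the output is z for positive inputs and a * z otherwise. The function names, the learning rate, and the single gradient-descent step are illustrative assumptions, not part of the notes.

```python
def prelu(z, a):
    """Parametric ReLU: z for positive inputs, a * z otherwise."""
    return z if z > 0 else a * z


def prelu_grad_a(z):
    """Derivative of the PReLU output with respect to the slope a."""
    return 0.0 if z > 0 else z


def update_slope(a, z, upstream_grad, lr):
    """One gradient-descent step on a (hypothetical helper)."""
    return a - lr * upstream_grad * prelu_grad_a(z)


a = 0.25
print(prelu(2.0, a), prelu(-4.0, a))
print(update_slope(a, -4.0, upstream_grad=1.0, lr=0.01))
```

Only negative inputs contribute to the gradient for a, so the slope is learned from exactly the region where ReLU would output zero.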
Impact of Activation Functions on Model Performance
The choice of activation function has a direct impact on the performance of a neural network in
several ways:
1. Convergence Speed: Functions like ReLU allow faster training by mitigating the
vanishing gradient problem, while Sigmoid and Tanh can slow down convergence in deep
networks.
2. Gradient Flow: Activation functions like ReLU ensure better gradient flow, helping
deeper layers learn effectively. In contrast, Sigmoid can lead to small gradients, hindering
learning in deep layers.
3. Model Complexity: Activation functions like Softmax allow the model to handle
complex multi-class problems at the output layer, whereas simpler functions like ReLU or
Leaky ReLU are used in the hidden layers.
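The effect on gradient flow can be illustrated with a simplified calculation. The sketch below multiplies only the activation derivatives along one path through the layers and ignores the weights, which is a deliberate simplification; the depth and pre-activation values are assumptions chosen for illustration.

```python
import math


def sigmoid_grad(x):
    """Derivative of the sigmoid; never larger than 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU for a non-zero input."""
    return 1.0 if x > 0 else 0.0


def backprop_factor(grad_fn, pre_activations):
    """Product of activation derivatives along one path through the layers."""
    factor = 1.0
    for x in pre_activations:
        factor *= grad_fn(x)
    return factor


depth = 10
pre_acts = [1.0] * depth  # hypothetical positive pre-activation in every layer
sigmoid_factor = backprop_factor(sigmoid_grad, pre_acts)
relu_factor = backprop_factor(relu_grad, pre_acts)

print(sigmoid_factor)
print(relu_factor)
```

With the sigmoid, the factor shrinks geometrically with depth, while for active ReLU units it stays at 1, which is one way to see why ReLU tends to train faster in deep networks.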
Activation functions are the backbone of neural networks enabling them to capture non-linear
relationships in data. From classic functions like Sigmoid and Tanh to modern variants like
ReLU and Swish, each has its place in different types of neural networks.
Dropout as regularization
What is Dropout?
“Dropout” in machine learning refers to the process of randomly ignoring certain nodes in a
layer during training. In the figure below, the neural network on the left represents a typical
neural network where all units are activated. On the right, the red units have been dropped
out of the model: their weights and biases are not considered during that training step.
Dropout is used as a regularization technique. It reduces overfitting by discouraging units
from becoming codependent.
Other Common Regularization Methods
When it comes to combating overfitting, dropout is definitely not the only option. Common
regularization techniques include:
1. Early stopping: stop training automatically when a specific performance
measure (e.g., validation loss or accuracy) stops improving
2. Weight decay: incentivize the network to use smaller weights by adding a
penalty to the loss function (this ensures that the norms of the weights are relatively
evenly distributed amongst all the weights in the networks, which prevents just a few
weights from heavily influencing network output)
3. Noise: allow some random fluctuations in the data through augmentation
(which makes the network robust to a larger distribution of inputs and hence improves
generalization)
4. Model combination: average the outputs of separately trained neural networks
(requires a lot of computational power, data, and time)
Dropout remains an extremely popular protective measure against overfitting because of its
efficiency and effectiveness.
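Two of the methods listed above, weight decay and early stopping, can be sketched briefly. The function names, the penalty strength, the patience value, and the validation losses below are all hypothetical; the L2 form of the penalty is one common choice assumed here.

```python
def l2_penalty(weights, strength):
    """Weight-decay term added to the loss: strength * sum of squared weights."""
    return strength * sum(w * w for w in weights)


def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical values for illustration.
print(l2_penalty([1.0, -2.0], strength=0.1))
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6], patience=2))
```

In practice the penalty is added to the training loss at every step, and the model weights from the best validation epoch are usually the ones kept.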
How Does Dropout Work?
When we apply dropout to a neural network, we're creating a “thinned” network with unique
combinations of the units in the hidden layers being dropped randomly at different points in
time during training. Each time the model's parameters are updated, we generate a new
thinned neural network with different units dropped based on a probability hyperparameter p.
Training a network using dropout can thus be viewed as training many different thinned
neural networks and merging them into one network that picks up the key properties of each
thinned network.
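The random thinning described above can be sketched as a function applied to one layer's activations. This is a minimal sketch under stated assumptions: p is taken to be the probability of dropping a unit, and the surviving activations are rescaled by 1 / (1 - p) (often called inverted dropout) so that no rescaling is needed at test time. Neither convention is specified in the text, and the function name is hypothetical.

```python
import random


def dropout(activations, p, training=True, rng=random):
    """Randomly zero each activation with probability p during training.

    Kept activations are scaled by 1 / (1 - p) so the expected value
    of each unit is unchanged (an assumption of this sketch).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
layer = [1.0] * 10
print(dropout(layer, p=0.5, rng=rng))        # a different thinned layer per call
print(dropout(layer, p=0.5, rng=rng))
print(dropout(layer, p=0.5, training=False)) # unchanged at test time
```

Each training call produces a different mask, which corresponds to sampling a different thinned network for every update.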
It is observed that the models with dropout had a lower classification error than the same
models without dropout at any given point in time. A similar trend was observed when the
models were trained on other datasets in vision, as well as speech recognition and text
analysis. The lower error is because dropout helps prevent overfitting on the training data
by reducing the reliance of each unit in the hidden layer on other units in the hidden layers.