Overfitting in Machine Learning
In the real world, the available dataset will never be clean and perfect. Every dataset
contains some impurities, such as noisy data, outliers, missing values, or imbalanced
classes. Due to these impurities, different problems occur that affect the accuracy and
the performance of the model. One such problem is Overfitting in Machine Learning.
A statistical model is said to be overfitted if it cannot generalize well to unseen data.
Before understanding overfitting, we need to know some basic terms, which are:
Noise: Noise is meaningless or irrelevant data present in the dataset. It affects the
performance of the model if it is not removed.
Bias: Bias is the prediction error introduced because the learning algorithm oversimplifies
the problem. It shows up as a systematic difference between the predicted values and the
actual values.
Variance: Variance is the error introduced because the model is too sensitive to the
training data. A model that performs well on the training dataset but poorly on the test
dataset has high variance.
Generalization: It shows how well a model is trained to predict unseen data.
What is Overfitting?
o Overfitting and underfitting are the two main problems that cause poor performance
of machine learning models.
o Overfitting occurs when the model fits the training data more closely than required and
tries to capture each and every data point fed to it. As a result, it also captures the
noise and inaccuracies in the dataset, which degrades the performance of the
model.
o An overfitted model doesn't perform accurately with the test/unseen dataset
and can’t generalize well.
o An overfitted model is said to have low bias and high variance.
Example to Understand Overfitting
We can understand overfitting with a general example. Suppose there are three
students, X, Y, and Z, and all three are preparing for an exam. X has studied only three
sections of the book and skipped all the others. Y has a good memory and has
memorized the whole book. The third student, Z, has studied and practiced all the
questions. In the exam, X will only be able to solve questions related to the sections
he studied. Student Y will only be able to solve questions that appear exactly as given
in the book. Student Z will be able to solve all the exam questions in a proper way.
The same happens with machine learning. If the algorithm learns from only a small part of
the data, it is unable to capture the required patterns and is therefore underfitted, like
student X. If the model memorizes the training dataset, like student Y, it performs very
well on the seen data but badly on unseen data or unknown instances; in such cases, the
model is said to be overfitted.
And if the model performs well with the training dataset and also with the test/unseen
dataset, like student Z, it is said to be a good fit.
How to detect Overfitting?
Overfitting in a model can only be detected once the model is evaluated on data it has
not seen during training. To detect the issue, we can perform a train/test split.
In the train-test split, we divide the dataset randomly into a training set and a test set.
We train the model with the training dataset, which is about 80% of the total data. After
training the model, we test it with the test dataset, which is the remaining 20% of the
total dataset.
Now, if the model performs well with the training dataset but not with the test dataset,
then it is likely to have an overfitting issue.
For example, if the model shows 85% accuracy on the training data but only 50% accuracy
on the test dataset, it is not generalizing well and is likely overfitted.
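As a rough sketch of this check (assuming scikit-learn is available; its built-in breast cancer dataset is used here only as a stand-in for any classification problem), we can compare the training and test scores of a model that tends to overfit, such as an unconstrained decision tree:

```python
# Sketch: detect overfitting by comparing training and test accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 80% of the data for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained decision tree tends to fit the training data perfectly.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print("Training accuracy:", round(model.score(X_train, y_train), 2))
print("Test accuracy:    ", round(model.score(X_test, y_test), 2))
# A large gap between the two scores is the usual sign of overfitting.
```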
Underfitting
Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. This can happen, for example, when training is stopped at
too early a stage in order to avoid overfitting, so the model does not learn enough from
the training data and fails to find the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training
data; hence its accuracy is reduced and it produces unreliable predictions.
An underfitted model has high bias and low variance.
Example: We can understand underfitting from the output of a linear regression model
fitted to data with a curved trend: the fitted straight line is unable to capture the
data points present in the plot.
How to avoid underfitting:
o By increasing the training time of the model.
o By increasing the number of features (a sketch of this option follows below).
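The second point can be illustrated with a small sketch (scikit-learn assumed; the quadratic toy data is only for illustration). A plain straight line underfits curved data, while adding polynomial features gives the same algorithm enough capacity:

```python
# Sketch: a straight line underfits quadratic data; adding polynomial
# features gives the model enough capacity to capture the trend.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.2, size=100)  # quadratic signal + noise

# Plain linear regression on the raw feature underfits the curve.
underfit = LinearRegression().fit(X, y)
print("Linear model R^2:            ", round(underfit.score(X, y), 2))

# Adding a squared feature lets the same algorithm fit the trend.
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
better = LinearRegression().fit(X_poly, y)
print("With polynomial features R^2:", round(better.score(X_poly, y), 2))
```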
Goodness of Fit
The term "goodness of fit" is taken from statistics, and the goal of a machine
learning model is to achieve it. In statistical modeling, it defines how closely the
predicted values match the true values of the dataset.
A model with a good fit lies between an underfitted and an overfitted model; ideally,
it would make predictions with zero error, but in practice this is difficult to achieve.
As we train our model for some time, the errors on the training data go down, and
the same happens with the test data. But if we train the model for too long, its
performance may decrease due to overfitting, as the model also learns the noise present
in the dataset, and the errors on the test dataset start increasing. The point just
before the test error starts rising is the good point, and we can stop there to achieve
a good model.
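The sketch below illustrates this idea with model complexity (polynomial degree) standing in for training duration; the toy data and the chosen degrees are only illustrative:

```python
# Sketch: training error keeps falling as model complexity grows, while
# test error falls and then rises again; the good fit is near the minimum
# of the test error (here complexity = polynomial degree).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:>2}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
# degree 1 underfits (both errors high), degree 15 overfits (tiny training
# error, larger test error); degree 3 is close to the good fit.
```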
Ways to Prevent Overfitting
Although overfitting is a problem in machine learning that reduces the performance of
the model, we can prevent it in several ways. Using a linear model can help avoid
overfitting; however, many real-world problems are non-linear, so it is important to have
other ways to prevent overfitting in our models. Below are several techniques that can be
used to prevent overfitting:
1. Early Stopping
2. Train with more data
3. Feature Selection
4. Cross-Validation
5. Data Augmentation
6. Regularization
7. Ensemble Methods
Early Stopping
In this technique, the training is paused before the model starts learning the noise
in the training data. While training the model iteratively, we measure the performance
of the model after each iteration and continue training only as long as new iterations
keep improving the performance of the model.
After that point, the model begins to overfit the training data; hence we need to stop
the process before the learner passes that point.
Stopping the training process before the model starts capturing noise from the data
is known as early stopping.
However, this technique may lead to the underfitting problem if training is paused too
early. So, it is very important to find that "sweet spot" between underfitting and
overfitting.
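One possible sketch of early stopping uses scikit-learn's SGDClassifier, which can hold out a validation fraction and stop once the validation score stops improving; the dataset and parameter values below are only illustrative:

```python
# Sketch: early stopping with SGDClassifier. A fraction of the training data
# is held out as a validation set, and training stops once the validation
# score has not improved for n_iter_no_change consecutive iterations.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling helps SGD converge; it is not part of early stopping itself.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

model = SGDClassifier(
    max_iter=1000,
    early_stopping=True,      # hold out part of the training data for validation
    validation_fraction=0.1,  # 10% of the training data is used for validation
    n_iter_no_change=5,       # stop after 5 iterations without improvement
    random_state=0,
)
model.fit(X_train, y_train)

print("Iterations actually run:", model.n_iter_)
print("Test accuracy:", round(model.score(X_test, y_test), 2))
```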
Train with More data
Increasing the training set by including more data can enhance the accuracy of the
model, as it provides more chances to discover the relationship between input and
output variables.
Adding more data may not always prevent overfitting, but it helps the algorithm detect
the signal better and minimize the errors.
When a model is fed more training data, it cannot simply memorize every sample and is
forced to generalize.
In some cases, however, the additional data may add more noise to the model; hence we
need to make sure the data is clean and free from inconsistencies before feeding it to
the model.
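A rough way to see this effect is scikit-learn's learning_curve, which trains the same model on increasing fractions of the data; the dataset and model below are just placeholders:

```python
# Sketch: learning_curve trains the same model on growing fractions of the
# data; the gap between training and validation accuracy usually narrows
# as more (clean) data is used.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% up to 100% of the training folds
    cv=5,
)

for size, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"train size={size:>4}  train acc={tr:.2f}  validation acc={va:.2f}")
```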
Feature Selection
While building an ML model, we have a number of parameters or features that are
used to predict the outcome. However, sometimes some of these features are
redundant or less important for the prediction, and the feature selection process is
applied for this reason. In feature selection, we identify the most important features
in the training data and remove the others. This helps simplify the model and reduces
noise from the data. Some algorithms perform feature selection automatically; if not,
we can do it manually.
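A minimal sketch of manual feature selection with scikit-learn's SelectKBest is shown below; the choice of k = 10 and the dataset are only illustrative:

```python
# Sketch: keep only the k features most related to the target with SelectKBest,
# then train the model on the reduced feature set.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Keep the 10 most informative of the 30 original features.
selector = SelectKBest(score_func=f_classif, k=10).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

model = LogisticRegression(max_iter=5000).fit(X_train_sel, y_train)
print("Features kept:", X_train_sel.shape[1])
print("Test accuracy:", round(model.score(X_test_sel, y_test), 2))
```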
Cross-Validation
Cross-validation is one of the powerful techniques to prevent overfitting.
In the general k-fold cross-validation technique, we divide the dataset into k equal-sized
subsets of data, known as folds. The model is trained on k-1 folds and evaluated on the
remaining fold, and this process is repeated k times so that every fold serves as the
evaluation set once. The average score over the k rounds gives a more reliable estimate
of how well the model generalizes.
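A minimal sketch of 5-fold cross-validation with scikit-learn's cross_val_score follows; the model and dataset are only stand-ins:

```python
# Sketch: 5-fold cross-validation. The model is trained and scored five times,
# each time validating on a different fold.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("Fold accuracies:", scores.round(2))
print("Mean accuracy:  ", round(scores.mean(), 2))
# A mean score well below the training accuracy, or a large spread across
# folds, suggests the model is not generalizing well.
```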
Data Augmentation
Data augmentation is a technique that serves as an alternative to collecting more data
in order to prevent overfitting. Instead of adding more training data, slightly modified
copies of the existing data are added to the dataset.
Data augmentation makes each data sample appear slightly different every time it is
processed by the model. Hence each sample appears unique to the model, which helps
prevent overfitting.
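A minimal, framework-free sketch of the idea for image-like data is shown below; the image shapes, the flip, and the noise level are arbitrary illustrative choices:

```python
# Sketch: simple augmentation for image-like data. Each original sample yields
# a slightly modified copy (random horizontal flip plus small pixel noise),
# so the model never sees exactly the same sample twice.
import numpy as np

rng = np.random.RandomState(0)
images = rng.rand(100, 28, 28)  # stand-in for a real image dataset

def augment(image, rng):
    """Return a slightly modified copy of one image."""
    out = image[:, ::-1] if rng.rand() < 0.5 else image  # random horizontal flip
    out = out + rng.normal(scale=0.05, size=out.shape)   # small pixel noise
    return np.clip(out, 0.0, 1.0)

augmented = np.stack([augment(img, rng) for img in images])
combined = np.concatenate([images, augmented])  # originals plus augmented copies
print("Samples before/after augmentation:", len(images), "->", len(combined))
```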
Regularization
If overfitting occurs because a model is complex, we can reduce the number of features.
However, overfitting may also occur with a simpler model, in particular a linear model,
and in such cases regularization techniques are very helpful.
Regularization is the most popular technique to prevent overfitting. It is a group of
methods that forces the learning algorithm to produce a simpler model. Applying
regularization may slightly increase the bias but reduces the variance. In this
technique, we modify the objective function by adding a penalty term, which takes a
higher value for a more complex model.
The two commonly used regularization techniques are L1 Regularization and L2
Regularization.
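The sketch below compares an unregularized linear model with Ridge (L2) and Lasso (L1) regression on a standard regression dataset; the alpha values are only illustrative:

```python
# Sketch: L2 (Ridge) and L1 (Lasso) add a penalty on the size of the model's
# coefficients; L1 can shrink some coefficients to exactly zero.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [
    ("No regularization", LinearRegression()),
    ("L2 (Ridge)", Ridge(alpha=1.0)),   # penalizes the sum of squared coefficients
    ("L1 (Lasso)", Lasso(alpha=0.1)),   # penalizes the sum of absolute coefficients
]
for name, model in models:
    model.fit(X_train, y_train)
    print(f"{name:<18} test R^2 = {model.score(X_test, y_test):.2f}  "
          f"zero coefficients = {int(np.sum(model.coef_ == 0))}")
```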
Ensemble Methods
In ensemble methods, the predictions of several different machine learning models are
combined to identify the most popular result.
The most commonly used ensemble methods are Bagging and Boosting.
In bagging, several sample datasets are drawn from the training data with replacement,
so individual data points can be selected more than once. A separate model is trained on
each sample dataset independently, and, depending on the type of task (regression or
classification), the average or the majority vote of those predictions is used to produce
a more accurate result. Moreover, bagging reduces the chances of overfitting in complex
models.
In boosting, a large number of weak learners arranged in a sequence are trained in
such a way that each learner in the sequence learns from the mistakes of the learner
before it. It combines all the weak learners to come out with one strong learner. In
addition, it improves the predictive flexibility of simple models.
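A short sketch of both methods with scikit-learn follows; the dataset and the number of estimators are only illustrative:

```python
# Sketch: bagging (many trees on bootstrap samples, predictions combined by
# voting) versus boosting (shallow trees trained in sequence, each correcting
# the errors of the previous ones).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: the default base estimator is a decision tree; bootstrap sampling
# means individual data points can be drawn more than once.
bagging = BaggingClassifier(n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)

# Boosting: a sequence of weak learners, each trained on the mistakes of the last.
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)
boosting.fit(X_train, y_train)

print("Bagging test accuracy: ", round(bagging.score(X_test, y_test), 2))
print("Boosting test accuracy:", round(boosting.score(X_test, y_test), 2))
```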