SMOTE for Imbalanced Classification with Python
Introduction
Imbalanced datasets pose a common challenge for machine learning practitioners in binary classification
problems. This scenario frequently arises in practical business applications like fraud detection, spam
filtering, rare disease discovery, and hardware fault detection. To address this issue, one popular technique
is Synthetic Minority Oversampling Technique (SMOTE). SMOTE is specifically designed to tackle
imbalanced datasets by generating synthetic samples for the minority class. This article explores the
significance of SMOTE in dealing with class imbalance, focusing on its application in improving the
performance of classifier models. By mitigating bias and capturing important features of the minority
class, SMOTE contributes to more accurate predictions and better model performance.
This article was published as a part of the Data Science Blogathon.
Table of contents
The Accuracy Paradox
Dealing with Imbalanced Data
SMOTE: Synthetic Minority Oversampling Technique
ADASYN: Adaptive Synthetic Sampling Approach
Hybridization: SMOTE + Tomek Links
Hybridization: SMOTE + ENN
Performance Analysis after Resampling
End Notes
Frequently Asked Questions
The Accuracy Paradox
Suppose you're working on a health-insurance fraud detection problem. In such problems, we generally
observe that out of every 100 insurance claims, 99 are non-fraudulent and 1 is fraudulent. A binary
classifier therefore doesn't need to be complex at all: it can simply predict every outcome as 0, meaning
non-fraudulent, and still achieve a great accuracy of 99%. Clearly, in cases where the class distribution is
this skewed, the accuracy metric is biased and not preferable.
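To make the paradox concrete, here is a minimal sketch (not from the original article) on a synthetic 99:1 dataset; the scikit-learn DummyClassifier stands in for the "predict everything as 0" model described above.

# A majority-class baseline on a synthetic 99:1 dataset.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Roughly 99% negative (non-fraudulent) and 1% positive (fraudulent) claims.
X, y = make_classification(n_samples=10_000, n_features=10,
                           weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# A "model" that always predicts the majority class (0 = non-fraudulent).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))                  # roughly 0.99
print("Recall on the fraud class:", recall_score(y_test, y_pred))   # 0.0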
Dealing with Imbalanced Data
Resampling the data is one of the most commonly preferred approaches to deal with an imbalanced dataset.
There are broadly two types of methods for this: i) undersampling and ii) oversampling. In most cases,
oversampling is preferred over undersampling, because undersampling removes instances from the data
that may carry important information. In this article, I am specifically covering some special data
augmentation oversampling techniques: SMOTE and its related counterparts.
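As a point of reference before moving on to SMOTE, the sketch below shows plain random oversampling and undersampling with the imbalanced-learn library; the synthetic dataset and parameters are illustrative assumptions, not from the original article.

# Plain random resampling with imbalanced-learn on a synthetic 99:1 dataset.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=42)
print("Original distribution:", Counter(y))

# Oversampling: duplicate minority-class rows until both classes are equal in size.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print("After random oversampling:", Counter(y_over))

# Undersampling: drop majority-class rows, at the risk of losing useful information.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("After random undersampling:", Counter(y_under))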
SMOTE: Synthetic Minority Oversampling Technique
SMOTE is an oversampling technique where the synthetic samples are generated for the minority class.
This algorithm helps to overcome the overfitting problem posed by random oversampling. It focuses on the
feature space to generate new instances with the help of interpolation between the positive instances that
lie together.
Working Procedure
First, the total number of synthetic observations to generate, N, is set. Generally, it is selected so that the
binary class distribution becomes 1:1, but it can be tuned down based on need. The iteration then starts by
selecting a minority (positive) class instance at random. Next, its K nearest neighbours (5 by default) are
obtained. Finally, N of these K neighbours are chosen for interpolating new synthetic instances. To do that,
the difference between the feature vector and each chosen neighbour is computed using a distance metric.
This difference is multiplied by a random value in (0, 1] and added to the original feature vector. This is
pictorially represented below:
[Figure: SMOTE generating synthetic points along the line segments between a minority instance and its nearest neighbours. Source: GitHub]
Python Code for SMOTE Algorithm
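The original code listing is not reproduced here; what follows is a minimal sketch, assuming the imbalanced-learn implementation of SMOTE and a synthetic dataset for illustration.

# SMOTE with imbalanced-learn on a synthetic imbalanced dataset.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=42)
print("Before SMOTE:", Counter(y))

# k_neighbors=5 is the default KNN setting described above;
# sampling_strategy=1.0 requests a 1:1 class distribution after resampling.
smote = SMOTE(sampling_strategy=1.0, k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print("After SMOTE:", Counter(y_res))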
Though this algorithm is quite useful, it has a few drawbacks associated with it.
The synthetic instances are generated in the same direction, i.e. along the artificial line segments
connecting a minority instance to its neighbours. This in turn complicates the decision surface produced by
some classifier algorithms.
SMOTE tends to create a large number of noisy data points in the feature space.
ADASYN: Adaptive Synthetic Sampling Approach
ADASYN is a generalized form of the SMOTE algorithm. This algorithm also aims to oversample the
minority class by generating synthetic instances for it. But the difference here is it considers the density
distribution, r i which decides the no. of synthetic instances generated for samples which difficult to learn.
Due to this, it helps in adaptively changing the decision boundaries based on the samples difficult to learn.
This is the major difference compared to SMOTE.
Working Procedure
From the dataset, the total numbers of majority samples, N−, and minority samples, N+, are captured.
Then we preset a threshold, d_th, for the maximum tolerated degree of class imbalance; the imbalance
degree d = N+ / N− is compared against it. The total number of synthetic samples to be generated is
G = (N− − N+) × β, where β ∈ [0, 1] specifies the desired balance level after generation (β = 1 gives a
fully balanced dataset).
For every minority sample xᵢ, its K nearest neighbours are obtained using Euclidean distance, and the
ratio rᵢ = Δᵢ / K is calculated, where Δᵢ is the number of majority-class samples among those neighbours.
The ratios are then normalized as r̂ᵢ = rᵢ / Σⱼ rⱼ.
Thereafter, the number of synthetic samples for each xᵢ is gᵢ = r̂ᵢ × G. We then iterate from 1 to gᵢ and
generate samples the same way as in SMOTE.
The diagram below represents the above procedure.
[Figure: ADASYN oversampling procedure. Source: GitHub]
Python Code for ADASYN Algorithm
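Again, the original listing is not reproduced here; below is a minimal sketch assuming imbalanced-learn's ADASYN implementation and a synthetic dataset.

# ADASYN with imbalanced-learn on a synthetic imbalanced dataset.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=42)
print("Before ADASYN:", Counter(y))

# n_neighbors controls the KNN used to compute each density ratio r_i above.
adasyn = ADASYN(sampling_strategy=1.0, n_neighbors=5, random_state=42)
X_res, y_res = adasyn.fit_resample(X, y)
print("After ADASYN:", Counter(y_res))  # counts are approximate, since ADASYN is adaptive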
Hybridization: SMOTE + Tomek Links
Hybridization techniques combine undersampling and oversampling, with the goal of improving classifier
performance on the resampled data. SMOTE + Tomek is one such hybrid technique; it aims to clean the
overlapping data points between the classes in the sample space. After oversampling with SMOTE, the
class clusters may invade each other's space, which encourages the classifier to overfit. Tomek links are
pairs of samples from opposite classes that are each other's closest neighbours. Removing the majority-class
observations from these links is believed to increase class separation near the decision boundary. To get
better class clusters, Tomek links are applied to the data after the minority class has been oversampled by
SMOTE; in this hybrid, instead of removing observations only from the majority class, we generally remove
the observations from both classes that form a Tomek link.
Python Code for the SMOTE + Tomek Algorithm
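A minimal sketch of this hybrid, assuming imbalanced-learn's SMOTETomek class and a synthetic dataset:

# SMOTE followed by Tomek-link removal, using imbalanced-learn's combined sampler.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=42)
print("Before:", Counter(y))

# SMOTETomek first oversamples the minority class with SMOTE,
# then removes the Tomek links to clean the overlapping region between classes.
smote_tomek = SMOTETomek(random_state=42)
X_res, y_res = smote_tomek.fit_resample(X, y)
print("After SMOTE + Tomek:", Counter(y_res))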
Hybridization: SMOTE + ENN
SMOTE + ENN is another hybrid technique, one that removes a larger number of observations from the
sample space. Here, ENN (Edited Nearest Neighbours) is yet another undersampling technique: for each
observation, its nearest neighbours are examined, and if those neighbours misclassify that particular
instance, it gets deleted.
Integrating this technique with data oversampled by SMOTE allows for more extensive data cleaning.
Samples from both classes that are misclassified by their nearest neighbours are removed, which results in
a clearer and more concise class separation.
Python Code for SMOTE + ENN Algorithm
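A minimal sketch of this hybrid as well, assuming imbalanced-learn's SMOTEENN class and a synthetic dataset:

# SMOTE followed by Edited Nearest Neighbours cleaning, via imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=42)
print("Before:", Counter(y))

# SMOTEENN oversamples with SMOTE, then drops samples (from either class)
# that are misclassified by their nearest neighbours.
smote_enn = SMOTEENN(random_state=42)
X_res, y_res = smote_enn.fit_resample(X, y)
print("After SMOTE + ENN:", Counter(y_res))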
The picture below shows how the different SMOTE-based resampling techniques reshape imbalanced data.
Performance Analysis after Resampling
To understand the effect of oversampling, I will be using a bank customer churn dataset. It is an
imbalanced dataset where the target variable, churn, shows that 81.5% of customers did not churn and
18.5% of customers did.
A comparative analysis was done on the dataset using 3 classifier models: Logistic Regression, Decision
Tree, and Random Forest. As discussed earlier, we'll ignore the accuracy metric when evaluating the
classifiers on this imbalanced dataset. Here, we are more interested in knowing which customers will churn
in the coming months. Therefore, we'll focus on metrics like precision, recall, and F1-score to understand
how well each classifier identifies the customers who will churn.
Note: SMOTE and its related techniques are applied only to the training dataset so that the algorithm is
fitted properly on the data. The test data remains unchanged so that it correctly represents the original
distribution, as shown in the sketch below.
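The churn dataset and the exact modelling code are not reproduced here; the sketch below shows the pattern on a synthetic 81.5/18.5 dataset, assuming imbalanced-learn's SMOTE and a scikit-learn Random Forest: resample only the training split, leave the test split untouched, and read precision, recall, and F1 from the classification report.

# Resample only the training split; evaluate on the untouched test split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for the churn data: ~81.5% negative, ~18.5% positive.
X, y = make_classification(n_samples=10_000, weights=[0.815, 0.185], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42)

# SMOTE is fitted on the training data only; the test set keeps the original distribution.
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=42).fit(X_train_res, y_train_res)
print(classification_report(y_test, clf.predict(X_test), digits=3))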
From this comparison, it can be seen that on the original imbalanced dataset, all 3 classifier models failed to
generalize as well on the minority class as on the majority class. As a result, most of the negative-class
samples were correctly classified, so there were fewer false positives but many more false negatives. After
oversampling, a clear surge in recall is seen on the test data. To understand this better, a comparative
barplot is shown below for all 3 models:
Precision decreases, but recall improves substantially, which satisfies the objective of this kind of binary
classification problem. Also, the AUC-ROC and F1-score of each model remain more or less the same.
End Notes
The issue of class imbalance is not limited to binary classification problems; multi-class classification
problems suffer from it equally. It is therefore important to apply resampling techniques to such data so
that the models perform at their best and give the most accurate predictions possible.
You can check the entire implementation in my GitHub repository and try to apply them at your end. Do
explore other techniques that help in handling an imbalanced dataset.
Frequently Asked Questions
Q1. What is SMOTE?
A. SMOTE is an oversampling technique that generates synthetic samples from the minority class. It is used
to obtain a synthetically class-balanced or nearly class-balanced training set, on which the classifier is then
trained.
Q2. What is SMOTE used for?
A. SMOTE is used for synthetic minority oversampling in machine learning. It generates synthetic samples
to balance imbalanced datasets, specifically targeting the minority class.
Q3. When should you use SMOTE?
A. SMOTE should be used when dealing with imbalanced datasets, to improve the performance of machine
learning models on minority-class predictions.
Q4. What is the difference between SMOTE and oversampling?
A. The main difference is that SMOTE generates synthetic samples by interpolating between existing
minority samples, while plain random oversampling simply replicates existing minority samples to balance
the dataset.
Q5. Does SMOTE reduce accuracy?
A. SMOTE may improve minority-class performance, but it can also create unrealistic synthetic samples,
which may hurt overall accuracy; the impact depends on the dataset and model. Consider alternative
techniques like undersampling, cost-sensitive learning, or ensemble methods, and experiment with different
techniques to determine the best approach for your dataset.
SWASTIK SATPATHY