DATA DISCRETIZATION

INTRODUCTION
• Data binning is a common preprocessing technique used to
group intervals of continuous data into “bins” or “buckets”.
• Grouping data into bins (or buckets) means replacing the
values that fall within a small interval with a single
representative value for that interval.
• In other words, binning converts a continuous variable into
two or more discrete/categorical values.
• The binning process can improve accuracy by smoothing
continuous values, and it helps with handling missing values,
normalization, standardization, and formatting.
Example
• We have the loan amounts of 15 customers from a bank, in
dollars: 10000, 20000, 1000, 500, 700, 850, 900, 1500, 12000,
16000, 1350, 16000, 8000, 7500, 850.
• Our task is to make 3 groups: Group 1 will have customers
with a loan amount between 0–1000, Group 2 between
1000–10000, and Group 3 between 10000–20000, as sketched
below.
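A minimal sketch of this grouping with pandas, assuming the
stated edges are right-inclusive (the pd.cut default):

import pandas as pd

# Loan amounts for the 15 customers, in dollars
loans = pd.Series([10000, 20000, 1000, 500, 700, 850, 900, 1500,
                   12000, 16000, 1350, 16000, 8000, 7500, 850])

# Three custom bins: 0–1000, 1000–10000, 10000–20000
groups = pd.cut(loans, bins=[0, 1000, 10000, 20000],
                labels=['Group 1', 'Group 2', 'Group 3'])
print(groups.value_counts())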
Why Do We Need Binning?
1.Handling Outliers: Binning can reduce the impact of
outliers without removing data points.
2.Improving Model Performance: Some algorithms
perform better with categorical inputs (such as Bernoulli
Naive Bayes).
3.Simplifying Visualization: Binned data can be easier
to visualize and interpret.
4.Reducing Overfitting: It can prevent models from
fitting to noise in high-precision data.
Which Data Needs Binning?
1.Continuous variables with wide ranges: Variables
with a large spread of values can often benefit from
grouping.
2.Skewed distributions: Binning can help normalize
heavily skewed data.
3.Variables with outliers: Binning can handle the
effect of extreme values.
4.High-cardinality numerical data: Variables with
many unique values can be simplified through binning.
Data That Usually Does Not Need
Binning
1.Already categorical data: Variables that are already in
discrete categories don’t need further binning.
2.Discrete numerical data with few unique values: If a
variable only has a small number of possible values,
binning might not provide additional benefit.
3.Numeric IDs or codes: These are meant to be unique
identifiers, not for analysis.
4.Time series data: Time series can be binned, but it usually
requires specialized techniques and careful consideration, so
binning it is less common overall.
METHODS
• There are six methods of binning covered here:
I. Equal-width binning
II. Equal-frequency (quantile) binning
III. Custom binning
IV. Logarithmic binning
V. Standard deviation-based binning
VI. K-means binning
The Dataset
• To demonstrate these binning techniques, we’ll use this
artificial dataset: weather conditions at a golf course,
collected on 15 different days.
Syntax
import pandas as pd
import numpy as np

# Create the dataset as a dictionary
data = {
    'UVIndex': [2, 10, 1, 7, 3, 9, 5, 11, 1, 8, 3, 9, 11, 5, 7],
    'Humidity': [15, 95, 10, 98, 18, 90, 25, 80, 95, 40, 20, 30, 85, 92, 12],
    'WindSpeed': [2, 90, 1, 30, 3, 10, 40, 5, 60, 15, 20, 45, 25, 35, 50],
    'RainfallAmount': [5, 2, 7, 3, 18, 3, 0, 1, 25, 0, 9, 0, 18, 7, 0],
    'Temperature': [68, 60, 63, 55, 50, 56, 57, 65, 66, 68, 71, 72, 79, 83, 81],
    'Crowdedness': [0.15, 0.98, 0.1, 0.85, 0.2, 0.9, 0.92, 0.25, 0.12, 0.99,
                    0.2, 0.8, 0.05, 0.3, 0.95]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
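Before choosing bin edges, it can help to glance at each
column’s range and spread; a quick check (not part of the
original walkthrough):

# Inspect min, quartiles, and max to guide the binning choices
print(df.describe().loc[['min', '25%', '50%', '75%', 'max']])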
Method 1: Equal-Width Binning
• Equal-width binning divides the range of a variable into a
specified number of intervals, all with the same width.
• Common Data Type: This method works well for data with a
roughly uniform distribution and when the minimum and
maximum values are meaningful.
• In our Case: Let’s apply equal-width binning to our UV Index
variable. We’ll create four bins: Low, Moderate, High, and Very
High. We chose this method for UV Index because it gives us a
clear, intuitive division of the index range, which could be
useful for understanding how different index ranges affect
golfing decisions.
Syntax
# 1. Equal-Width Binning for UVIndex
df['UVIndexBinned'] = pd.cut(df['UVIndex'], bins=4,
                             labels=['Low', 'Moderate', 'High', 'Very High'])
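To see the edges pd.cut chose, you can ask for them back with
retbins=True; for UVIndex (values 1 to 11) each bin is
(11 − 1) / 4 = 2.5 units wide. A small verification sketch:

# retbins=True also returns the computed bin edges
binned, edges = pd.cut(df['UVIndex'], bins=4,
                       labels=['Low', 'Moderate', 'High', 'Very High'],
                       retbins=True)
print(edges)                  # roughly [0.99, 3.5, 6.0, 8.5, 11.0]
print(binned.value_counts())  # observations per bin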
Method 2: Equal-Frequency Binning
(Quantile Binning)
• Equal-frequency binning creates bins that contain
approximately the same number of observations.
• Common Data Type: This method is particularly useful for
skewed data or when you want to ensure a balanced
representation across categories.
• In our Case: Let’s apply equal-frequency binning to our
Humidity variable, creating three bins: Low, Medium, and High.
We chose this method for Humidity because it ensures we have
an equal number of observations in each category, which can
be helpful if humidity values are not evenly distributed across
their range.
Syntax
# 2. Equal-Frequency Binning for Humidity
df['HumidityBinned'] = pd.qcut(df['Humidity'], q=3,
                               labels=['Low', 'Medium', 'High'])
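Unlike pd.cut, pd.qcut places the edges at quantiles rather
than at equal distances, so each bin should hold about five of
the fifteen rows. A quick check:

# Quantile-based edges: each bin gets ~len(df)/3 observations
binned, edges = pd.qcut(df['Humidity'], q=3,
                        labels=['Low', 'Medium', 'High'], retbins=True)
print(edges)                  # edges sit at the 0%, 33%, 67%, 100% quantiles
print(binned.value_counts())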
Method 3: Custom Binning
• Custom binning allows you to define your own bin edges
based on domain knowledge or specific requirements.
• Common Data Type: This method is ideal when you
have specific thresholds that are meaningful in your
domain or when you want to focus on particular ranges
of values.
• In our Case: Let’s apply custom binning to our Rainfall
Amount. We chose this method for this column because
there are standardized categories for rain that are more
meaningful than arbitrary divisions.
Syntax
# 3. Custom Binning for RainfallAmount
df['RainfallAmountBinned'] = pd.cut(df['RainfallAmount'],
                                    bins=[-np.inf, 2, 4, 12, np.inf],
                                    labels=['No Rain', 'Drizzle', 'Rain',
                                            'Heavy Rain'])
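The -np.inf and np.inf guard edges make sure every value is
captured. With right-closed intervals (the pandas default),
amounts of 0–2 map to 'No Rain', 2–4 to 'Drizzle', 4–12 to
'Rain', and anything above 12 to 'Heavy Rain'. A quick check
of the result:

# Tally how the custom thresholds split the 15 days
print(df.groupby('RainfallAmountBinned', observed=True)['RainfallAmount']
        .agg(['min', 'max', 'count']))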
Method 4: Logarithmic Binning
• Logarithmic binning creates bins that grow exponentially in
size. The method basically applies log transformation first then
performs equal-width binning.
• Common Data Type: This method is particularly useful for
data that spans several orders of magnitude or follows a power
law distribution.
• In our Case: Let’s apply logarithmic binning to our Wind
Speed variable. We chose this method for Wind Speed because
the effect of wind on a golf ball’s trajectory might not be linear.
A change from 0 to 5 mph might be more significant than a
change from 20 to 25 mph.
Syntax
# 4. Logarithmic Binning for WindSpeed
df['WindSpeedBinned'] = pd.cut(np.log1p(df['WindSpeed']), bins=3,
                               labels=['Light', 'Moderate', 'Strong'])
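Because the cut happens in log space, the edges are evenly
spaced in log1p(x) but grow exponentially in the original
units; np.expm1 converts them back. A small verification sketch:

# Recover the bin edges in the original wind-speed units
_, log_edges = pd.cut(np.log1p(df['WindSpeed']), bins=3, retbins=True)
print(np.expm1(log_edges))  # exponentially widening edges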
Method 5: Standard Deviation-based
Binning
• Standard Deviation based binning creates bins based on
the number of standard deviations away from the mean.
This approach is useful when working with normally
distributed data or when you want to bin data based on
how far values deviate from the central tendency.
• Variations: The exact number of standard deviations
used for binning can be adjusted based on the specific
needs of the analysis. The number of bins is typically
odd (to have a central bin). Some implementations
might use unequal bin widths, with narrower bins near
the mean and wider bins in the tails.
• Common Data Type: This method is well-suited for
data that follows a normal distribution or when you
want to identify outliers and understand the spread of
your data. May not be suitable for highly skewed
distributions.
• In our Case: Let’s apply this binning method to our
Temperature variable. We chose this method for
Temperature because it allows us to categorize
temperatures based on how they deviate from the
average, which can be particularly useful in
understanding weather patterns or climate trends.
Syntax
# 5. Standard Deviation-Based Binning for Temperature
mean_temp, std_dev = df['Temperature'].mean(), df['Temperature'].std()
bin_edges = [
    float('-inf'),  # Ensure all values are captured
    mean_temp - 2.5 * std_dev,
    mean_temp - 1.5 * std_dev,
    mean_temp - 0.5 * std_dev,
    mean_temp + 0.5 * std_dev,
    mean_temp + 1.5 * std_dev,
    mean_temp + 2.5 * std_dev,
    float('inf')  # Ensure all values are captured
]
df['TemperatureBinned'] = pd.cut(df['Temperature'], bins=bin_edges,
                                 labels=['Very Low', 'Low', 'Below Avg',
                                         'Average', 'Above Avg', 'High',
                                         'Very High'])
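Printing the edges next to the mean and standard deviation
makes the scheme concrete; with only 15 temperatures, most
rows should land in the central bins. A quick check:

# Show where the ±0.5, ±1.5, ±2.5 sigma edges fall for this sample
print(f"mean={mean_temp:.1f}, std={std_dev:.1f}")
print([round(edge, 1) for edge in bin_edges[1:-1]])
print(df['TemperatureBinned'].value_counts())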
Method 6: K Means Binning
• K-Means binning uses the K-Means clustering algorithm to create
bins. It groups data points into clusters based on how similar the data
points are to each other, with each cluster becoming a bin.
• Common Data Type: This method is great for finding groups in data
that might not be obvious at first. It works well with data that has one
peak or several peaks, and it can adjust to the way the data is
organized.
• In our Case: Let’s apply K-Means binning to our Crowdedness
variable. We chose this method for Crowdedness because it might
reveal natural groupings in how busy the golf course gets, which
could be influenced by various factors not captured by simple
threshold-based binning.
Syntax
# 6. K-Means Binning for Crowdedness
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(df[['Crowdedness']])

# KMeans labels are arbitrary, so rank the clusters by their
# centres before mapping them to the ordered categories
order = kmeans.cluster_centers_.ravel().argsort().argsort()
df['CrowdednessBinned'] = pd.Categorical.from_codes(
    order[kmeans.labels_], categories=['Low', 'Medium', 'High'])
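Inspecting the cluster centres shows where the natural break
points fell; unlike the other methods, these boundaries come
from the data itself rather than from a formula. A quick check:

# Cluster centres, sorted low to high, act as bin representatives
print(sorted(kmeans.cluster_centers_.ravel()))
print(df['CrowdednessBinned'].value_counts())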
Conclusion
• We applied six different ways to ‘discretize’ the numbers in
our golf data, so the final dataset now looks like this:
# Print only the binned columns
binned_columns = [col for col in df.columns if col.endswith('Binned')]
print(df[binned_columns])
