Data Discretization
INTRODUCTION
• Data binning is a common preprocessing technique that
groups continuous values into intervals called “bins” or
“buckets”.
• Each value that falls inside a small interval is replaced
with a single representative for that interval.
• In other words, binning converts a continuous variable into
a discrete/categorical variable with two or more values.
• Binning sometimes improves accuracy by reducing the
number of distinct continuous values, and it can help in
dealing with missing values, normalization, standardization
and formatting.
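To make the idea of a "single representative" concrete, here is a minimal sketch that replaces each value with the mean of its bin. The values and the bin edges are hypothetical, chosen only for illustration, and the bin mean is just one possible representative (the median or an interval label would work as well).

```python
import pandas as pd

# Hypothetical continuous values (illustration only)
values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Hypothetical bin edges; intervals are right-inclusive: (0, 10], (10, 20], ...
edges = [0, 10, 20, 30, 40]
bins = pd.cut(values, bins=edges)

# Replace every value with the mean of the bin it falls into
smoothed = values.groupby(bins, observed=True).transform("mean")

print(pd.DataFrame({"value": values, "bin": bins, "representative": smoothed}))
```

After this step the nine original values collapse to four distinct representatives, one per occupied bin.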
Example
• We have the loan amounts, in dollars, of 15 customers of
a bank:
10000, 20000, 1000, 500, 700, 850, 900, 1500, 12000, 16000,
1350, 16000, 8000, 7500, 850.
• Our task is to make 3 groups. Group 1 will have
customers with a loan amount between 0–1000, Group
2 between 1000–10000 and Group 3 between
10000–20000.
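The grouping above can be sketched with `pandas.cut`. The stated ranges share their boundary values (1000 and 10000), so this sketch assumes right-inclusive intervals, meaning 1000 goes to Group 1 and 10000 goes to Group 2; that convention is an assumption, not something the example specifies.

```python
import pandas as pd

# Loan amounts of the 15 customers from the example
loans = pd.Series([10000, 20000, 1000, 500, 700, 850, 900, 1500,
                   12000, 16000, 1350, 16000, 8000, 7500, 850])

edges = [0, 1000, 10000, 20000]
labels = ["Group 1", "Group 2", "Group 3"]

# Assumption: right-inclusive intervals (0, 1000], (1000, 10000], (10000, 20000]
groups = pd.cut(loans, bins=edges, labels=labels)

# Number of customers per group, in group order
counts = groups.value_counts().sort_index()
print(counts)
```

Under this convention the 15 customers split into 6, 5 and 4 for Groups 1, 2 and 3. Passing `right=False` to `pd.cut` would make the intervals left-inclusive instead and move the boundary values to the next group.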
Why Do We Need Binning?
1. Handling Outliers: Binning can reduce the impact of
outliers without removing data points.
2. Improving Model Performance: Some algorithms (such
as Bernoulli Naive Bayes) perform better with categorical
inputs.
3. Simplifying Visualization: Binned data can be easier
to visualize and interpret.
4. Reducing Overfitting: It can prevent models from
fitting to noise in high-precision data.
Which Data Needs Binning?
1. Continuous variables with wide ranges: Variables
with a large spread of values can often benefit from
grouping.
2. Skewed distributions: Binning can reduce the effect
of heavy skew in the data.
3. Variables with outliers: Binning can limit the
effect of extreme values.
4. High-cardinality numerical data: Variables with
many unique values can be simplified through binning.
Data That Usually Does Not Need Binning
1. Already categorical data: Variables that are already in
discrete categories don’t need further binning.
2. Discrete numerical data with few unique values: If a
variable only has a small number of possible values,
binning might not provide additional benefit.
3. Numeric IDs or codes: These are meant to be unique
identifiers, not for analysis.
4. Time series data: Time series data can be binned, but
this is less common and often requires specialized
techniques and careful consideration.
METHODS
• There are five methods of binning:
I. Binning by distance
II. Binning by frequency
III. Binning with the between function
IV. Binning by sampling
V. Binning by Fisher-Jenks algorithm
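As a rough illustration of the first three methods, the sketch below reuses the loan amounts from the earlier example. It assumes that "binning by distance" means equal-width intervals (`pandas.cut`), that "binning by frequency" means equal-count intervals (`pandas.qcut`), and that the "between function" refers to `Series.between`; the labels `low`, `mid` and `high` are hypothetical names. Sampling and Fisher-Jenks are not shown here.

```python
import pandas as pd

loans = pd.Series([10000, 20000, 1000, 500, 700, 850, 900, 1500,
                   12000, 16000, 1350, 16000, 8000, 7500, 850])
labels = ["low", "mid", "high"]  # hypothetical bin names

# I. By distance: 3 intervals of equal width between min and max
by_distance = pd.cut(loans, bins=3, labels=labels)

# II. By frequency: 3 intervals holding roughly the same number of values
by_frequency = pd.qcut(loans, q=3, labels=labels)

# III. between: boolean mask for one explicit range (both ends included by default)
in_mid_range = loans.between(1000, 10000)

print(by_distance.value_counts().sort_index())
print(by_frequency.value_counts().sort_index())
print(in_mid_range.sum())
```

On this data the equal-width bins are unbalanced (8, 4 and 3 values) because most loans are small, while the equal-frequency bins hold 5 values each. That difference is the usual reason to prefer one method over the other.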
The Dataset
• To demonstrate these binning techniques, we’ll use an
artificial dataset: the weather conditions at a golf
course, recorded on 15 different days.
Syntax
import pandas as pd
import numpy as np