DATA DISCRETIZATION

INTRODUCTION
• Data binning is a common preprocessing technique used to
group intervals of continuous data into “bins” or “buckets”.
• Grouping data into bins (or buckets) means replacing the
values that fall within a small interval with a single
representative value for that interval.
• In other words, binning converts a continuous variable into
two or more discrete/categorical values.
• The binning process can improve accuracy by smoothing
continuous values, and it helps with handling missing values,
normalization, standardization, and formatting.
Example
• We have the loan amounts of 15 customers from a bank, in
dollars: 10000, 20000, 1000, 500, 700, 850, 900, 1500, 12000,
16000, 1350, 16000, 8000, 7500, 850.
• Our task is to make 3 groups: Group 1 will have customers
with a loan amount between 0–1000, Group 2 between
1000–10000, and Group 3 between 10000–20000, as sketched
below.
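A minimal sketch of this grouping with pandas, assuming the
stated edges are right-inclusive (the pd.cut default):

import pandas as pd

# Loan amounts for the 15 customers, in dollars
loans = pd.Series([10000, 20000, 1000, 500, 700, 850, 900, 1500,
                   12000, 16000, 1350, 16000, 8000, 7500, 850])

# Three custom bins: 0–1000, 1000–10000, 10000–20000
groups = pd.cut(loans, bins=[0, 1000, 10000, 20000],
                labels=['Group 1', 'Group 2', 'Group 3'])
print(groups.value_counts())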
Why Do We Need Binning?
1.Handling Outliers: Binning can reduce the impact of
outliers without removing data points.
2.Improving Model Performance: Some algorithms
perform better with categorical inputs (such as Bernoulli
Naive Bayes).
3.Simplifying Visualization: Binned data can be easier
to visualize and interpret.
4.Reducing Overfitting: It can prevent models from
fitting to noise in high-precision data.
Which Data Needs Binning?
1.Continuous variables with wide ranges: Variables
with a large spread of values can often benefit from
grouping.
2.Skewed distributions: Binning can help normalize
heavily skewed data.
3.Variables with outliers: Binning can handle the
effect of extreme values.
4.High-cardinality numerical data: Variables with
many unique values can be simplified through binning.
Data That Usually Does Not Need
Binning
1.Already categorical data: Variables that are already in
discrete categories don’t need further binning.
2.Discrete numerical data with few unique values: If a
variable only has a small number of possible values,
binning might not provide additional benefit.
3.Numeric IDs or codes: These are meant to be unique
identifiers, not for analysis.
4.Time series data: Time series can be binned, but it usually
requires specialized techniques and careful consideration, so
binning it is less common overall.
METHODS
• There are six methods of binning covered here:
I. Equal-width binning
II. Equal-frequency (quantile) binning
III. Custom binning
IV. Logarithmic binning
V. Standard deviation-based binning
VI. K-means binning
The Dataset
• To demonstrate these binning techniques, we’ll use this
artificial dataset: weather conditions at a golf course,
collected on 15 different days.
Syntax
import pandas as pd
import numpy as np

# Create the dataset as a dictionary
data = {
    'UVIndex': [2, 10, 1, 7, 3, 9, 5, 11, 1, 8, 3, 9, 11, 5, 7],
    'Humidity': [15, 95, 10, 98, 18, 90, 25, 80, 95, 40, 20, 30, 85, 92, 12],
    'WindSpeed': [2, 90, 1, 30, 3, 10, 40, 5, 60, 15, 20, 45, 25, 35, 50],
    'RainfallAmount': [5, 2, 7, 3, 18, 3, 0, 1, 25, 0, 9, 0, 18, 7, 0],
    'Temperature': [68, 60, 63, 55, 50, 56, 57, 65, 66, 68, 71, 72, 79, 83, 81],
    'Crowdedness': [0.15, 0.98, 0.1, 0.85, 0.2, 0.9, 0.92, 0.25, 0.12, 0.99,
                    0.2, 0.8, 0.05, 0.3, 0.95]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
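Before choosing bin edges, it can help to glance at each
column’s range and spread; a quick check (not part of the
original walkthrough):

# Inspect min, quartiles, and max to guide the binning choices
print(df.describe().loc[['min', '25%', '50%', '75%', 'max']])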
Method 1: Equal-Width Binning
• Equal-width binning divides the range of a variable into a
specified number of intervals, all with the same width.
• Common Data Type: This method works well for data with a
roughly uniform distribution and when the minimum and
maximum values are meaningful.
• In our Case: Let’s apply equal-width binning to our UV Index
variable. We’ll create four bins: Low, Moderate, High, and Very
High. We chose this method for UV Index because it gives us a
clear, intuitive division of the index range, which could be
useful for understanding how different index ranges affect
golfing decisions.
Syntax
# 1. Equal-Width Binning for UVIndex
df['UVIndexBinned'] = pd.cut(df['UVIndex'], bins=4,
                             labels=['Low', 'Moderate', 'High', 'Very High'])
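To see the edges pd.cut chose, you can ask for them back with
retbins=True; for UVIndex (values 1 to 11) each bin is
(11 − 1) / 4 = 2.5 units wide. A small verification sketch:

# retbins=True also returns the computed bin edges
binned, edges = pd.cut(df['UVIndex'], bins=4,
                       labels=['Low', 'Moderate', 'High', 'Very High'],
                       retbins=True)
print(edges)                  # roughly [0.99, 3.5, 6.0, 8.5, 11.0]
print(binned.value_counts())  # observations per bin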
Method 2: Equal-Frequency Binning
(Quantile Binning)
• Equal-frequency binning creates bins that contain
approximately the same number of observations.
• Common Data Type: This method is particularly useful for
skewed data or when you want to ensure a balanced
representation across categories.
• In our Case: Let’s apply equal-frequency binning to our
Humidity variable, creating three bins: Low, Medium, and High.
We chose this method for Humidity because it ensures we have
an equal number of observations in each category, which can
be helpful if humidity values are not evenly distributed across
their range.
Syntax
# 2. Equal-Frequency Binning for Humidity
df['HumidityBinned'] = pd.qcut(df['Humidity'], q=3,
                               labels=['Low', 'Medium', 'High'])
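Unlike pd.cut, pd.qcut places the edges at quantiles rather
than at equal distances, so each bin should hold about five of
the fifteen rows. A quick check:

# Quantile-based edges: each bin gets ~len(df)/3 observations
binned, edges = pd.qcut(df['Humidity'], q=3,
                        labels=['Low', 'Medium', 'High'], retbins=True)
print(edges)                  # edges sit at the 0%, 33%, 67%, 100% quantiles
print(binned.value_counts())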
Method 3: Custom Binning
• Custom binning allows you to define your own bin edges
based on domain knowledge or specific requirements.
• Common Data Type: This method is ideal when you
have specific thresholds that are meaningful in your
domain or when you want to focus on particular ranges
of values.
• In our Case: Let’s apply custom binning to our Rainfall
Amount. We chose this method for this column because
there are standardized categories for rain that are more
meaningful than arbitrary divisions.
Syntax
# 3. Custom Binning for RainfallAmount
df['RainfallAmountBinned'] = pd.cut(df['RainfallAmount'],
                                    bins=[-np.inf, 2, 4, 12, np.inf],
                                    labels=['No Rain', 'Drizzle', 'Rain',
                                            'Heavy Rain'])
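The -np.inf and np.inf guard edges make sure every value is
captured. With right-closed intervals (the pandas default),
amounts of 0–2 map to 'No Rain', 2–4 to 'Drizzle', 4–12 to
'Rain', and anything above 12 to 'Heavy Rain'. A quick check
of the result:

# Tally how the custom thresholds split the 15 days
print(df.groupby('RainfallAmountBinned', observed=True)['RainfallAmount']
        .agg(['min', 'max', 'count']))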
Method 4: Logarithmic Binning
• Logarithmic binning creates bins that grow exponentially in
size. The method basically applies log transformation first then
performs equal-width binning.
• Common Data Type: This method is particularly useful for
data that spans several orders of magnitude or follows a power
law distribution.
• In our Case: Let’s apply logarithmic binning to our Wind
Speed variable. We chose this method for Wind Speed because
the effect of wind on a golf ball’s trajectory might not be linear.
A change from 0 to 5 mph might be more significant than a
change from 20 to 25 mph.
Syntax
# 4. Logarithmic Binning for WindSpeed
df['WindSpeedBinned'] = pd.cut(np.log1p(df['WindSpeed']), bins=3,
                               labels=['Light', 'Moderate', 'Strong'])
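Because the cut happens in log space, the edges are evenly
spaced in log1p(x) but grow exponentially in the original
units; np.expm1 converts them back. A small verification sketch:

# Recover the bin edges in the original wind-speed units
_, log_edges = pd.cut(np.log1p(df['WindSpeed']), bins=3, retbins=True)
print(np.expm1(log_edges))  # exponentially widening edges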
Method 5: Standard Deviation-based
Binning
• Standard Deviation based binning creates bins based on
the number of standard deviations away from the mean.
This approach is useful when working with normally
distributed data or when you want to bin data based on
how far values deviate from the central tendency.
• Variations: The exact number of standard deviations
used for binning can be adjusted based on the specific
needs of the analysis. The number of bins is typically
odd (to have a central bin). Some implementations
might use unequal bin widths, with narrower bins near
the mean and wider bins in the tails.
• Common Data Type: This method is well-suited for
data that follows a normal distribution or when you
want to identify outliers and understand the spread of
your data. May not be suitable for highly skewed
distributions.
• In our Case: Let’s apply this binning method to our
Temperature variable. We chose this method for
Temperature because it allows us to categorize
temperatures based on how they deviate from the
average, which can be particularly useful in
understanding weather patterns or climate trends.
Syntax
# 5. Standard Deviation-Based Binning for Temperature
mean_temp, std_dev = df['Temperature'].mean(), df['Temperature'].std()
bin_edges = [
    float('-inf'),  # Ensure all values are captured
    mean_temp - 2.5 * std_dev,
    mean_temp - 1.5 * std_dev,
    mean_temp - 0.5 * std_dev,
    mean_temp + 0.5 * std_dev,
    mean_temp + 1.5 * std_dev,
    mean_temp + 2.5 * std_dev,
    float('inf')  # Ensure all values are captured
]
df['TemperatureBinned'] = pd.cut(df['Temperature'], bins=bin_edges,
                                 labels=['Very Low', 'Low', 'Below Avg',
                                         'Average', 'Above Avg', 'High',
                                         'Very High'])
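Printing the edges next to the mean and standard deviation
makes the scheme concrete; with only 15 temperatures, most
rows should land in the central bins. A quick check:

# Show where the ±0.5, ±1.5, ±2.5 sigma edges fall for this sample
print(f"mean={mean_temp:.1f}, std={std_dev:.1f}")
print([round(edge, 1) for edge in bin_edges[1:-1]])
print(df['TemperatureBinned'].value_counts())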
Method 6: K Means Binning
• K-Means binning uses the K-Means clustering algorithm to create
bins. It groups data points into clusters based on how similar the data
points are to each other, with each cluster becoming a bin.
• Common Data Type: This method is great for finding groups in data
that might not be obvious at first. It works well with data that has one
peak or several peaks, and it can adjust to the way the data is
organized.
• In our Case: Let’s apply K-Means binning to our Crowdedness
variable. We chose this method for Crowdedness because it might
reveal natural groupings in how busy the golf course gets, which
could be influenced by various factors not captured by simple
threshold-based binning.
Syntax
# 6. K-Means Binning for Crowdedness
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(df[['Crowdedness']])

# KMeans labels are arbitrary, so rank the clusters by their
# centres before mapping them to the ordered categories
order = kmeans.cluster_centers_.ravel().argsort().argsort()
df['CrowdednessBinned'] = pd.Categorical.from_codes(
    order[kmeans.labels_], categories=['Low', 'Medium', 'High'])
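Inspecting the cluster centres shows where the natural break
points fell; unlike the other methods, these boundaries come
from the data itself rather than from a formula. A quick check:

# Cluster centres, sorted low to high, act as bin representatives
print(sorted(kmeans.cluster_centers_.ravel()))
print(df['CrowdednessBinned'].value_counts())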
Conclusion
• We applied six different ways to ‘discretize’ the numbers in
our golf data, so the final dataset now looks like this:
# Print only the binned columns
binned_columns = [col for col in df.columns if col.endswith('Binned')]
print(df[binned_columns])
