Data Preprocessing
Agenda
• What and Why data preprocessing?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
What is Data Preprocessing?
It is a data mining technique that transforms raw data into an
understandable format.
Why Data Preprocessing?
• Data in the real world is dirty
– Incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
– Incorrect/erroneous: the collection instrument may be faulty, or a
mandatory personal-information field may contain wrong data
– Noisy: containing errors or outliers
– Inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality
◻ A well-accepted multidimensional view:
– Accuracy: How well does a piece of information reflect reality?
– Completeness: Does it fulfill your expectations of what is comprehensive?
– Consistency: Does information stored in one place match relevant data stored elsewhere?
– Timeliness: Is your information available when you need it?
– Believability: How much of the data is trusted by users?
– Interpretability: How easily is the data understood?
– Accessibility: Where does the data reside and how can it be retrieved?
– Value added: Does the stored data add value to the mining process?
Broad categories:
intrinsic, contextual, representational, and accessibility.
Forms of data preprocessing
Agenda
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
1. Data Cleaning
• Data Cleaning Tasks
1.1- Fill in missing values
1.2- Identify outliers and smooth out noisy data
1.3- Correct inconsistent data
Data Cleaning: Missing Data
• Data is often not available: e.g., many tuples have no recorded value for
several attributes, such as customer income in sales data
• Missing data may be due to:
– information not collected (e.g., people decline to give their age and weight)
– attributes not applicable to all cases (e.g., annual income is not
applicable to children)
– equipment malfunction
– data inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data not considered important at the time of entry
– no record of the history or changes of the data
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing
(assuming the task is classification—not effective in certain cases)
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a
new class?! simple but not foolproof.
• Use the central tendency (mean/median) to fill in the missing
value. Normal->Mean, Skewed->Median
• Use the most probable value to fill in the missing value:
inference-based methods such as regression, the Bayesian formula, or decision trees
• Note: any filled-in value may bias the data
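As a minimal sketch of some of these options (the column names "income" and "category" are invented for illustration), the following pandas snippet fills a numeric attribute with its mean or median and a nominal attribute with a global constant:

```python
# Illustrative sketch, not from the slides: filling missing values with pandas.
import pandas as pd

df = pd.DataFrame({
    "income":   [50_000, None, 42_000, None, 61_000],  # numeric attribute
    "category": ["A", "B", None, "B", "A"],             # nominal attribute
})

# Central tendency: mean for roughly normal data, median for skewed data.
df["income_mean"]   = df["income"].fillna(df["income"].mean())
df["income_median"] = df["income"].fillna(df["income"].median())

# Global constant for a nominal attribute (simple but not foolproof).
df["category"] = df["category"].fillna("unknown")

print(df)
```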
Types of Missing Values
Why is missing data a problem?
Ans: It creates bias in the data, because we do not know whether the data is
missing randomly, was simply missed out, or was omitted intentionally.
*Biased data reduces predictive power and trustworthiness.
• Missing completely at random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR)
Missing Completely at Random (MCAR) (Types of Missing Values…)
Assumption: If a person has missing data then it is completely unrelated
to the other information in the data. The missingness on the variable is
completely unsystematic.
– Missingness of a value is independent of attributes
– Fill in values based on the attribute
– Analysis may be unbiased overall
Example: when we take a random sample of a population, where
each member has the same chance of being included in the sample.
When data is missing completely at random, it means that we can undertake analyses
using only observations that have complete data (provided we have enough of such
observations).
Missing at Random (MAR) Types of Missing Values…
– Missingness is related to other variables
– Fill in values based on other observed values
– Almost always produces a bias in the analysis
Example of MAR is when we take a sample from a population, where
the probability to be included depends on some known property.
A simple predictive model is that
income can be predicted based on
gender and age. Looking at the table,
we note that our missing value is for a
Female aged 30 or more, and
observations say the other females
aged 30 or more have a High income.
As a result, we can predict that the
missing value should be High.
There is a systematic relationship between the propensity of values to be missing
and the observed data. All that is required is a probabilistic relationship.
Missing not at Random (MNAR) - Nonignorable
Types of Missing Values…
– Missingness is related to unobserved measurements and they are
not random
– The missing values are related to the values of that variable itself,
even after controlling for other variables.
MNAR means that the probability of being missing varies for
reasons that are unknown to us.
Example: when smoking status is deliberately (not randomly) left unrecorded for
patients admitted as an emergency, and such patients are more likely to
have worse outcomes from surgery.
Strategies to handle MNAR are to find more data about the causes for the
missingness, or to perform what-if analyses to see how sensitive the results are under
various scenarios.
Data Cleaning: Identify outliers and smooth out
noisy data
• Noise
Random error or variance in a measured variable.
Or simply meaningless data that can’t be interpreted by machines.
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in the naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
1.2.1 Binning method for data Smoothing:
– first sort data and partition it into (equi-depth) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
– used also for discretization (discussed later)
1.2.2 Clustering
– detect and remove outliers
1.2.3 Regression
– smooth by fitting the data into regression functions
1.2.1.Simple Discretization Method: Binning
• Data smoothing refers to a statistical approach of eliminating noise
and outliers from datasets to make the patterns more noticeable
• Binning or bucketing is used to smooth the data. It smooths a
sorted data value by consulting its “neighborhood,” i.e., the values
around it.
• The sorted values are distributed into a number of “buckets,” or
bins, of equal width or equal frequency
• Number of bins or buckets ≈ square root of the number of data points
• Bin depth: number of objects/elements in a single bin; bin width: the range of values a bin covers
• There are two methods of dividing data into bins:
– Equal Width Binning   – Equal Frequency Binning
1.2.1 Binning Methods
For the sorted data set: 0, 5, 14, 15, 17, 18, 22, 25, 27
No. of bins = √9 = 3; bin depth = 9/3 = 3

Equal Width Binning
• Bins have equal width, with the range of each bin defined as
[min, min + w], (min + w, min + 2w], …, (min + (n−1)w, max]
where w = (max − min) / (no. of bins) = (27 − 0)/3 = 9
• How do we use that width of 9 to make the bins?
1. 0 + 9 = 9 (from 0 to 9) → Bin 1: 0, 5
2. 9 + 9 = 18 (from 9+ to 18) → Bin 2: 14, 15, 17, 18
3. 18 + 9 = 27 (from 18+ to 27) → Bin 3: 22, 25, 27

Equal Frequency Binning
• Make bins with equal frequency/depth, i.e., 9/3 = 3 elements per bin
• Bin 1: 0, 5, 14
• Bin 2: 15, 17, 18
• Bin 3: 22, 25, 27
Types of Smoothing in Equal Frequency Bins and Equal Width Bins
• Smoothing by Mean
• Smoothing by Median
• Smoothing by Boundaries
Smoothing the data by Equal Frequency Bins
Step-1 Sort the data in ascending order:
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Step-2: Number of Bins = √(no. of data points) = √12 ≈ 3
Step-3 Partition into equal frequency (equi-depth) of 3 bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing the data by Equal Frequency Bins contd..
1. Smoothing by BIN MEANS: find the mean value of
each bin and replace all values in the bin with that mean
- Bin 1: 9, 9, 9, 9
- Bin 2: 22.75, 22.75, 22.75, 22.75
- Bin 3: 29.25, 29.25, 29.25, 29.25
2. Smoothing by BIN MEDIANS: Find the median
values of each bin and Replace all with the median
- Bin 1: 8.5, 8.5, 8.5, 8.5
- Bin 2: 22.5, 22.5, 22.5, 22.5
- Bin 3: 28.5, 28.5, 28.5, 28.5
3. Smoothing by BIN BOUNDARIES: the min and max of each bin are
the bin boundaries, and every other element is replaced by
the closest boundary value
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
Smoothing the data by Equal Width Bins
Step-1 Sort the data in ascending order:
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Step-2: Number of Bins = √(no. of data points) = √12 ≈ 3
Step-3: Partition into equal width (equi-width) bins:
[min, min + w], (min + w, min + 2w], …, (min + (n−1)w, max]
where w = (max − min) / (no. of bins)
= (34 − 4)/3 = 10
Partition into equal width of 3 bins:
- Bin 1: 4, 8, 9 (4 to14)
- Bin 2: 15, 21, 21, 24 (14+ to 24)
- Bin 3: 25, 26, 28, 29, 34 (24+ to 34)
Smoothing the data by Equal Width Bins contd..
1. Smoothing by BIN MEANS: Find the mean values of
each bin and Replace all with mean values
- Bin 1: 7, 7, 7
- Bin 2: 20.25, 20.25, 20.25, 20.25
- Bin 3: 28.4, 28.4, 28.4, 28.4, 28.4
2. Smoothing by BIN MEDIANS: Find the median values
of each bin and Replace all with the median
- Bin 1: 8, 8, 8
- Bin 2: 21, 21, 21, 21
- Bin 3: 28, 28, 28, 28, 28
3. Smoothing by BIN BOUNDARIES: the min and max of each bin are the bin
boundaries, and every other element is replaced by the closest boundary
value
- Bin 1: 4, 9, 9
- Bin 2: 15, 24, 24, 24 (14+ to 24)
- Bin 3: 25, 25, 25, 25, 34 (24+ to 34)
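As a rough sketch of the binning examples above, the following NumPy snippet (assuming 3 bins, i.e., ≈ √12) partitions the same 12 sorted values into equal-frequency and equal-width bins and applies the three smoothing variants:

```python
# Sketch of the binning/smoothing examples on the slide data.
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
n_bins = 3

# Equal-frequency (equi-depth) bins: 12 / 3 = 4 values per bin.
eq_freq = np.array_split(data, n_bins)

# Equal-width bins: width w = (max - min) / n_bins = 10; cut points 14 and 24.
w = (data.max() - data.min()) / n_bins
edges = data.min() + w * np.arange(1, n_bins)
eq_width = np.split(data, np.searchsorted(data, edges, side="right"))

def smooth(bins, how):
    out = []
    for b in bins:
        if how == "mean":
            out.append(np.full_like(b, np.mean(b), dtype=float))
        elif how == "median":
            out.append(np.full_like(b, np.median(b), dtype=float))
        else:  # boundaries: replace each value by the closer of bin min / bin max
            lo, hi = b.min(), b.max()
            out.append(np.where(b - lo <= hi - b, lo, hi))
    return out

print(smooth(eq_freq, "mean"))      # means 9, 22.75, 29.25
print(smooth(eq_width, "boundary")) # e.g. first bin becomes [4, 9, 9]
```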
Data Cleaning: How to Handle Inconsistent
Data?
• Manual correction using external references
• Semi-automatic using various tools
– To detect violations of known functional dependencies
and data constraints
– To correct redundant data
Agenda
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
2. Data Integration and Transformation
Data Integration
– It is a preprocessing method that combines data from multiple
sources into a coherent store, e.g., a data warehouse
• Issues to be Considered
a. Entity Identification Problem (schema integration and
object matching): How can equivalent real-world entities from
multiple data sources be matched up?
Ex. customer_id in one database and cust_number in another
Solution: Metadata can be helpful to avoid such errors in schema integration
b. Redundancy (Correlation Analysis)
Ex. Age and DoB in the same schema,
Solution: Remove the unnecessary attribute if it can be derived from another.
c. Detecting and resolving data value conflicts
For the same real-world entity, attribute values from different sources are different.
Possible reasons (Ex.): different representations, different scales, e.g., metric vs.
British units, different currency
Solution: Correctly modify the values
Data Integration: Handling Redundant Data
• Redundant data occur often when integrating multiple DBs
– The same attribute may have different names in different databases or one
attribute may be a “derived” attribute in another table, e.g., annual revenue, age
• Redundant data may be detected by correlational analysis
– Given two attributes, correlation analysis can measure how strongly one
attribute implies another, based on available data.
■ Correlation coefficient for numerical data values: Can be computed by
Pearson’s product moment coefficient named after Karl Pearson
■ Chi-square test for categorical or discrete data values: A chi-square test is
a statistical test used to compare observed results with expected results
Data Integration: Handling Redundant Data ...
Correlation Analysis (Nominal Data)
χ² (chi-square) test:

χ² = Σ_i Σ_j (o_ij − e_ij)² / e_ij                     Eq.(1)

where o_ij is the observed frequency (i.e., actual count) of the joint
event (A_i, B_j) and e_ij is the expected frequency of (A_i, B_j), which can be
computed as

e_ij = (count(A = a_i) × count(B = b_j)) / n           Eq.(2)

where n is the number of data tuples, count(A = a_i) is the number of tuples
having value a_i for A, and count(B = b_j) is the number of tuples having
value b_j for B.
The larger the value, the more likely the variables are correlated.
Data Integration: Handling Redundant Data ...
Chi-Square Calculation: An Example
• Suppose that a group of 1500 people was surveyed. The gender of each person
was noted. Each person was polled as to whether his or her preferred type of
reading material was fiction or nonfiction.
• Thus, we have two attributes, gender and preferred reading. The observed
frequency (or count) of each possible joint event is summarized in the contingency
table.
• The expected frequencies are calculated based on the data distribution for both
attributes using Eq. (2).
• The expected frequency for the cell (male, fiction) is e = (count(male) × count(fiction)) / n = (300 × 450) / 1500 = 90.
• Similarly, calculate the expected frequency for each cell of the contingency table.
Data Integration: Handling Redundant Data ...
Chi-Square Calculation: An Example
                              Male                    Female                Sum (row)
                         Observed  Expected      Observed  Expected
Like science fiction        250        90            200       360            450
Not like science fiction     50       210           1000       840           1050
Sum (col.)                  300                     1200                     1500
• χ² (chi-square) calculation (the expected counts shown above are computed from
the data distribution in the two categories):
χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840
   = 284.44 + 121.90 + 71.11 + 30.48 = 507.93
• Degree of freedom = (row-1) * (col-1). For this 2 × 2 table, the degrees of freedom are (2 −
1)(2 − 1) = 1. For 1 degree of freedom, the χ2 value needed to reject the hypothesis at the
0.001 significance level is 10.828 (taken from the table of upper percentage points of the χ2
distribution, typically available from any textbook on statistics).
• Since the computed value (507.93) is greater than 10.828, we can reject the hypothesis
that the attributes are independent. Hence, preferred reading and gender are (strongly) correlated.
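As a quick check of this example, a SciPy sketch is shown below; correction=False disables the Yates continuity correction so the result matches the hand calculation:

```python
# Sketch of the chi-square example using SciPy.
from scipy.stats import chi2_contingency

observed = [[250, 200],    # like science fiction:     male, female
            [50, 1000]]    # not like science fiction: male, female

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~507.93
print(dof)       # 1
print(expected)  # [[ 90. 360.] [210. 840.]]
```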
Data Integration: Handling Redundant Data ...
Correlation Analysis (Numeric Data)
• For numeric attributes, we can evaluate the correlation between two
attributes, A and B, by computing the correlation coefficient (also
known as Pearson’s product moment coefficient, named after its
inventor, Karl Pearson).
r_{A,B} = Σ_i (a_i − Ā)(b_i − B̄) / (n σ_A σ_B) = (Σ_i a_i b_i − n Ā B̄) / (n σ_A σ_B)

• where n is the number of tuples, a_i and b_i are the respective values
of A and B in tuple i, Ā and B̄ are the respective mean values of A
and B, and σ_A and σ_B are the respective standard deviations of A and B.
Data Integration: Handling Redundant Data ...
Correlation Analysis (Numeric Data) contd...
• Note that −1 ≤ rA,B ≤ +1. If rA,B is greater than 0, then A and B are
positively correlated, meaning that the values of A increase as the
values of B increase.
• The higher the value, the stronger the correlation (i.e., the more each
attribute implies the other). Hence, a higher value may indicate that
A (or B) may be removed as a redundancy.
• If the resulting value is equal to 0, then A and B are independent and
there is no correlation between them.
• If the resulting value is less than 0, then A and B are negatively
correlated, where the values of one attribute increase as the values of
the other attribute decrease.
• Scatter plots can also be used to view correlations between attributes.
Data Integration: Handling Redundant Data ...
Correlation Analysis (Numeric Data)
Example:
Consider the stock prices of AllElectronics (6, 5, 4, 3, 2) and
HighTech (20, 10, 14, 5, 5) over five time points.
Find out whether the AllElectronics and HighTech prices are correlated
or not.
Data Integration: Handling Redundant Data ...
Correlation Analysis (Numeric Data)
Step-1: Calculate the mean of the attribute AllElectronics (A) and
HighTech (B).
Mean(A)=(6+5+4+3+2)/5=4
Mean(B)=(20+10+14+5+5)/5=10.8
Step-2: calculate the standard deviation of both attributes A and B i.e.
1.414 and 5.706 respectively
Step-3: Calculate the correlation value using the equation above:
r_{A,B} = (Σ_i a_i b_i − n Ā B̄) / (n σ_A σ_B) = (251 − 5 × 4 × 10.8) / (5 × 1.414 × 5.706) ≈ 0.87
Step-4: The calculated value is greater than 0; hence, the two stocks are positively correlated.
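A sketch reproducing this computation, assuming the five price pairs implied by the mean calculations in Step-1:

```python
# Sketch of the stock-price correlation example (A = AllElectronics, B = HighTech).
import numpy as np

A = np.array([6, 5, 4, 3, 2])
B = np.array([20, 10, 14, 5, 5])

# Manual computation with population standard deviations (as in the slides).
r = np.sum((A - A.mean()) * (B - B.mean())) / (len(A) * A.std() * B.std())
print(round(r, 2))                        # ~0.87 -> positively correlated

# Same result from NumPy's built-in Pearson correlation matrix.
print(round(np.corrcoef(A, B)[0, 1], 2))
```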
Correlation coefficient
A measure of the strength of linear association between two variables.
Correlation always lies between −1.0 and +1.0.
Correlation can be:
a. Positive Correlation
b. Negative Correlation
c. Null Correlation
a. Positive Correlation:
Correlation in the same direction is called positive correlation: if
one variable increases, the other also increases, and if one variable
decreases, the other also decreases. For example, the length of an iron
bar will increase as the temperature increases.
b. Negative Correlation:
Correlation in the opposite direction is called negative correlation:
as one variable increases, the other decreases, and vice versa.
For example, the volume of a gas decreases as the pressure increases, or
the demand for a particular commodity increases as the price of that
commodity decreases.
Data Integration: Handling Redundant Data ...
Correlation coefficient contd..
c. No Correlation or Zero Correlation:
If there is no relationship between the two variables, i.e., one variable
changes while the other shows no systematic change, it is called
no or zero correlation.
2. Data Integration and Transformation
Data Transformation
It maps the values of an attribute to a new set of replacement
values so that each old value can be identified with one of the new values.
Strategies for data transformation include:
• Smoothing: remove noise from data (binning, clustering, regression)
• Aggregation: summarization process is applied, data cube construction
• Generalization/Concept hierarchy climbing: Attributes can be
generalized to higher-level concept
• Normalization: scaled to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction: New attributes are constructed from the
given ones to help the mining process
• Discretization: Raw values will be replaced by interval labels/ conceptual labels
Data Transformation: Data Aggregation
• Data aggregation is any process in which data is brought together and
conveyed in a summary form. It is typically used prior to the performance
of a statistical analysis.
• Combining two or more attributes (or objects) into a single attribute
(or object).
• Data aggregation generally works on big data or data marts whose individual
records do not provide enough information value on their own.
Aggregation with mathematical functions:
• Sum -Adds together all the specified data to get a total.
• Average -Computes the average value of the specific data.
• Max -Displays the highest value for each category.
• Min -Displays the lowest value for each category.
• Count -Counts the total number of data entries for each category.
Data can also be aggregated by date, allowing trends to be shown over a
period of years, quarters, months, etc.
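As a small illustration (the figures and column names below are invented), pandas can apply these aggregation functions per category or per date:

```python
# Illustrative sketch: aggregating sales data with pandas.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [200, 250, 230, 300, 260, 280, 240, 310],
})

# Sum, average, max, min and count per year (aggregation by date/category).
summary = sales.groupby("year")["amount"].agg(["sum", "mean", "max", "min", "count"])
print(summary)
```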
Data Transformation: Data Normalization
Data normalization makes data easier to classify and understand. It is used
to scale the data of an attribute so that it falls in a smaller range
Need of Normalization?
• Normalization is generally required when multiple attributes are present but their
values are on different scales; this may otherwise lead to poor data models when
performing data mining operations.
• Otherwise, an equally important attribute (on a lower scale) may be diluted in
effectiveness because another attribute has values on a larger scale.
• Heterogeneous data with different units usually needs to be normalized. If the data
share the same unit and the same order of magnitude, normalization might not be
necessary.
• Unless normalized during pre-processing, variables with disparate ranges or varying
precision acquire different driving values.
Data Transformation: Data Normalization contd..
Example: (chart for raw data vs. chart for normalized data)
Data Transformation: Data Normalization contd..
Methods of Data Normalization:
a. Decimal Scaling
b. Min-Max Normalization
c. z-Score Normalization(zero-mean Normalization)
There are also several normalization approaches used in deep
learning models:
⚫ Batch Normalization
⚫ Layer Normalization
⚫ Group Normalization
⚫ Instance Normalization
⚫ Weight Normalization
Data Transformation: Data Normalization contd..
a. Decimal Scaling Normalization
- It normalizes by moving the decimal point of the values of the data.
- To normalize the data by this technique, we divide each data value by 10^j,
where 10^j is the smallest power of 10 greater than the maximum absolute value
in the data set.
- A data value v_i is normalized to v'_i using the formula
v'_i = v_i / 10^j
where j is the smallest integer such that max(|v'_i|) < 1.
In this technique, the result is scaled by dividing by a power of 10, i.e., pow(10, j).
Example:
- Normalize the input data is: - 15, 121, 201, 421, 561, 601, 850
- Step 1: The maximum absolute value in the given data is 850; the smallest power
of 10 greater than 850 is 1000, so j = 3
- Step 2: Divide each value by 1000 (i.e., 10^3)
- Result: The normalized data is: - 0.015, 0.121, 0.201, 0.421, 0.561, 0.601, 0.85
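A sketch of this computation, where j is derived automatically from the maximum absolute value:

```python
# Sketch of decimal-scaling normalization for the example above.
import numpy as np

v = np.array([15, 121, 201, 421, 561, 601, 850])

# Smallest j such that max(|v|) / 10**j < 1 (here j = 3).
j = int(np.ceil(np.log10(np.max(np.abs(v)) + 1)))
v_scaled = v / 10**j
print(j, v_scaled)   # 3 [0.015 0.121 0.201 0.421 0.561 0.601 0.85 ]
```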
Data Transformation: Data Normalization contd..
b. Min-Max Normalization (Linear Transformation)
- The minimum and maximum values of the data are fetched and each value is
replaced according to the following formula:
v' = ((v − min(A)) / (max(A) − min(A))) × (new_max(A) − new_min(A)) + new_min(A)
Where - A is the attribute (column)
- v and v' are the old and new value of each entry in the data
- min(A), max(A) are the minimum and maximum of A
- new_min(A), new_max(A) are the min and max values of the
required range (i.e., boundary values), respectively.
Example
Input:- 10, 15, 50, 60
Normalized to range 0 to 1.
Here min=10, max= 60, new_min=0, new_max=1
Output:- 0, 0.1, 0.8, 1
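A minimal sketch of this min-max example, assuming the new range [0, 1]:

```python
# Sketch of min-max normalization for the example above.
import numpy as np

v = np.array([10, 15, 50, 60], dtype=float)
new_min, new_max = 0.0, 1.0

v_norm = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min
print(v_norm)   # [0.  0.1 0.8 1. ]
```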
Data Transformation: Data Normalization contd..
c. z-Score Normalization (zero-mean Normalization)
- Values are normalized based on the mean and standard deviation of the attribute A.
- It is also called the standard-deviation method.
- A value v of A is normalized to v' using the z-score formula
v' = (v − Ā) / S
where - Ā is the mean of A
- S is the standard deviation of A
- v and v' are the old and new value of each data entry
Input:- 10, 15, 50, 60 (mean = 33.75, sample standard deviation S ≈ 24.96)
Output:- −0.9515, −0.7512, 0.6510, 1.0517
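A sketch of the same computation, assuming the sample standard deviation (ddof=1), which reproduces the signs and magnitudes above:

```python
# Sketch of z-score normalization for the example above.
import numpy as np

v = np.array([10, 15, 50, 60], dtype=float)
v_norm = (v - v.mean()) / v.std(ddof=1)   # sample standard deviation, ~24.96
print(np.round(v_norm, 2))                # [-0.95 -0.75  0.65  1.05]
```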
Agenda
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
3. Data Reduction
• Problem:
Data Warehouse may store terabytes of data:
Complex data analysis/mining may take a very
long time to run on the complete data set
• Solution?
– Data reduction…
3. Data Reduction contd...
Obtains a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical results
Integrity of the original data should be maintained even after the
reduction in data volume.
It should produce the same (or almost the same) analytic results as the original data.
Data reduction strategies
A. Data cube aggregation (applied to data cube)
B. Attribute Subset Selection(irrelevant attributes detected & removed)
C. Dimensionality reduction
D. Data compression
E. Numerosity reduction
F. Discretization and concept hierarchy generation
Data Reduction: Data Cube Aggregation
It is a process in which information is gathered and expressed in
a summary form
For example, consider a company’s sales per quarter for the years 2018 to 2022.
If the problem is to get the annual sales per year, then the sales per quarter
must be aggregated for each year. In this way, aggregation provides the required
data, which is much smaller in size, and thereby we achieve data reduction
without losing the information needed.
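As a small cube-style sketch (the sales figures are invented), a pandas pivot table aggregates along two dimensions, and margins=True adds the grand total, which plays the role of an apex-like summary:

```python
# Sketch of cube-style aggregation with a pandas pivot table.
import pandas as pd

sales = pd.DataFrame({
    "year":   [2021, 2021, 2022, 2022],
    "item":   ["TV", "Phone", "TV", "Phone"],
    "amount": [400, 350, 420, 390],
})

cube = pd.pivot_table(sales, values="amount", index="year",
                      columns="item", aggfunc="sum", margins=True)
print(cube)   # per year/item cells plus row/column totals ("All")
```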
Data Reduction: Data Cube Aggregation
• Data cubes store multidimensional aggregated information.
• Data cubes provide fast access to precomputed, summarized data, thereby
benefiting online analytical processing as well as data mining.
• Base cuboid: – The cube created at the lowest level of abstraction is
referred to as the base cuboid. Example: The base cuboid should
correspond to an individual entity of interest, such as sales and customers.
• Apex cuboid: A cube at the highest level of abstraction is the apex
cuboid. Example: For the sales data, the apex cuboid would give one total:-
the total sales
Base Cuboid Vs Apex Cuboid
Data Reduction: Attribute subset Selection
• It is the way to reduce the dimensionality of data through the use of
Feature selection.
• It aims to discover a minimum set of attributes such that the resulting
probability distribution of the data classes is as close as possible to the
original distribution obtained using all attributes.
• Attribute subset selection decreases the data set size by eliminating
irrelevant or redundant attributes (or dimensions).
• Redundant attributes
– Duplicate much or all of the information contained in one or more
other attributes
– E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
– Contain no information that is useful for the data mining task at hand
– E.g. students' ID is irrelevant to the task of predicting students' CGPA
Data Reduction: Attribute subset Selection...
How can we find a ‘good’ subset of the original attributes?
• For n attributes, there are 2^n possible subsets.
• An exhaustive search for the optimal subset of attributes can be
expensive, especially as n increases (brute-force method).
• Thus heuristic methods (Greedy method) that explore a reduced
search space are commonly used for attribute subset selection.
Heuristic methods:
i. Step-wise forward selection
ii. Step-wise backward elimination
iii. Combining forward selection and backward elimination
iv. Decision-tree induction
Data Reduction: Attribute subset Selection
i. Stepwise Forward Selection:
• The procedure starts with an empty set of attributes as the reduced set.
• First: The best single-feature is picked.
• Next: At each subsequent iteration or step, the best of the remaining
original attributes is added to the set.
Data Reduction: Attribute subset Selection
ii. Stepwise Backward Elimination:
• The procedure starts with the full set of attributes.
• At each step, it removes the worst attribute remaining in the set.
iii. Combining forward selection and backward elimination
• The stepwise forward selection and backward elimination methods
can be combined
• At each step, the procedure selects the best attribute and removes the
worst from among the remaining attributes
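As one off-the-shelf greedy approximation of these stepwise ideas (not the exact procedure on the slide), scikit-learn's SequentialFeatureSelector supports both directions:

```python
# Sketch of greedy stepwise attribute selection with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

forward = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                    direction="forward").fit(X, y)
backward = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                     direction="backward").fit(X, y)

print(forward.get_support())    # boolean mask of the selected attributes
print(backward.get_support())
```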
Data Reduction: Attribute subset Selection
iv. Decision Tree Induction
• Decision tree induction (Classification Algorithm) constructs
a flowchart-like or tree like structure from given data where
each internal node denotes a test on an attribute, each branch
corresponds to an outcome of the test and each external node
denotes a class prediction.
• At each node, the algorithm chooses the “best” attribute to
partition the data into individual classes.
• All attributes that do not appear in the tree are assumed to be
irrelevant.
Data Reduction: Attribute subset Selection
iv. Decision Tree Induction contd...
• Nonleaf nodes: tests
• Branches: outcomes of tests
• Leaf nodes: class prediction
Initial attribute set:
{A1, A2, A3, A4, A5, A6}
> Reduced attribute set: {A1, A4, A6}
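A minimal sketch of this idea, assuming a scikit-learn decision tree: attributes that never appear in the induced tree (importance 0) are treated as irrelevant and dropped.

```python
# Sketch of decision-tree-based attribute subset selection.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

used = np.flatnonzero(tree.feature_importances_ > 0)
print(used)    # indices of the attributes actually tested in the tree
```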
Data Reduction: Dimensionality Reduction
The number of input features, variables, or columns present in a
given dataset is known as dimensionality, and the process to reduce
these features is called dimensionality reduction.
Curse of Dimensionality:
• Model: a dataset is fed to the training phase; the training phase studies
the features and outputs a model that can recognize or interpret similar
kinds of objects.
Following are different models developed from datasets with different numbers
of features but aiming at the same goal:

Model            M1   M2   M3   M4   M5   M6    M7
No. of features   2    4    7   10   15   100   150

Up to a threshold number of features, the accuracy of the model increases;
beyond the threshold, the accuracy decreases.
Data Reduction: Dimensionality Reduction
Example: Curse of Dimensionality:
Let a cricket ball be given to the training phase; it studies the object
with 4 features: Shape (sphere), Eatable (No), Play (Yes), Color (Red).

Object         Shape    Eatable   Play   Color
Cricket ball   Sphere   No        Yes    Red

Problem of Overfitting:
When too many attributes/features are fed to the training phase, a
confusing/erroneous model will be generated.
Here the Color dimension is the irrelevant dimension.
The threshold number of dimensions for the above model is 3.
Data Reduction: Dimensionality Reduction
• Dimensionality reduction is a method of converting high-dimensional
variables into lower-dimensional variables while preserving the essential
information in the variables.
• It represents the original data in a compressed or reduced form by
applying data encoding or transformation.
• In the process of compression, the resultant data can be:
Lossless - if the original data can be reconstructed from the compressed data
without any loss of information.
Lossy - if we can reconstruct only an approximation of the original data.
Popular methods of Lossy dimensionality reduction are
i. Discrete Wavelet Transform (DWT) (Sparse matrix created)
ii. Principal component Analysis (PCA) (Combines the essence of
attributes by creating an alternative, smaller set of variables)
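A small PCA sketch of lossy dimensionality reduction: 4 attributes are projected onto 2 principal components, and the reconstruction is only an approximation of the original data.

```python
# Sketch of lossy dimensionality reduction with PCA.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)

X_reduced = pca.fit_transform(X)             # compressed representation
X_approx = pca.inverse_transform(X_reduced)  # only an approximation (lossy)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_)         # variance kept by each component
```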
Data Reduction: Data Compression
• Data compression employs modification, encoding, and converting
the structure of data in a way that consumes less space.
• It involves building a compact representation of information by
removing redundancy and representing data in binary form.
• Compression from which the data can be restored exactly to its original
form is called lossless compression. In contrast, when it is not possible
to restore the original form from the compressed form, it is
lossy compression.
Data Reduction: Data Compression
• Data compression techniques reduce the size of files using
different encoding mechanisms. Based on their compression
techniques they can be divided into two types:
1. Lossless Compression: Encoding techniques like Run Length
Encoding and Huffman Encoding allow a simple and minimal
reduction in data size. Lossless data compression uses algorithms to
restore the precise original data from the compressed data.
2. Lossy Compression: In lossy-data compression, the decompressed
data may differ from the original data but are useful enough to
retrieve information from them. For example, the JPEG image
format is a lossy compression, but we can find the meaning
equivalent to the original image. Methods such as the Discrete
Wavelet transform technique and PCA (principal component
analysis) are examples of this compression.
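A tiny round-trip sketch of lossless compression using Python's standard zlib module:

```python
# Sketch of lossless compression: zlib restores the exact original bytes.
import zlib

original = b"data preprocessing " * 100
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "->", len(compressed))   # far fewer bytes
print(restored == original)                   # True: no information lost
```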
Data Reduction: Numerosity Reduction
Numerosity reduction replaces the original data volume with alternative,
smaller data representations. These techniques fall into two types:
parametric and non-parametric numerosity reduction.
i. Parametric methods
– Assume data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
– E.g.: Regression Model: Linear, Multiple, Log-linear regression
ii. Non-parametric methods
– Do not assume models. This technique results in a more uniform
reduction, irrespective of data size, but it may not achieve a high
volume of data reduction like the parametric.
– E.g.: Histograms, clustering, sampling
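A sketch contrasting the two styles on synthetic data (values invented): a parametric fit keeps only two model parameters, while a histogram and a random sample are non-parametric reduced representations.

```python
# Sketch of parametric vs. non-parametric numerosity reduction.
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1000)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, size=x.size)

# Parametric: keep only the fitted slope and intercept of a linear model.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)

# Non-parametric: a 10-bin histogram and a 5% random sample.
counts, edges = np.histogram(y, bins=10)
sample = rng.choice(y, size=50, replace=False)
print(counts)
print(sample[:5])
```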
Agenda
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
4. Discretization and concept hierarchy generation
Discretization techniques: It is the process to reduce the number of
values for a given continuous attribute, by dividing the attribute into a
range of intervals. Interval value labels can be used to replace actual
data values.
Concept hierarchies: It reduces the data by collecting and replacing
low-level concepts (such as numeric values for the attribute age) by
higher-level concepts (such as young, middle-aged, or senior).
This leads to a concise, easy-to use, knowledge-level representation of
mining results.
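As a small sketch of both ideas (the cut points for the age hierarchy are invented, while the labels young / middle-aged / senior follow the text above):

```python
# Sketch of discretization and a simple concept hierarchy for age.
import pandas as pd

age = pd.Series([13, 22, 25, 33, 41, 47, 52, 60, 68, 75])

# Equal-width intervals: interval labels replace the raw values.
print(pd.cut(age, bins=3))

# Equal-frequency intervals.
print(pd.qcut(age, q=3))

# Concept hierarchy: map numeric ages to higher-level concepts.
print(pd.cut(age, bins=[0, 30, 60, 120],
             labels=["young", "middle-aged", "senior"]))
```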
4. Discretization and concept hierarchy generation contd..
These are recursive methods where a large amount of time is spent on sorting
the data at each step. The smaller the number of distinct values to sort, the faster
these methods can be.
Discretization techniques can be categorized based on whether class
information is used, as follows:
• Supervised Discretization - this discretization process uses class information.
• Unsupervised Discretization - this discretization process does not use class
information.
Discretization techniques can also be categorized based on the direction in
which they proceed:
• Top-down approach
• Bottom-up approach
4. Discretization and concept hierarchy generation contd..
Top-down Discretization (Splitting) -
If the process starts by first finding one or a few points called split points or
cut points to split the entire attribute range and then repeat this recursively
on the resulting intervals.
Bottom-up Discretization (Merging) -
Starts by considering all of the continuous values as potential split-points.
Removes some by merging neighborhood values to form intervals, and then
recursively applies this process to the resulting intervals.
4. Discretization and concept hierarchy generation contd..
Many discretization techniques can be applied recursively to provide a
hierarchical or multiresolution partitioning of the attribute values
known as concept hierarchy.
A concept hierarchy for a given numeric attribute defines a
discretization of the attribute.
Concept hierarchies can be used to reduce the data by replacing
low-level concepts (such as numeric values for the attribute age) with
higher-level concepts (such as young, middle-aged, or senior).
Although detail is lost by such generalization, the generalized data are
more meaningful and easier to interpret.
4. Discretization and concept hierarchy generation contd..
Five methods for discretization & concept hierarchy for numeric
data are defined:
i. Binning
ii. Histogram
iii. Cluster Analysis
iv. Decision Tree/Entropy-Based Discretization
v. Correlation Analysis (Chi square)
4. Discretization and concept hierarchy generation contd..
Four methods of Concept hierarchy generation for non-numeric
(categorical/nominal) data
Nominal attributes have a finite (but possibly large) number of distinct
values, with no natural ordering to partition on. For such attributes, a
concept hierarchy is used to transform the data into multiple levels of
granularity. Following are the four methods:
i. Specification of a partial ordering of attributes explicitly at the
schema level by users or experts
• Users/experts can easily define a concept hierarchy by
partial/total ordering of the schema attribute at the schema level.
• At the schema level, a hierarchy can be defined by specifying
the ordering among the attribute as:
street < city < state < country
4. Discretization and concept hierarchy generation contd..
ii. Specification of a portion of a hierarchy by explicit data grouping
• This is the manual definition of a portion of the concept
hierarchy.
• In a large database, explicit groupings can be made on an attribute
(for example, grouping the states of a country) to define a small
portion of intermediate-level data:
{Gujarat, Maharashtra, Goa, ...} → Western India
{Assam, Manipur} → Eastern India
4. Discretization and concept hierarchy generation contd..
iii. Specification of a set of attributes, but not of their partial ordering
country 15 distinct values
province_or_ state 65 distinct values
city 3567 distinct values
street 674,339 distinct values
• Based on the observation that high concept level (country) attribute
usually contain a smaller number of distinct value than lower concept
level (street) attribute, a concept hierarchy can be generated based on
the number of distinct values per attribute in the given attribute set.
• The attribute with the most distinct values is placed at the lowest level of the
hierarchy, and so on.
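A minimal sketch of this heuristic, assuming an invented location table: attributes are ordered by their number of distinct values, from fewest (top of the hierarchy) to most (bottom).

```python
# Sketch of hierarchy ordering by distinct-value counts.
import pandas as pd

location = pd.DataFrame({
    "country": ["India", "India", "India", "India", "USA"],
    "state":   ["Odisha", "Odisha", "Odisha", "Goa", "Texas"],
    "city":    ["Cuttack", "Cuttack", "Bhubaneswar", "Panaji", "Austin"],
    "street":  ["St 1", "St 2", "St 3", "St 4", "St 5"],
})

hierarchy = location.nunique().sort_values()   # country < state < city < street
print(" < ".join(hierarchy.index))
```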
4. Discretization and concept hierarchy generation contd..
iv. Specification of only a partial set of attributes
• At the time of schema design, users sometimes carelessly include only a
small subset of the relevant attributes (a partial set) in the hierarchy
specification.
e.g. including street & city in hierarchical attribute set location
• Solution: Embed data semantics in the database schema => attributes
with tight semantic connection can be pinned together.
Agenda
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Summary
• Data preparation is a big issue for both warehousing
and mining
• Data preparation includes
– Data cleaning and data integration
– Data reduction and feature selection
– Discretization
• Many methods have been developed, but data preprocessing is still an
active area of research