M2 PPT
In the Previous Class,
• We discussed various types of data, with examples
In this Class,
• We focus on data pre-processing – "an important
milestone of the data mining process"
Data analysis pipeline
Mining is not the only step in the analysis process
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and
represented using a finite number of digits.
Data Preprocessing
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers
(outliers = exceptions!), and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains a reduced representation that is much smaller in volume but produces
the same or similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for
numerical data
Forms of data preprocessing
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission errors
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (assuming the tasks
in classification)—not effective when the percentage of missing values per
attribute varies considerably.
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
• Use the attribute mean for all samples belonging to the same class to fill in the
missing value: smarter
• Use the most probable value to fill in the missing value: inference-based such
as Bayesian formula or decision tree
How to Handle Missing Data?
Fill in missing values using aggregate functions (e.g., the average) or
probabilistic estimates based on the global value distribution
E.g., put the average income here, or put the most probable income based on the fact
that the person is 39 years old
E.g., put the most frequent religion here
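A minimal sketch of these fill-in strategies in Python (assuming pandas is available); the DataFrame, the cls/income columns, and the sentinel value are made up for illustration:

```python
import pandas as pd

# Hypothetical dataset with missing income values (NaN)
df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "B"],
    "income": [30_000, None, 52_000, None, 48_000],
})

# Strategy 1: ignore the tuple (drop rows whose income is missing)
dropped = df.dropna(subset=["income"])

# Strategy 2: fill with a global constant ("unknown" sentinel)
constant = df["income"].fillna(-1)

# Strategy 3: fill with the attribute mean of samples in the same class
by_class = df.groupby("cls")["income"].transform(lambda s: s.fillna(s.mean()))
print(by_class.tolist())  # B's missing income becomes mean(52000, 48000) = 50000
```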
Data Quality
Equal-depth binning
Divides the range into N intervals,
each containing approximately the same number of records
Skewed data is also handled well
Simple Methods: Binning
Data Quality: Handle Noise (Binning)
Example: Sorted price values 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
• Partition into three (equi-depth) bins
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
• Smoothing by bin means
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
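The same example can be reproduced with a short pure-Python sketch (the bin count and the rounding of bin means follow the slide):

```python
def equi_depth_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of (approximately) equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the closer of the bin's min/max boundary."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```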
Data Quality: Handle Noise (Regression)
• Replace noisy or missing values by
predicted values
• Requires a model of attribute
dependencies (may be wrong!)
• Can be used for data smoothing or
for handling missing data
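A sketch of regression-based smoothing: fit a simple linear model of one attribute on another and use the predictions to fill a gap (the x/y values are made up; numpy assumed available):

```python
import numpy as np

# Hypothetical data: y depends roughly linearly on x, with one missing value
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, np.nan, 8.2, 9.8])

mask = ~np.isnan(y)
slope, intercept = np.polyfit(x[mask], y[mask], deg=1)  # fit y ~ slope*x + intercept

y_smoothed = slope * x + intercept               # model predictions for every x
y_filled = np.where(np.isnan(y), y_smoothed, y)  # fill only the missing entry
print(y_filled)
```

If the assumed linear dependency is wrong, the predictions (and thus the filled values) will be wrong too, which is the caveat on the slide.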
Data Integration
The process of combining multiple sources into a single dataset. The data
integration process is one of the main components in data management.
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: integrate metadata from different sources
metadata: data about the data (i.e., data descriptors)
Entity identification problem: identify real-world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts: for the same real-world
entity, attribute values from different sources are different
(e.g., J. D. Smith and John Smith may refer to the same person)
possible reasons: different representations, different scales,
e.g., metric vs. British units (inches vs. cm)
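A tiny sketch of resolving such a scale conflict before merging: normalize heights recorded in inches (one source) to centimetres (the other source's convention). The source records and field names are hypothetical:

```python
INCH_TO_CM = 2.54

def to_cm(height, unit):
    """Normalize a height value to centimetres before integration."""
    return height * INCH_TO_CM if unit == "in" else height

# The same real-world entity, recorded at different scales in two sources
source_a = {"cust_id": 17, "height": 70,    "unit": "in"}
source_b = {"cust_no": 17, "height": 177.8, "unit": "cm"}

print(to_cm(source_a["height"], source_a["unit"]))  # 177.8, the values now agree
```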
Handling Redundant Data in Data Integration
Centralization:
• Based on fitting a distribution to the data
• Distance function between distributions
• KL distance (Kullback-Leibler divergence)
• Mean centering
Data Transformation: Normalization
Example: Data Transformation
- Assume, min and max value for height and weight.
- Now, apply Min-Max normalization to both attributes as given
follow
(1) (5.9 ft, 50 Kg)
(2) (4.6 ft, 55 Kg)
Vs.
(1) (5.9 ft, 50 Kg)
(2) (5.6 ft, 56 Kg)
- Compare your results...
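The slide leaves the attribute ranges unspecified; a minimal sketch using the usual min-max formula v' = (v - min)/(max - min) * (new_max - new_min) + new_min, assuming height in [4.0, 7.0] ft and weight in [40, 100] kg:

```python
def min_max(v, v_min, v_max, new_min=0.0, new_max=1.0):
    """Min-max normalization: map v from [v_min, v_max] to [new_min, new_max]."""
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

# Assumed ranges (not given on the slide): height 4.0-7.0 ft, weight 40-100 kg
for height, weight in [(5.9, 50), (4.6, 55), (5.6, 56)]:
    print(round(min_max(height, 4.0, 7.0), 3), round(min_max(weight, 40, 100), 3))
```

After normalization both attributes lie in [0, 1], so neither dominates a distance computation merely because of its original scale.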
Data Transformation: Aggregation
Combining two or more attributes (or objects) into a
single attribute (or object)
Purpose
Data reduction
Reduce the number of attributes or objects
Change of scale
Cities aggregated into regions, states, countries, etc.
More “stable” data
Aggregated data tends to have less variability
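A small pandas sketch of aggregation as a change of scale, with made-up sales figures: many city rows collapse into one row per region:

```python
import pandas as pd

# Hypothetical city-level data
sales = pd.DataFrame({
    "city":   ["Pune", "Mumbai", "Delhi", "Agra"],
    "region": ["West", "West", "North", "North"],
    "sales":  [120, 340, 200, 80],
})

# Aggregation: fewer objects (data reduction) and more stable totals
by_region = sales.groupby("region", as_index=False)["sales"].sum()
print(by_region)
```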
Data Transformation: Discretization
[Figure: "Original Data" vs. "Approximated" (lossy discretization)]
Histograms
• A popular data reduction technique
• Divide data into buckets and store the average (or sum) for each bucket
• Can be constructed optimally in one dimension using dynamic programming
• Related to quantization problems.
[Figure: example histogram over attribute values 10000-90000]
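A sketch of histogram-based reduction with numpy: bucket the values (equal-width buckets assumed) and keep only one average per bucket instead of the raw data:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.integers(10_000, 90_000, size=1_000)  # made-up attribute values

counts, edges = np.histogram(values, bins=8)       # equal-width buckets
# Assign each value to its bucket (clip so the maximum lands in the last bucket)
idx = np.clip(np.digitize(values, edges) - 1, 0, len(edges) - 2)
# Reduced representation: 8 bucket averages instead of 1000 raw values
bucket_means = [values[idx == b].mean() for b in range(len(edges) - 1)]
print(counts)
print([round(m) for m in bucket_means])
```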
Histogram types
• Equal-width histograms:
• It divides the range into N intervals of equal size
• Equal-depth (frequency) partitioning:
• It divides the range into N intervals, each containing approximately the same
number of samples
• V-optimal:
• It considers all histogram types for a given number of buckets and chooses the one
with the least variance.
• MaxDiff:
• After sorting the data to be approximated, it defines the borders of the buckets at
points where the adjacent values have the maximum difference
• Example: split 1,1,4,5,5,7,9,14,16,18,27,30,30,32 to three buckets
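Working the example with a short MaxDiff sketch: with B buckets, cut at the B-1 largest gaps between adjacent sorted values (here the gaps 18 to 27 and 9 to 14):

```python
def maxdiff_buckets(sorted_vals, n_buckets):
    """Place bucket borders at the n_buckets-1 largest adjacent differences."""
    gaps = [(sorted_vals[i + 1] - sorted_vals[i], i)
            for i in range(len(sorted_vals) - 1)]
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[:n_buckets - 1])
    buckets, start = [], 0
    for c in cuts:
        buckets.append(sorted_vals[start:c + 1])
        start = c + 1
    buckets.append(sorted_vals[start:])
    return buckets

data = [1, 1, 4, 5, 5, 7, 9, 14, 16, 18, 27, 30, 30, 32]
print(maxdiff_buckets(data, 3))
# [[1, 1, 4, 5, 5, 7, 9], [14, 16, 18], [27, 30, 30, 32]]
```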
Clustering
• Partitions the data set into clusters, and models it by one representative
from each cluster
[Figure: data points plotted against age, grouped into clusters, with one outlier]
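One representative per cluster can be computed with k-means; a sketch assuming scikit-learn is available (the age values, including the outlier, are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D "age" attribute with one obvious outlier
ages = np.array([22, 24, 25, 41, 43, 44, 90]).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ages)
# Reduced representation: each point replaced by its cluster centroid
representatives = km.cluster_centers_[km.labels_]
print(representatives.ravel())
```

A single far-away point like 90 can grab its own cluster or drag a centroid, which is why clustering-based reduction also exposes outliers.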
Hierarchical Reduction
• Use multi-resolution structure with different degrees of reduction
• Hierarchical clustering is often performed but tends to define
partitions of data sets rather than “clusters”
• Parametric methods are usually not amenable to hierarchical
representation
• Hierarchical aggregation
• An index tree hierarchically divides a data set into partitions by value range
of some attributes
• Each partition can be considered as a bucket
• Thus an index tree with aggregates stored at each node is a hierarchical
histogram
Discretization
• Three types of attributes:
• Nominal — values from an unordered set
• Ordinal — values from an ordered set
• Continuous — real numbers
• Discretization:
• divide the range of a continuous attribute into intervals
• why?
• Some classification algorithms only accept categorical
attributes.
• Reduce data size by discretization
• Prepare for further analysis
Discretization and Concept hierarchy
• Discretization
• reduce the number of values for a given continuous attribute
by dividing the range of the attribute into intervals. Interval
labels can then be used to replace actual data values.
• Concept hierarchies
• reduce the data by collecting and replacing low level concepts
(such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior).
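A minimal sketch of climbing such a concept hierarchy for age; the cut points 30 and 60 are assumed for illustration, not taken from the slides:

```python
def age_concept(age):
    """Map a numeric age to a higher-level concept label."""
    if age < 30:       # assumed cut point
        return "young"
    if age < 60:       # assumed cut point
        return "middle-aged"
    return "senior"

print([age_concept(a) for a in (21, 39, 64)])  # ['young', 'middle-aged', 'senior']
```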
Discretization and concept hierarchy
generation for numeric data
• Binning/Smoothing
• Histogram analysis
• Clustering analysis
• Entropy-based discretization
Entropy:
Entropy(S) = - Σ_i p_i log2(p_i), where p_i is the proportion of class i in S.
Entropy-based discretization recursively picks the split boundary that
minimizes the weighted entropy of the resulting partitions.
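A sketch of one split step of entropy-based discretization, using the definition above: scan the candidate boundaries (midpoints between adjacent sorted values) and keep the one that minimizes the weighted entropy of the two partitions. The labelled values are made up:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(pairs):
    """Return (weighted entropy, boundary) of the best binary split."""
    pairs = sorted(pairs)                    # (value, class_label), sorted by value
    n, best = len(pairs), None
    for i in range(1, n):
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        boundary = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or info < best[0]:
            best = (info, boundary)
    return best

# Toy data: low values are mostly class 'a', high values mostly class 'b'
data = [(1, "a"), (2, "a"), (3, "a"), (7, "b"), (8, "b"), (9, "a")]
print(best_split(data))  # boundary 5.0 separates the classes best
```

Applied recursively to each partition (with a stopping criterion such as MDL), this yields the interval boundaries used for discretization.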