Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Data Cleaning
Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., faulty instruments,
human or computer errors, transmission errors
incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
How to handle missing data
Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with:
a global constant: e.g., “unknown”, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based such as Bayesian formula or decision tree
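As a hedged illustration of these automatic fill-in strategies, the short pandas sketch below uses a made-up table whose "class" and "income" columns are purely hypothetical.

```python
import pandas as pd

# Toy data with missing values; the column names and numbers are made up.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 30.0, None, 40.0],
})

# Global constant: fill with a sentinel (a string such as "unknown" also works).
filled_const = df["income"].fillna(-1)

# Attribute mean over all tuples.
filled_mean = df["income"].fillna(df["income"].mean())

# Attribute mean for all samples belonging to the same class (smarter).
filled_class_mean = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(filled_const.tolist(), filled_mean.tolist(), filled_class_mean.tolist())
```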
How to handle noisy data
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, bin medians, bin boundaries, etc. (see the sketch after this list)
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with possible outliers)
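The sketch below illustrates the binning idea with pandas; the nine sorted price values are invented, qcut gives the equal-frequency partition, and smoothing is done by bin means.

```python
import pandas as pd

# Sorted toy values (e.g., prices); invented for illustration.
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into 3 equal-frequency (equal-depth) bins.
bins = pd.qcut(prices, q=3, labels=False)

# Smooth by bin means: replace each value by the mean of its bin.
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```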
Data integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different sources are
different
Possible reasons: different representations, different scales, e.g., metric
vs. British units
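A minimal pandas sketch of schema integration and value-conflict detection, assuming two hypothetical sources whose key columns are named cust_id and cust_no.

```python
import pandas as pd

# Two hypothetical sources describing the same customers under different schemas.
a = pd.DataFrame({"cust_id": [1, 2], "name": ["Bill Clinton", "Ada Lovelace"]})
b = pd.DataFrame({"cust_no": [1, 2], "name": ["William Clinton", "Ada Lovelace"]})

# Schema integration: treat A.cust_id and B.cust_no as the same key.
merged = a.merge(b, left_on="cust_id", right_on="cust_no", suffixes=("_a", "_b"))

# Value conflicts (e.g., "Bill Clinton" vs. "William Clinton") still need an
# entity-resolution step; here they are only flagged for human inspection.
conflicts = merged[merged["name_a"] != merged["name_b"]]
print(conflicts)
```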
Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple databases
Object identification: The same attribute or object may have different names in different
databases
Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis and covariance analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality
Correlation Analysis
χ² (chi-square) test:
χ² = Σ (Observed − Expected)² / Expected, summed over all cells of the contingency table
The larger the χ² value, the more likely the variables are related
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
Correlation does not imply causality
# of hospitals and # of car thefts in a city are correlated
Both are causally linked to a third variable: population
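A small scipy sketch of the χ² test; the 2×2 contingency table of observed counts below is hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of observed counts for two
# categorical variables (e.g., plays chess vs. likes science fiction).
observed = np.array([[250, 200],
                     [ 50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p_value)  # a large chi2 / tiny p-value suggests the variables are related
print(expected)       # cells far from these expected counts contribute most to chi2
```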
Correlation Analysis
Correlation coefficient (also called Pearson’s product moment coefficient)
r_A,B = Σ (a_i − Ā)(b_i − B̄) / ((n − 1) σ_A σ_B) = (Σ a_i·b_i − n·Ā·B̄) / ((n − 1) σ_A σ_B)
where the sums run over i = 1 to n, n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i·b_i is the sum of the AB cross-products.
If r_A,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
r_A,B = 0: independent; r_A,B < 0: negatively correlated
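The numpy sketch below evaluates the formula above directly and compares it with the library result; the two attribute vectors are made up.

```python
import numpy as np

# Two hypothetical numeric attributes A and B.
a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

n = len(a)
# Sample correlation, matching the slide formula ((n - 1) and sample std devs).
r = ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))

print(r)
print(np.corrcoef(a, b)[0, 1])  # library value, identical up to rounding
```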
Data Reduction
Data reduction: Obtain a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical results
Why data reduction? — A database/data warehouse may store terabytes of data. Complex
data analysis may take a very long time to run on the complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant attributes
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
Numerosity reduction (some simply call it: Data Reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
Data compression
Dimensionality reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and the distance between points, which are critical to clustering and outlier analysis, become less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
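As a brief, hedged illustration of one of these techniques, the scikit-learn sketch below projects a synthetic 100 x 10 data matrix onto its first two principal components.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 tuples with 10 numeric attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Keep the first 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```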
Min-max normalization: to [new_min_A, new_max_A]
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 ≈ 0.709
Z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μ_A) / σ_A
Ex. Let μ = 54,000, σ = 16,000. Then $73,000 is mapped to (73,000 − 54,000) / 16,000 ≈ 1.19
Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
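A numpy sketch of the three normalization methods, reusing the income figures from the examples above; computing j from log10 is an assumption about how the decimal-scaling exponent would be chosen.

```python
import numpy as np

# Income values, reusing the range and example value from the slides.
v = np.array([12_000.0, 73_000.0, 98_000.0])

# Min-max normalization to [0.0, 1.0].
min_a, max_a, new_min, new_max = 12_000.0, 98_000.0, 0.0, 1.0
minmax = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Z-score normalization with mu = 54,000 and sigma = 16,000.
zscore = (v - 54_000.0) / 16_000.0

# Decimal scaling: v' = v / 10^j with the smallest j giving max(|v'|) < 1
# (assumption: j derived as floor(log10(max |v|)) + 1).
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / 10 ** j

print(minmax)   # approx. [0.0, 0.709, 1.0]
print(zscore)   # [-2.625, 1.1875, 2.75]
print(decimal)  # [0.12, 0.73, 0.98]
```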
Aggregation
Combining two or more records into a single object
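A one-line pandas illustration of aggregation; the monthly sales table is hypothetical.

```python
import pandas as pd

# Hypothetical daily sales records, aggregated (combined) into monthly records.
daily = pd.DataFrame({"month": ["Jan", "Jan", "Feb", "Feb"],
                      "sales": [100, 150, 120, 130]})
monthly = daily.groupby("month", as_index=False)["sales"].sum()
print(monthly)  # one combined record per month
```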
Sampling Techniques
Simple random sampling
There is an equal probability of selecting any particular item
Sampling without replacement
Once an object is selected, it is removed from the population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the
data)
Used in conjunction with skewed data
Progressive Sampling
Starts with a very small sample and then increases the sample size until a sample of sufficient size is obtained
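A pandas sketch of the sampling schemes above; the 90/10 class split is invented to mimic skewed data.

```python
import pandas as pd

# Hypothetical data set with a skewed class attribute (90 "A", 10 "B").
df = pd.DataFrame({"cls": ["A"] * 90 + ["B"] * 10, "x": range(100)})

# Simple random sampling without replacement (SRSWOR).
srswor = df.sample(n=10, replace=False, random_state=0)

# Simple random sampling with replacement (SRSWR).
srswr = df.sample(n=10, replace=True, random_state=0)

# Stratified sampling: draw ~10% from each class partition.
stratified = df.groupby("cls", group_keys=False).sample(frac=0.1, random_state=0)
print(stratified["cls"].value_counts())  # 9 tuples of "A", 1 of "B"
```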
Dimensionality Reduction
Why is it needed?
Many algorithms work better with low-dimensional data
Allows better data visualization
The amount of processing time and memory required is reduced
Dimensionality reduction reduces dimensionality by creating new attributes that are combinations of the existing attributes.
The reduction of dimensionality by selecting new attributes that are a subset of the old attributes is known as feature subset selection.
Curse of Dimensionality
Feature Subset Selection
There are three standard methods for feature subset selection
Embedded Subset Selection
Filter approach
Wrapper Approach
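As a hedged sketch of the filter approach, the scikit-learn snippet below scores features independently of any classifier and keeps the two best on the Iris data; a wrapper approach would instead evaluate subsets with a target model (e.g., recursive feature elimination).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Filter approach: rank each feature with a univariate score (ANOVA F-test)
# and keep the k highest-scoring features.
X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
print(selector.get_support())           # boolean mask of the kept features
```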
Feature Extraction
Creating a new set of features from the original set of features is known as feature extraction.
Discretization and Binarization
Mapping continuous-valued attributes to categorical attributes is called discretization.
Mapping continuous-valued attributes to one or more binary attributes is called binarization.
Discretization of Continuous-Valued Attributes
Unsupervised Discretization
Equal Width intervals
Equal Depth Intervals
Supervised Discretization
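A closing pandas sketch of unsupervised discretization (equal-width and equal-depth) and a simple binarization; the age values and the 50-year threshold are arbitrary.

```python
import pandas as pd

# Hypothetical continuous attribute (ages).
ages = pd.Series([22, 25, 31, 35, 40, 47, 52, 61, 70])

# Unsupervised discretization: equal-width intervals.
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Unsupervised discretization: equal-depth (equal-frequency) intervals.
equal_depth = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# Binarization: map the continuous attribute to a single binary attribute.
is_senior = (ages >= 50).astype(int)

print(pd.DataFrame({"age": ages, "width": equal_width,
                    "depth": equal_depth, "senior": is_senior}))
```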