DATA PRE-PROCESSING
in real-world applications, data can be inconsistent,
incomplete and/or noisy
▪ Errors can happen
▪ As a result, the prediction rate of the resulting models will be lower
Pre-processing produces better models, faster, because good data is a
prerequisite for producing effective models of any type
▪ Analyzing data that has not been carefully screened for
such problems can produce highly misleading results.
▪ Thus, the success of data mining projects heavily depends
on the quality of the prepared data.
▪ Data preparation is about constructing a dataset from one
or more data sources to be used for exploration and
modeling.
▪ Start with an initial dataset to get familiar with the data, to
discover first insights into the data, and to gain a good
understanding of any possible data quality issues.
Data cleaning attempts to:
▪ Fill in missing values
▪ Smooth out noisy data
▪ Correct inconsistencies
▪ Ignore the tuple with missing values;
▪ Fill in the missing values manually;
▪ Use a global constant to fill in missing values (NULL, unknown,
etc.);
▪ Use the attribute mean to fill in missing values of that
attribute;
▪ Use the attribute mean of all samples belonging to the same
class to fill in the missing values;
▪ Infer the most probable value to fill in the missing value.
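A minimal sketch of several of these strategies in pandas; the DataFrame and its "class" and "price" columns are made up for illustration:

```python
import pandas as pd

# Hypothetical data with missing prices.
df = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "price": [8.0, None, 15.0, None],
})

# Ignore the tuples with missing values.
dropped = df.dropna(subset=["price"])

# Fill with a global constant.
const_filled = df["price"].fillna(-1)

# Fill with the attribute mean.
mean_filled = df["price"].fillna(df["price"].mean())

# Fill with the attribute mean of samples in the same class.
class_filled = df.groupby("class")["price"].transform(lambda s: s.fillna(s.mean()))
```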
▪ The purpose of data smoothing is to eliminate noise.
▪ This can be done by:
✓ Binning
✓ Clustering
✓ Regression
▪ Binning smooths the data by consulting the value’s
neighborhood.
▪ It aims to remove the noise from the data set by:
[1] smoothing the data by equal-frequency bins;
[2] smoothing by bin means;
[3] smoothing by bin boundaries.
Unsorted data for price in dollars:
8, 16, 9, 15, 21, 24, 30, 26, 27, 30, 34, 21
STEP 1: Sort the data:
8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Smooth the data by equal-frequency bins:
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34
STEP 1: Sorted data: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Smooth the data by bin means:
For Bin 1: (8 + 9 + 15 + 16)/4 = 12 → Bin 1: 12, 12, 12, 12
For Bin 2: (21 + 21 + 24 + 26)/4 = 23 → Bin 2: 23, 23, 23, 23
For Bin 3: (27 + 30 + 30 + 34)/4 = 30.25 → Bin 3: 30.25, 30.25, 30.25, 30.25
STEP 1: Sorted data: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Smooth the data by bin boundaries:
▪ Pick the MIN and MAX value of each bin
▪ Put the MIN on the left side and the MAX on the right side
▪ Each middle value moves to whichever boundary (MIN or MAX) is at
the smaller distance
Bin 1: 8, 8, 16, 16
Bin 2: 21, 21, 26, 26
Bin 3: 27, 27, 27, 34
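A minimal sketch of the smoothing methods on the price data above, assuming equal-frequency bins of size 4 and breaking boundary ties toward the minimum:

```python
# Sort the data and split it into equal-frequency bins of 4 values.
data = sorted([8, 16, 9, 15, 21, 24, 30, 26, 27, 30, 34, 21])
bins = [data[i:i + 4] for i in range(0, len(data), 4)]

# Smoothing by bin means: replace every value with its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value with the closer
# of the bin's min and max (ties go to the minimum here).
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(bins)       # [[8, 9, 15, 16], [21, 21, 24, 26], [27, 30, 30, 34]]
print(by_means)   # [[12.0, ...], [23.0, ...], [30.25, ...]]
print(by_bounds)  # [[8, 8, 16, 16], [21, 21, 26, 26], [27, 27, 27, 34]]
```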
▪ Data is organized into groups of “similar” values.
▪ Rare values that fall outside these groups are considered outliers
and are discarded.
▪ Data regression consists of fitting the data to a function.
▪ A linear regression, for instance, finds the line that best fits two
variables so that one variable can predict the other.
▪ More variables can be involved in a multiple linear regression.
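A minimal sketch of regression-based smoothing with NumPy; the x and y values are made up for illustration:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # noisy observations

# Fit the least-squares line y = a*x + b ...
a, b = np.polyfit(x, y, deg=1)

# ... and replace the noisy y values with the fitted ones.
smoothed = a * x + b
```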
Data analysis may require combining data from multiple
sources into a coherent data store.
There are many challenges:
▪ Schema integration:
CID » C_number » Cust-id » cust#
▪ Semantic heterogeneity
▪ Data value conflicts (different representations or scales,
etc.)
There are many challenges:
▪ Redundant records
▪ Redundant attributes
(an attribute is redundant if it can be derived from other attributes)
▪ Correlation analysis: P(A∧B)/(P(A)P(B)),
where 1 means independent, >1 positive correlation, and <1 negative
correlation.
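A minimal sketch of this correlation measure for two binary attributes; the attribute values are made up for illustration:

```python
def lift(a, b):
    """P(A and B) / (P(A) * P(B)) over two 0/1 attribute columns."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x and y for x, y in zip(a, b)) / n
    return p_ab / (p_a * p_b)

a = [1, 1, 0, 1, 0, 1]
b = [1, 0, 0, 1, 0, 1]
print(lift(a, b))   # 1.5 here: > 1 suggests positive correlation
```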
▪ Data is sometimes in a form not appropriate for
mining.
▪ Either the algorithm at hand cannot handle it, the
form of the data is not regular, or the data itself is not
specific enough.
▪ Normalization
(to compare carrots with carrots)
▪ Smoothing
▪ Aggregation
(summary operation applied to data)
▪ Generalization
(low-level data is replaced with higher-level data via a concept
hierarchy)
Min-max normalization: linear transformation from v to v’:
▪ v’ = (v − min)/(max − min) × (new_max − new_min) + new_min
▪ Example:
transform $30,000 in [10,000..45,000] into [0..1] → (30 − 10)/35 × (1 − 0) + 0 = 0.571
Z-score normalization:
▪ normalizes v into v’ based on the attribute’s mean and standard deviation
▪ v’ = (v − mean)/standard_deviation
Normalization by decimal scaling:
▪ moves the decimal point of v by j positions, where j is the minimum
number of positions needed to make the maximum absolute value
fall below 1: v’ = v/10^j
▪ Example:
if v ranges between −56 and 9976, j = 4 →
v’ ranges between −0.0056 and 0.9976
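A minimal sketch of the three normalization methods with NumPy, on made-up values; for decimal scaling, j is computed so that the maximum absolute value falls below 1:

```python
import math
import numpy as np

v = np.array([10000.0, 30000.0, 45000.0])

# Min-max normalization into [new_min, new_max].
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization.
zscore = (v - v.mean()) / v.std()

# Decimal scaling: smallest j with max(|v'|) < 1.
j = math.ceil(math.log10(np.abs(v).max() + 1))
decimal = v / 10 ** j

print(minmax)   # [0.     0.5714 1.    ] -- matches the 0.571 example
```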
▪ The data is often too large.
▪ Reducing the data can improve performance.
▪ Data reduction consists of reducing the representation of
the data set while producing the same (or almost the same)
results.
Data reduction includes:
▪ Data cube aggregation
▪ Dimension reduction
▪ Data compression
▪ Discretization
▪ Numerosity reduction
▪ Reduce the data to the concept level needed in the analysis.
▪ Queries regarding aggregated information should be answered
using the data cube when possible.
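A minimal sketch of aggregating to the concept level needed, using a pandas group-by; the sales table and its column names are made up for illustration:

```python
import pandas as pd

# Hypothetical per-quarter sales.
sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2"],
    "amount":  [100, 120, 90, 130, 110, 125],
})

# If the analysis only needs yearly figures, keep one row per year.
yearly = sales.groupby("year")["amount"].sum()
```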
▪ Feature selection (i.e., attribute subset selection)
▪ Use heuristics: select the locally ‘best’ (or most pertinent)
attributes
▪ Decision tree induction
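A minimal sketch of one such heuristic (a swapped-in illustration, not the decision-tree method itself): rank attributes by their absolute correlation with the target and keep the top k. The data and column names are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5],
    "f2": [5, 3, 4, 1, 2],
    "f3": [2, 2, 3, 3, 3],
    "y":  [1, 2, 3, 4, 5],
})

# Score each attribute against the target and keep the k 'best'.
k = 2
scores = df.drop(columns="y").corrwith(df["y"]).abs()
best = scores.nlargest(k).index.tolist()
```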
Data compression reduces the size of data.
▪ saves storage space.
▪ saves communication time.
Data compression is beneficial if data mining algorithms can
manipulate compressed data directly without
uncompressing it.
Parametric:
▪ Regression (a model or function estimating the distribution is
kept instead of the data)
Non-parametric:
▪ Histograms
▪ Clustering
▪ Sampling
A popular data reduction technique:
▪ Divide data into buckets and store a representation of each
bucket (sum, count, etc.)
Bucket boundaries can be chosen by:
▪ Equi-width
▪ Equi-depth
▪ V-Optimal
▪ MaxDiff
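A minimal sketch of equi-width and equi-depth buckets with NumPy, reusing the price data from the binning example:

```python
import numpy as np

data = np.sort(np.array([8, 16, 9, 15, 21, 24, 30, 26, 27, 30, 34, 21]))

# Equi-width: 3 buckets of equal value range; store counts per bucket.
counts, edges = np.histogram(data, bins=3)

# Equi-depth: 3 buckets with (roughly) equal numbers of values;
# store a small summary (min, max, count, sum) per bucket.
depth_buckets = np.array_split(data, 3)
summaries = [(b.min(), b.max(), len(b), b.sum()) for b in depth_buckets]
```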
▪ Partition data into clusters based on “closeness” in space.
▪ Retain representatives of clusters (centroids) and outliers.
▪ Effectiveness depends on the distribution of the data.
▪ Hierarchical clustering is possible (multi-resolution).
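A minimal sketch of clustering-based reduction with scikit-learn's KMeans; the 2-D points are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.9],
                   [5.0, 5.0], [5.1, 4.8],
                   [9.0, 1.0], [8.9, 1.2]])

# Partition into 3 clusters and keep only the centroids
# as representatives of the full data set.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
representatives = km.cluster_centers_
```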
Allows a large data set to be represented by a much smaller
random sample (sub-set) of the data.
▪ Simple random sample without replacement (SRSWOR)
▪ Simple random sample with replacement (SRSWR)
▪ Cluster sample (SRSWOR or SRSWR from clusters)
▪ Stratified sample
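A minimal sketch of SRSWOR, SRSWR, and a stratified sample with pandas; the "group" column used for stratification is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"group": ["A"] * 6 + ["B"] * 4, "value": range(10)})

# SRSWOR: each row can be drawn at most once.
srswor = df.sample(n=5, replace=False, random_state=0)

# SRSWR: rows can be drawn repeatedly.
srswr = df.sample(n=5, replace=True, random_state=0)

# Stratified sample: the same fraction from every group.
stratified = df.groupby("group", group_keys=False).apply(
    lambda g: g.sample(frac=0.5, random_state=0))
```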
▪ Discretization is used to reduce the number of values for
a given continuous attribute, by dividing the range of the
attribute into intervals.
▪ Discretization can reduce the data set, and can also be used
to generate concept hierarchies automatically.
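A minimal sketch of interval discretization with pandas (equal-width via cut, equal-frequency via qcut), reusing the price data; the interval labels are made up:

```python
import pandas as pd

prices = pd.Series([8, 16, 9, 15, 21, 24, 30, 26, 27, 30, 34, 21])

# Divide the attribute's range into 3 equal-width intervals.
equal_width = pd.cut(prices, bins=3, labels=["low", "mid", "high"])

# Divide into 3 equal-frequency intervals instead.
equal_freq = pd.qcut(prices, q=3, labels=["low", "mid", "high"])
```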