Data Mining
DATA PRE-PROCESSING
Instructor: Qurat-ul-Ain
quratulain.ssc@stmu.edu.pk
Background
Other names: also called data scrubbing or data cleansing.
More than data arranging: a DWH is NOT just about arranging data; the data must also be clean for the overall health of the organization.
Big problem, big effect: an enormous problem, because most real-world data is dirty.
Data duplication: the original problem was removing duplicates within one system; it is compounded by duplicates coming from many systems.
Three Classes of Anomalies
Syntactically Dirty Data
Lexical Errors
Irregularities
Semantically Dirty Data
Integrity Constraint Violation
Business rule contradiction
Duplication
Coverage Anomalies
Missing Attributes
Missing Records
Syntactically Dirty Data
1. Lexical Errors
Discrepancies between the structure of the data items and the specified format of the stored values, e.g. the number of columns used for a tuple is unexpected (mixed-up number of attributes).
Example: every tuple is expected to consist of 4 values, one for each attribute. Here, a missing ZIP value caused the HDATE value to shift into the ZIP column. This may occur when an employee's ZIP code is unavailable at the time the data is transferred to the data warehouse.
Syntactically Dirty Data
2. Irregularities
“irregularities are concerned with the non-uniform use
of values, units and abbreviations”
For example, giving only the annual salary without stating the currency, i.e. US$ or PK Rs.
Semantically Dirty Data
1. Integrity Constraint Violation
2. Contradiction (business rules contradicted, e.g. DoB later than hiring date)
3. Duplication
Duplicate anomalies are often caused during the process of
integrating different data sources into a single data
warehouse.
Coverage or lack of it
1. Missing Attribute
Missing attribute values are the most common failure in data warehouses.
They result from omissions while collecting the data.
Coverage or Lack of it
2. Missing Tuple
Missing tuples can occur in a data warehouse if some entities that exist in the real world are not represented by tuples in the database.
For example, suppose the table "Employee" consists of three tuples. The employee "Smith" is working for the company, but the corresponding tuple is missing from the table.
Why Coverage Anomalies?
Equipment malfunction (bar code reader, keyboard etc.)
Inconsistent with other recorded data and thus deleted.
Data not entered due to misunderstanding/illegibility.
Data not considered important at the time of entry (e.g.
Y2K).
Handling Missing Data
Dropping records.
“Manually” filling missing values.
Using a global constant as filler.
Using the attribute mean (or median) as filler.
Using the most probable value as filler.
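A minimal pandas sketch of these options (the employee table and its column names are made up for illustration):
```python
# Sketch (assumed example data): the common missing-value strategies with pandas.
import pandas as pd

df = pd.DataFrame({
    "EmpNo": [101, 102, 103, 104],
    "Salary": [50000, None, 62000, None],
    "ZIP": ["44000", "46000", None, "45710"],
})

dropped   = df.dropna()                                   # 1. drop records with missing values
constant  = df.fillna({"ZIP": "unknown"})                 # 2. global constant as filler
mean_fill = df.fillna({"Salary": df["Salary"].mean()})    # 3. attribute mean as filler
med_fill  = df.fillna({"Salary": df["Salary"].median()})  # 4. attribute median as filler
print(mean_fill)
```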
Data Preprocessing: An
Overview
No quality data, no quality results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect
or even misleading statistics.
Data warehouse needs consistent integration of quality
data
Data Preprocessing
Data Quality:
Accuracy
Completeness
Consistency
Problems:
Inaccuracy
Incompleteness
Inconsistency
Major Tasks in Data
Preprocessing
Data cleaning
Parsing, correcting, standardization, matching, consolidation
Dealing with missing data.
Dealing with incorrect and noisy data.
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases or files
The same concept may have different names in different sources, causing
inconsistencies and redundancies.
Some attributes may be inferred from others, e.g. annual revenue.
Major Tasks in Data
Preprocessing
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. The volume of data is reduced to make analysis easier.
Why data reduction?
A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on
the complete data set. Dimensionality reduction: Attribute
subset selection, attribute construction.
Data Reduction Strategies
There are three data reduction strategies
Dimensionality Reduction
Numerosity Reduction
Data Compression
Data Reduction Strategies
Numerosity reduction: data is replaced by an alternative, smaller representation (e.g. histograms, clusters, regression models). It can be of two types:
Parametric: only the model parameters (and possibly outliers) are stored instead of the actual data (regression and log-linear models).
Non-parametric: data is stored in the form of histograms, clusters, or samples.
Major Tasks in Data
Preprocessing
Data transformation and data discretization
Normalization: Scaling to a smaller range. E.g. [0-1]
E.g. Normalizing the age and salary to [0-1]
Without normalization, distance based calculations may
generate skewed results.
Concept hierarchy generation: Raw data values are
replaced by ranges.
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission errors
Incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data e.g.,
Occupation=“ ” (missing data)
Noisy: containing noise, errors, or outliers e.g., Salary=“−10”
(an error)
Inconsistent: containing discrepancies in codes or names,
e.g.,
Age=“42”, Birthday=“03/07/2010”
Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
certain data may not be considered important at the time
of entry
Missing data may need to be inferred
Manual Entry (mean /probable)
How to Handle Missing Data?
Ignore the tuple:
Usually done when class label is missing (when doing
classification)—not effective when the % of missing values
per attribute varies considerably
Fill in the missing value manually:
tedious + infeasible?
Fill in it automatically with
A global constant : e.g., “unknown”, a new class?
Use any central tendency measure:
Mean for symmetric
Median for skewed
How to Handle Missing Data?
The attribute mean for all samples belonging to the same
class:
For example, while filling values of annual income
according to credit risk, fill value from the incomes of
customers of same class.
The most probable value: inference-based such as
Bayesian formula or decision tree
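A small sketch of the class-conditional fill described above, using pandas with hypothetical column names (credit_risk, annual_income):
```python
# Sketch: fill missing annual income with the mean income of customers in the same
# credit-risk class (hypothetical column names and values).
import pandas as pd

df = pd.DataFrame({
    "credit_risk":   ["low", "low", "high", "high", "low"],
    "annual_income": [90000, None, 30000, None, 80000],
})

df["annual_income"] = df.groupby("credit_risk")["annual_income"] \
                        .transform(lambda s: s.fillna(s.mean()))
print(df)
```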
Noisy Data
Noise:
Random error in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
How to Handle Noisy Data?
Regression
Smooth by fitting the data into regression functions.
Finding the best line to fit two attributes so that one attribute can be used to predict the other's value (see the sketch at the end of this slide).
Clustering
Detect and remove outliers
Combined computer and human inspection
Detect suspicious values and check by human (e.g., deal
with possible outliers)
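A rough sketch of regression-based smoothing and outlier flagging with numpy; the data and the 2-sigma threshold are illustrative assumptions, not from the slides:
```python
# Sketch: smooth one attribute by fitting a regression line on another,
# then flag points that sit far from the line for human inspection.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)    # predictor attribute
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 30.0])    # noisy attribute (30.0 looks suspicious)

w, b = np.polyfit(x, y, deg=1)                   # best-fit line y ~ w*x + b
y_smooth = w * x + b                             # smoothed (fitted) values

residuals = np.abs(y - y_smooth)
suspects = x[residuals > 2 * residuals.std()]    # possible outliers to check manually
print(y_smooth.round(2), suspects)
```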
Dimensionality Reduction
Dimensionality reduction, e.g., remove unimportant attributes or reduce the number of input variables.
It represents the original data in a compressed or reduced form by applying data encoding and transformation.
An encoding mechanism is used to reduce the data size.
If the original data can be reconstructed from the compressed data without losing any information, the reduction is called lossless.
If the reconstructed data is only an approximation of the original data, it is called lossy.
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
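A minimal scikit-learn sketch of PCA as dimensionality reduction on synthetic data (the choice of 2 components is arbitrary, for illustration only):
```python
# Sketch: reduce 6 attributes to 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 6))   # 100 tuples, 6 attributes (synthetic)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                      # reduced representation, shape (100, 2)
print(X_reduced.shape, pca.explained_variance_ratio_)
```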
Feature/Attribute Selection
Feature selection is the method of reducing the input variables to your model by using only relevant data and getting rid of noise in the data.
It is the process of automatically choosing relevant features for your machine learning model based on the type of problem you are trying to solve. The following are feature selection methods (a small sketch follows the list):
Filter methods: information gain, chi-square test, correlation coefficient
Wrapper methods: recursive feature elimination, genetic algorithms
Embedded methods: decision trees
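The sketch referenced above: a filter method that ranks features with the chi-square test via scikit-learn's SelectKBest (the Iris data set and k=2 are just illustrative choices):
```python
# Sketch of a filter method: keep the k features with the highest chi-square scores.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)          # chi2 requires non-negative feature values
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)  # keep the 2 highest-scoring features
print(selector.scores_, X_selected.shape)
```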
Feature/Attribute Subset Selection
Remove redundant and irrelevant attributes
Redundant attributes
Duplicate much or all of the information contained in one or more
other attributes
E.g., purchase price of a product and the amount of sales tax paid
Irrelevant attributes
Contain no information that is useful for the data mining task at
hand
E.g., students' ID is often irrelevant to the task of predicting
students' GPA
Domain Expert
Heuristic Search
Heuristic Search in Attribute
Selection
There are 2^d possible attribute combinations of d attributes, so exhaustive search is not feasible.
Typical heuristic attribute selection methods:
Best single attribute under the attribute independence
assumption: choose by significance tests (in Chapter 7)
Best step-wise feature selection:
The best single attribute is picked first
Then the next best attribute conditioned on the first, ...
Step-wise attribute elimination:
Repeatedly eliminate the worst attribute
Best combined attribute selection and elimination
Decision Tree Induction
A decision tree constructs a flow-chart-like structure where each internal node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each leaf denotes a class prediction.
At each node, the algorithm selects the best attribute to partition the data into subclasses.
All non-selected attributes are considered irrelevant.
Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Tree: the root tests A4, the next level tests A1 and A6, and the leaves predict Class 1 or Class 2]
Reduced attribute set: {A1, A4, A6}
Numerosity Reduction
Numerosity reduction (some simply call it: Data
Reduction)
Replace original data volume by alternative, smaller forms of
data representation
Parametric (Regression):
Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
Non-Parametric:
Do not assume models
Major families: histograms, clustering, sampling, …
Data compression
Regression and Log-Linear
Models
Linear Regression
Data modeled to fit a straight line
Y = w X + b
Two regression coefficients, w and b, specify the line and
are to be estimated by using the data at hand
Multiple Regression
Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the
above
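A small numpy sketch of regression as parametric numerosity reduction: only the two coefficients w and b are kept in place of the raw points (synthetic data):
```python
# Sketch: fit Y = w*X + b and keep only the two parameters instead of the 10 data points.
import numpy as np

X = np.arange(1, 11, dtype=float)
Y = 3.0 * X + 5.0 + np.random.default_rng(1).normal(scale=0.5, size=10)

w, b = np.polyfit(X, Y, deg=1)     # two numbers summarize the whole (X, Y) sample
print(f"Y ~ {w:.2f} * X + {b:.2f}")
```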
Histogram Analysis
Divide data into buckets and store average (sum) for each
bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth)
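A minimal numpy sketch of both partitioning rules, reusing the price list from the binning example later in the section:
```python
# Sketch: equal-width buckets via np.histogram, equal-frequency boundaries via quantiles.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

counts, edges = np.histogram(prices, bins=3)           # equal-width buckets
eq_freq_edges = np.quantile(prices, [0, 1/3, 2/3, 1])  # equal-frequency (equi-depth) boundaries
print(counts, edges, eq_freq_edges)
```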
Clustering
Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
Centroid
Diameter
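A rough scikit-learn sketch that keeps only each cluster's centroid and diameter (synthetic 2-D data; the brute-force diameter computation is for illustration only):
```python
# Sketch: replace the raw points of each cluster by centroid + diameter.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(60, 2))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

for c in range(3):
    members = X[km.labels_ == c]
    diameter = max(np.linalg.norm(a - b) for a in members for b in members)
    print("centroid:", km.cluster_centers_[c].round(2), "diameter:", round(diameter, 2))
```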
Sampling
Sampling: obtaining a small sample s to represent the whole
data set N
Key principle: Choose a representative subset of the data
Simple random sampling may have very poor performance
in the presence of skew
Develop adaptive sampling methods, e.g., stratified
sampling:
Types of Sampling
Simple random sampling
There is an equal probability of selecting any particular item
Sampling without replacement
Once an object is selected, it is removed from the population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the
data)
Used in conjunction with skewed data
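A minimal pandas sketch of the three sampling types on a made-up, skewed data set (70% grade A, 30% grade B):
```python
# Sketch: SRSWOR, SRSWR, and proportional stratified sampling with pandas.
import pandas as pd

df = pd.DataFrame({"grade": ["A"] * 70 + ["B"] * 30, "score": range(100)})

srswor = df.sample(n=10, replace=False, random_state=0)          # without replacement
srswr  = df.sample(n=10, replace=True,  random_state=0)          # with replacement
strat  = df.groupby("grade", group_keys=False) \
           .apply(lambda g: g.sample(frac=0.1, random_state=0))  # ~10% from each stratum
print(strat["grade"].value_counts())
```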
Sampling: With or without
Replacement
[Figure: from the raw data, a simple random sample is drawn either without replacement (SRSWOR) or with replacement (SRSWR)]
Data Compression
String Compression
There are extensive theories and well-tuned algorithms
(Huffman Encoding)
Typically lossless, but only limited manipulation is
possible without expansion (Text Encryption)
Audio/Video Compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
Compression ratio = compressed size / original size
Compression factor = 1 / compression ratio
Data Compression
[Figure: lossless compression maps the original data to compressed data and reconstructs it exactly; lossy compression reconstructs only an approximation of the original data]
Discretization
Discretization divides the range of a continuous attribute into intervals so as to reduce the number of values for that attribute.
Splitting: top-down (the attribute's range is split into sub-ranges)
Merging: bottom-up (initially all values are treated as intervals; some are removed by merging)
Interval labels can then be used to replace actual data values
Reduce data size by discretization
Supervised vs. unsupervised (if class information is used)
Discretization can be performed recursively on an attribute
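A minimal pandas sketch of discretization by splitting: pd.cut gives equal-width intervals and pd.qcut gives equal-frequency intervals (the age values and labels are illustrative):
```python
# Sketch: replace continuous ages by interval labels.
import pandas as pd

age = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 35, 40, 45, 46, 52, 70])

equal_width = pd.cut(age, bins=3, labels=["young", "middle", "old"])   # equal-width intervals
equal_depth = pd.qcut(age, q=3, labels=["low", "mid", "high"])         # equal-frequency intervals
print(pd.DataFrame({"age": age, "width": equal_width, "depth": equal_depth}))
```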
Discretization
Concept Hierarchy
Helps in reducing the data by collecting and replacing low-level concepts with high-level concepts, e.g. replacing mobile number and landline number with telephone number.
Three types of attributes
Nominal—values from an unordered set, e.g., color,
profession
Ordinal—values from an ordered set, e.g., military or
academic rank
Numeric—real numbers, e.g., integer or real numbers
Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning /Histogram
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or
bottom-up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ²) analysis (supervised, bottom-up merge)
Discretization by Classification
& Correlation Analysis
Classification (e.g., decision tree analysis)
Supervised: Given class labels, e.g., cancerous vs. benign
Using entropy to determine split point (discretization point)
Top-down, recursive split
Details to be covered in Chapter 7
Correlation analysis (e.g., Chi-merge: χ2-based discretization)
Supervised: use class information
Bottom-up merge: find the best neighboring intervals (those having
similar distributions of classes, i.e., low χ2 values) to merge
Merge is performed recursively, until a predefined stopping condition is met
How to Handle Noisy Data?
Binning
Smooths sorted data by consulting its neighborhood.
First sort the data and partition it into (equal-frequency) bins.
Then smooth by bin means, bin medians, bin boundaries, etc.
Simple Discretization:
Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform
grid
If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N.
The most straightforward approach, but outliers may dominate the presentation.
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately the same number of samples
Binning Methods for Data
Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
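A small numpy sketch that reproduces the worked example above (equi-depth bins, then smoothing by bin means and by bin boundaries):
```python
# Sketch: 3 equal-frequency bins of the sorted prices, smoothed two ways.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)                       # 3 bins of 4 sorted values each

by_means = np.repeat(bins.mean(axis=1).round().astype(int), 4).reshape(3, 4)

by_bounds = bins.copy()
for b in by_bounds:                               # snap each value to the nearest bin boundary
    lo, hi = b[0], b[-1]
    b[:] = np.where(np.abs(b - lo) <= np.abs(b - hi), lo, hi)

print(by_means)    # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)   # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```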
Data Integration
Combines data from multiple sources into a coherent store
Careful integration can help reduce and avoid redundancies
and inconsistencies.
Entity identification problem:
Identify real world entities from multiple data sources, e.g.,
Bill Clinton = William Clinton
Be careful during schema integration
E.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
For the same real world entity, attribute values from
different sources are different
Possible reasons: different representations, different
scales, e.g., metric vs. British units
Data Integration
Integrate metadata from different sources; metadata can be used to improve the integration.
What is metadata?
Redundant data often occur when integrating multiple databases:
Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis and covariance analysis
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Nominal
Data)
χ² (chi-square) test:
χ² = Σ (Observed - Expected)² / Expected
The larger the χ² value, the more likely the variables are related.
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
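A minimal scipy sketch of the chi-square test on a 2x2 contingency table; the observed counts are hypothetical:
```python
# Sketch: chi-square test of independence between two nominal attributes.
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts, e.g. rows = attribute A categories, columns = attribute B categories
observed = np.array([[250,  200],
                     [ 50, 1000]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 1), p, expected.round(1))   # large chi2 / small p => likely related
```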
Correlation Analysis (Numeric
Data)
Correlation coefficient (also called Pearson’s product moment
coefficient)
rA,B = Σ (ai - mean(A))(bi - mean(B)) / (n σA σB) = (Σ ai·bi - n·mean(A)·mean(B)) / (n σA σB)
where n is the number of tuples, mean(A) and mean(B) are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ ai·bi is the sum of the AB cross-product.
If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
rA,B = 0: uncorrelated (no linear relationship); rA,B < 0: negatively correlated
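A small numpy check of the formula above on illustrative data (the stock values reused from the covariance example later in the section):
```python
# Sketch: Pearson correlation via np.corrcoef vs. the formula above.
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

r = np.corrcoef(A, B)[0, 1]
manual = ((A * B).sum() - len(A) * A.mean() * B.mean()) / (len(A) * A.std() * B.std())
print(round(r, 3), round(manual, 3))   # both give the same value
```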
Covariance (Numeric Data)
Covariance is similar to correlation:
Cov(A,B) = E[(A - mean(A))·(B - mean(B))] = Σ (ai - mean(A))(bi - mean(B)) / n
Correlation coefficient: rA,B = Cov(A,B) / (σA σB)
where n is the number of tuples, mean(A) and mean(B) are the respective mean or expected values of A and B, and σA and σB are the respective standard deviations of A and B.
Covariance (Numeric Data)
Positive covariance: If CovA,B > 0, then A and B both tend to be
larger than their expected values.
Negative covariance: If CovA,B < 0 then if A is larger than its
expected value, B is likely to be smaller than its expected value.
Independence: CovA,B = 0 but the converse is not true:
Some pairs of random variables may have a covariance of 0
but are not independent. Only under some additional
assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence
Co-Variance: An Example
It can be simplified in computation as: Cov(A,B) = E(A·B) - mean(A)·mean(B)
Co-Variance: An Example
Suppose two stocks A and B have the following values in one
week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry
trends, will their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
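A short numpy check of the stock example using the simplified formula Cov(A,B) = E(A·B) - mean(A)·mean(B):
```python
# Sketch: verify the worked covariance example.
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

cov = (A * B).mean() - A.mean() * B.mean()   # 42.4 - 4 * 9.6 = 4.0
print(cov, np.cov(A, B, bias=True)[0, 1])    # bias=True divides by n, matching the slide
```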
Normalization
Min-max normalization: to [new_minA, new_maxA]
v' = ((v - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA
Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
((73,600 - 12,000) / (98,000 - 12,000)) × (1.0 - 0) + 0 = 0.716
Normalization
Z-score normalization (μ: mean, σ: standard deviation):
v' = (v - μA) / σA
Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to (73,600 - 54,000) / 16,000 = 1.225
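A small numpy sketch applying both normalizations to the $73,600 value from the examples above (the other income values in the array are illustrative additions):
```python
# Sketch: min-max and z-score normalization of income values.
import numpy as np

income = np.array([12000, 54000, 73600, 98000], dtype=float)

# Min-max to [0, 1]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score with the slide's mean and standard deviation
mu, sigma = 54000.0, 16000.0
z = (income - mu) / sigma

print(min_max.round(3))   # 73,600 -> 0.716
print(z.round(3))         # 73,600 -> 1.225
```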
Z-score Normalization
Simply put, a z-score (also called a standard score) gives
you an idea of how far from the mean a data point is.
However, technically it’s a measure of how many standard
deviations below or above the population mean a raw score
is.
A z-score can be placed on a normal distribution curve. Z-
scores range from -3 standard deviations (which would fall
to the far left of the normal distribution curve) up to +3
standard deviations (which would fall to the far right of the
normal distribution curve).
In order to use a z-score, you need to know the mean μ and the population standard deviation σ.
Z-score Normalization
Z-scores are a way to compare results to a “normal”
population.
Results from tests or surveys have thousands of possible
results and units; those results can often seem meaningless.
For example, knowing that someone’s weight is 150 pounds
might be good information, but if you want to compare it to
the “average” person’s weight, looking at a vast table of
data can be overwhelming (especially if some weights are
recorded in kilograms).
A z-score can tell you where that person’s weight is
compared to the average population’s mean weight.
Self Study
Following topics are not included in syllabus for
exams/quizzes.
If you are interested in any of these topics, you can come to
my office to learn/discuss.
Segmentation by Natural
Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals.
If an interval covers 3, 6, 7 or 9 distinct values at the
most significant digit, partition the range into 3
(relatively) equi-width intervals
If it covers 2, 4, or 8 distinct values at the most
significant digit, partition the range into 4 intervals
If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 intervals
Example of 3-4-5 Rule
Step 1: From the profit data: Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700.
Step 2: msd = 1,000; rounding Low down and High up at the msd gives Low' = -$1,000 and High' = $2,000, i.e. the range (-$1,000 - $2,000).
Step 3: The range covers 3 distinct values at the msd, so partition it into 3 equi-width intervals: (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000).
Step 4: Adjust the boundaries to cover Min and Max: the first interval shrinks to (-$400 - 0) and the interval ($2,000 - $5,000) is added, giving the overall range (-$400 - $5,000).
Step 5: Recursively apply the 3-4-5 rule to each interval:
(-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
(0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
Concept Hierarchy Generation
Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically
Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric
values for age) by higher level concepts (such as youth,
adult, or senior)
Concept hierarchies can be explicitly specified by domain
experts and/or data warehouse designers
Concept hierarchy can be automatically formed for both
numeric and nominal data. For numeric data, use 3-4-5 rule.
Concept Hierarchy Generation
for Nominal Data
Specification of a partial/total ordering of attributes explicitly
at the schema level by users or experts
street < city < state < country
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the
analysis of the number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest
level of the hierarchy
Exceptions, e.g., weekday, month, quarter, year
country 15 distinct values
province_or_ state 365 distinct values
city 3567 distinct values
street 674,339 distinct values
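A minimal pandas sketch of this heuristic: count distinct values per attribute and order them (the tiny data frame is synthetic; real counts would come from the warehouse):
```python
# Sketch: order location attributes by number of distinct values to suggest a hierarchy.
import pandas as pd

df = pd.DataFrame({
    "country":           ["PK", "PK", "PK", "US"],
    "province_or_state": ["Punjab", "Sindh", "Punjab", "Texas"],
    "city":              ["Lahore", "Karachi", "Multan", "Austin"],
    "street":            ["Mall Rd", "Clifton Rd", "Bosan Rd", "6th St"],
})

order = df.nunique().sort_values()   # fewest distinct values -> top of the hierarchy
print(order)                         # country < province_or_state < city < street
```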