
Data Mining

DATA PRE-PROCESSING

Instructor: Qurat-ul-Ain
quratulain.ssc@stmu.edu.pk
Background
Other names: also known as data scrubbing or data cleansing.
More than arranging data: a data warehouse (DWH) is not just about arranging data; the data must also be clean for the overall health of the organization.
Big problem, big effect: an enormous problem, as most real-world data is dirty.
Data duplication: the original problem was removing duplicates within one system; it is compounded by duplicates coming from many systems.


Three Classes of Anomalies
Syntactically Dirty Data
  Lexical Errors
  Irregularities
Semantically Dirty Data
  Integrity Constraint Violation
  Business Rule Contradiction
  Duplication
Coverage Anomalies
  Missing Attributes
  Missing Records
Syntactically Dirty Data
1. Lexical Errors
Discrepancies between the structure of the data items and the specified format of the stored values, e.g. the number of columns used for a tuple is unexpected (mixed-up number of attributes).
Example: every tuple is expected to consist of 4 values, one for each attribute. A missing ZIP value causes the HDATE value to shift into the ZIP column. This may occur when an employee's ZIP code is unavailable at the time the data is transferred to the data warehouse.
Syntactically Dirty Data
2. Irregularities
Irregularities are concerned with the non-uniform use of values, units and abbreviations.
For example, an annual salary reported without its unit: is it in US$ or in PKR?


Semantically Dirty Data
1. Integrity constraint violation
2. Business rule contradiction (e.g. DoB > hiring date)
3. Duplication
Duplicate anomalies are often caused during the process of integrating different data sources into a single data warehouse.
Coverage or Lack of It
1. Missing Attribute
Missing attribute values are the most common failure in data warehouses.
They result from omissions while collecting the data.
Coverage or Lack of It
2. Missing Tuple
Missing tuples occur in a data warehouse when entities that exist in the real world are not represented by tuples in the database.
Example: the table "Employee" consists of three tuples. The employee "Smith" works for the company, but the corresponding tuple is missing from the table.
Why Coverage Anomalies?
Equipment malfunction (bar code reader, keyboard, etc.)
Inconsistent with other recorded data and thus deleted.
Data not entered due to misunderstanding/illegibility.
Data not considered important at the time of entry (e.g. Y2K).
Handling Missing Data
 Dropping records.

 “Manually” filling missing values.

 Using a global constant as filler.

 Using the attribute mean (or median) as filler.

 Using the most probable value as filler.


Data Preprocessing: An Overview
No quality data, no quality results!
Quality decisions must be based on quality data; e.g., duplicate or missing data may cause incorrect or even misleading statistics.
A data warehouse needs consistent integration of quality data.
Data Preprocessing
Data quality dimensions:
  Accuracy
  Completeness
  Consistency
Corresponding problems:
  Inaccuracy
  Incompleteness
  Inconsistency
Major Tasks in Data Preprocessing
Data cleaning
  Parsing, correcting, standardization, matching, consolidation.
  Dealing with missing data.
  Dealing with incorrect and noisy data.
  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
Data integration
  Integration of multiple databases or files.
  The same concept may have different names in different sources, causing inconsistencies and redundancies.
  Some attributes may be inferred from others, e.g. annual revenue.
Major Tasks in Data Preprocessing
Data reduction
  Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results; the volume of data is reduced to make analysis easier.
  Why data reduction? A database/data warehouse may store terabytes of data, and complex data analysis may take a very long time to run on the complete data set.
  Dimensionality reduction: attribute subset selection, attribute construction.
Data Reduction Strategies
There are three data reduction strategies
Dimensionality Reduction

Numerosity Reduction

Data Compression
Data Reduction Strategies
Numerosity reduction: the data is replaced by an alternative, smaller representation. It can be of two types:
  Parametric: only the parameters of a model fitted to the data (plus possible outliers) are stored instead of the actual data, e.g. regression and log-linear models.
  Non-parametric: the data is stored in a reduced form such as histograms, clusters, or samples.


Major Tasks in Data Preprocessing
Data transformation and data discretization
  Normalization: scaling values to a smaller range, e.g. [0, 1].
  E.g. normalizing age and salary to [0, 1]; without normalization, distance-based calculations may produce skewed results.
  Concept hierarchy generation: raw data values are replaced by ranges or higher-level concepts.
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g. instrument faults, human or computer error, transmission errors.
  Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g. Occupation=" " (missing data).
  Noisy: containing noise, errors, or outliers, e.g. Salary="−10" (an error).
  Inconsistent: containing discrepancies in codes or names, e.g. Age="42" but Birthday="03/07/2010".
Incomplete (Missing) Data
Data is not always available
  E.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to
  equipment malfunction
  inconsistency with other recorded data, leading to deletion
  certain data not being considered important at the time of entry
Missing data may need to be inferred (e.g. filled in manually, or with the mean or most probable value).


How to Handle Missing Data?
Ignore the tuple:
  Usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually:
  tedious + infeasible?
Fill it in automatically with
  a global constant, e.g. "unknown" (which may effectively form a new class?)
  a central tendency measure:
    mean for symmetric data
    median for skewed data
How to Handle Missing Data?
The attribute mean for all samples belonging to the same class:
  For example, when filling in annual income grouped by credit risk, use the mean income of customers in the same class.
The most probable value: inference-based, e.g. a Bayesian formula or a decision tree.
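
A minimal sketch of these fillers in Python with pandas; the library choice, column names, and values are illustrative assumptions, since the slides do not prescribe a tool.

```python
# Illustrative sketch of the fillers above; pandas and the toy columns are assumptions.
import pandas as pd

df = pd.DataFrame({
    "credit_risk":   ["low", "low", "high", "high", "low"],
    "annual_income": [52000, None, 31000, None, 48000],
    "occupation":    ["clerk", None, "driver", "driver", None],
})

# Global constant as filler for a nominal attribute.
df["occupation"] = df["occupation"].fillna("unknown")

# Overall mean (symmetric data) or median (skewed data) as filler.
df["income_mean"] = df["annual_income"].fillna(df["annual_income"].mean())

# Class-conditional mean: fill income from customers of the same credit-risk class.
df["income_by_class"] = df.groupby("credit_risk")["annual_income"] \
                          .transform(lambda s: s.fillna(s.mean()))
print(df)
```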
Noisy Data
Noise: random error in a measured variable.
Incorrect attribute values may be due to
  faulty data collection instruments
  data entry problems
  data transmission problems

How to Handle Noisy Data?
Regression
  Smooth the data by fitting it to a regression function.
  Find the best line to fit two attributes so that one attribute can be used to predict the other's value.
Clustering
  Detect and remove outliers.
Combined computer and human inspection
  Detect suspicious values automatically and have a human check them (e.g. to deal with possible outliers).


Dimensionality Reduction
Dimensionality reduction: remove unimportant attributes or reduce the number of input variables.
It represents the original data in a compressed or reduced form by applying data encoding and transformation.
An encoding mechanism is used to reduce the size.
If the original data can be reconstructed from the compressed data without losing any information, the reduction is lossless.
If the reconstructed data is only an approximation of the original data, the reduction is lossy.
Common techniques:
  Wavelet transforms
  Principal Components Analysis (PCA)
  Feature subset selection, feature creation
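
As a rough illustration, PCA can be applied with scikit-learn (an assumed choice; the synthetic data below stands in for a real table).

```python
# Sketch: project 4 correlated attributes onto 2 principal components (a lossy reduction).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                      # 100 tuples, 4 attributes
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)     # make one attribute nearly redundant

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                   # reduced representation
print(X_reduced.shape)                             # (100, 2)
print(pca.explained_variance_ratio_)               # variance retained per component
```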
Feature/Attribute Selection
Feature selection is the method of reducing the input variables to your model by using only relevant data and getting rid of noise in the data.
It is the process of automatically choosing relevant features for your machine learning model based on the type of problem you are trying to solve. The main families of feature selection methods are:
  Filter methods: information gain, chi-square test, correlation coefficient
  Wrapper methods: recursive feature elimination, genetic algorithms
  Embedded methods: decision trees
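
A minimal sketch of a filter method, assuming scikit-learn and its bundled Iris data purely for illustration.

```python
# Keep the k features with the highest chi-square score w.r.t. the class label.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)            # 4 non-negative numeric features
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)                      # chi-square score per feature
print(selector.get_support())                # mask of the 2 selected features
```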
Feature/Attribute Subset Selection
Remove redundant and irrelevant attributes.
Redundant attributes
  Duplicate much or all of the information contained in one or more other attributes.
  E.g., the purchase price of a product and the amount of sales tax paid.
Irrelevant attributes
  Contain no information that is useful for the data mining task at hand.
  E.g., students' IDs are often irrelevant to the task of predicting students' GPA.
Approaches to attribute subset selection:
  Domain expert
  Heuristic search
Heuristic Search in Attribute Selection
There are 2^d possible attribute combinations of d attributes, so exhaustive search is not feasible.
Typical heuristic attribute selection methods:
  Best single attribute under the attribute independence assumption: choose by significance tests (see Chapter 7).
  Best step-wise feature selection:
    The best single attribute is picked first.
    Then the next best attribute conditional on the first, ...
  Step-wise attribute elimination:
    Repeatedly eliminate the worst attribute.
  Best combined attribute selection and elimination.


Decision Tree Induction
A decision tree is a flow-chart-like structure where each internal node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.
At each node, the algorithm selects the best attribute to partition the data into subclasses.
All non-selected attributes are considered irrelevant.
Decision Tree Induction
Example: initial attribute set {A1, A2, A3, A4, A5, A6}. The induced tree tests A4 at the root, then A1 and A6, with leaves predicting Class 1 or Class 2.
Reduced attribute set: {A1, A4, A6}
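
A sketch of this idea, under the assumption that scikit-learn is used and that attributes the fitted tree never splits on (importance 0) are treated as irrelevant; the data is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))                        # attributes A1..A6
y = (X[:, 0] + X[:, 3] - X[:, 5] > 0).astype(int)    # class depends on A1, A4, A6

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
selected = [f"A{i + 1}" for i, imp in enumerate(tree.feature_importances_) if imp > 0]
print(selected)   # typically a subset such as ['A1', 'A4', 'A6']
```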
Numerosity Reduction
Numerosity reduction (some simply call it data reduction)
  Replace the original data volume by alternative, smaller forms of data representation.
Parametric (e.g. regression):
  Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers).
Non-parametric:
  Do not assume a model.
  Major families: histograms, clustering, sampling, ...
Data compression
Regression and Log-Linear Models
Linear regression
  Data modeled to fit a straight line: Y = wX + b
  Two regression coefficients, w and b, specify the line and are estimated from the data at hand.
Multiple regression
  Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector:
  Y = b0 + b1·X1 + b2·X2
  Many nonlinear functions can be transformed into the above.
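
A minimal sketch of parametric numerosity reduction: fit Y = wX + b and keep only the two coefficients instead of the raw points (numpy and the synthetic data are assumptions).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=1000)
Y = 3.0 * X + 5.0 + rng.normal(scale=0.5, size=1000)

w, b = np.polyfit(X, Y, deg=1)       # estimate the regression coefficients
print(round(w, 2), round(b, 2))      # ~3.0 and ~5.0: store these, discard the raw data
```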
Histogram Analysis
Divide the data into buckets and store the average (or sum) for each bucket.
Partitioning rules:
  Equal-width: equal bucket range
  Equal-frequency (or equal-depth): equal number of values per bucket
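
The two partitioning rules, sketched with numpy (an assumed choice) on the same price list used in the binning example later in these slides.

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 buckets of equal range.
counts, edges = np.histogram(prices, bins=3)
print(edges, counts)                          # [ 4. 14. 24. 34.] and items per bucket

# Equal-frequency (equal-depth): boundaries at the 1/3 and 2/3 quantiles.
print(np.quantile(prices, [0, 1/3, 2/3, 1]))
```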


Clustering
Partition the data set into clusters based on similarity, and store only a cluster representation, e.g.:
  Centroid
  Diameter
Sampling
Sampling: obtaining a small sample s to represent the whole data set N.
Key principle: choose a representative subset of the data.
  Simple random sampling may have very poor performance in the presence of skew.
  Develop adaptive sampling methods, e.g. stratified sampling.
Types of Sampling
Simple random sampling
  There is an equal probability of selecting any particular item.
Sampling without replacement
  Once an object is selected, it is removed from the population.
Sampling with replacement
  A selected object is not removed from the population.
Stratified sampling
  Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data).
  Used in conjunction with skewed data.
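
A small sketch of these schemes, assuming numpy/pandas and a skewed toy data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
data = pd.DataFrame({"cls": ["A"] * 90 + ["B"] * 10, "value": range(100)})

# Simple random sampling without replacement (SRSWOR) and with replacement (SRSWR).
srswor = rng.choice(data.index, size=10, replace=False)
srswr = rng.choice(data.index, size=10, replace=True)

# Stratified sampling: draw ~10% from each class so the rare class is represented.
stratified = data.groupby("cls").sample(frac=0.1, random_state=0)
print(stratified["cls"].value_counts())   # roughly 9 A's and 1 B
```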
Sampling: With or without Replacement
Figure: the raw data is sampled by SRSWOR (simple random sampling without replacement) and by SRSWR (simple random sampling with replacement).
Data Compression
String compression
  There are extensive theories and well-tuned algorithms (e.g. Huffman encoding).
  Typically lossless, but only limited manipulation is possible without expansion (text encryption).
Audio/video compression
  Typically lossy compression, with progressive refinement.
  Sometimes small fragments of the signal can be reconstructed without reconstructing the whole.
Compression ratio = compressed size / original size
Compression factor = 1 / compression ratio


Data Compression
Figure: lossless compression reproduces the original data exactly; lossy compression yields an approximation of the original data.
Discretization
Discretization divides the range of a continuous attribute into intervals so as to reduce the number of values for that attribute.
  Splitting: top-down (the attribute range is split into sub-ranges of values).
  Merging: bottom-up (initially all values are considered as intervals; some are removed by merging).
Interval labels can then be used to replace actual data values.
Reduce data size by discretization.
Supervised vs. unsupervised (depending on whether class information is used).
Discretization can be performed recursively on an attribute.


Discretization
Concept hierarchy
  Helps reduce the data by collecting low-level concepts and replacing them with high-level concepts, e.g. replacing mobile number and landline number with "telephone number".
Three types of attributes
  Nominal: values from an unordered set, e.g. color, profession
  Ordinal: values from an ordered set, e.g. military or academic rank
  Numeric: real numbers, e.g. integer or real values
Data Discretization Methods
Typical methods (all can be applied recursively):
  Binning/histogram analysis (top-down split, unsupervised)
  Clustering analysis (unsupervised, top-down split or bottom-up merge)
  Decision-tree analysis (supervised, top-down split)
  Correlation (e.g. χ2) analysis (unsupervised, bottom-up merge)
Discretization by Classification & Correlation Analysis
Classification (e.g. decision tree analysis)
  Supervised: given class labels, e.g. cancerous vs. benign
  Uses entropy to determine the split point (discretization point)
  Top-down, recursive split
  Details to be covered in Chapter 7
Correlation analysis (e.g. ChiMerge: χ2-based discretization)
  Supervised: uses class information
  Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e. low χ2 values) to merge
  Merging is performed recursively, until a predefined stopping condition

How to Handle Noisy Data?
Binning
  Smooths sorted data by consulting its neighborhood:
  first sort the data and partition it into (equal-frequency) bins,
  then smooth by bin means, bin medians, bin boundaries, etc.
Simple Discretization: Binning
Equal-width (distance) partitioning
  Divides the range into N intervals of equal size: a uniform grid.
  If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N.
  The most straightforward approach, but outliers may dominate the presentation.
Equal-depth (frequency) partitioning
  Divides the range into N intervals, each containing approximately the same number of samples.
Binning Methods for Data
Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
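
A plain-Python sketch that reproduces the worked example above (rounding bin means to whole dollars is an assumption made to match the slide).

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]          # already sorted
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]       # equi-depth, 4 per bin

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_boundaries = [[b[0] if x - b[0] <= b[-1] - x else b[-1] for x in b] for b in bins]

print(by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```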
Data Integration
Combines data from multiple sources into a coherent store.
Careful integration can help reduce and avoid redundancies and inconsistencies.
Entity identification problem:
  Identify real-world entities across multiple data sources, e.g. Bill Clinton = William Clinton.
  Be careful with schema integration, e.g. A.cust-id vs. B.cust-#.
Detecting and resolving data value conflicts
  For the same real-world entity, attribute values from different sources may differ.
  Possible reasons: different representations, different scales, e.g. metric vs. British units.
Data Integration
Integrate metadata from different sources; metadata (data that describes the data) can be used to improve the integration.
Redundant data often occurs when integrating multiple databases.
  Derivable data: one attribute may be a "derived" attribute in another table, e.g. annual revenue.
  Redundant attributes may be detected by correlation analysis and covariance analysis.
Careful integration of data from multiple sources helps reduce/avoid redundancies and inconsistencies and improves mining speed and quality.
Correlation Analysis (Nominal Data)
Χ2 (chi-square) test:
  χ² = Σ (Observed − Expected)² / Expected
The larger the Χ2 value, the more likely the variables are related.
The cells that contribute the most to the Χ2 value are those whose actual count is very different from the expected count.
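
A sketch of the test on a small made-up contingency table of two nominal attributes, assuming SciPy is available.

```python
from scipy.stats import chi2_contingency

# Rows: gender; columns: preferred reading (fiction / non-fiction). Counts are illustrative.
observed = [[250, 200],
            [ 50, 1000]]

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)      # a large chi-square (tiny p-value) suggests the attributes are related
print(expected)     # expected counts under the independence assumption
```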
Correlation Analysis (Numeric Data)
Correlation coefficient (also called Pearson's product-moment coefficient):
  r(A,B) = Σ (ai − Ā)(bi − B̄) / (n·σA·σB) = (Σ ai·bi − n·Ā·B̄) / (n·σA·σB)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ ai·bi is the sum of the AB cross-products.
If r(A,B) > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
r(A,B) = 0: uncorrelated; r(A,B) < 0: negatively correlated.
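
A quick numpy check of the coefficient on two small illustrative attributes.

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

r = np.corrcoef(A, B)[0, 1]    # Pearson's product-moment coefficient
print(round(r, 3))             # about 0.94: A and B are strongly positively correlated
```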
Covariance (Numeric Data)
Covariance is similar to correlation:
  Cov(A, B) = Σ (ai − Ā)(bi − B̄) / n
where n is the number of tuples, Ā and B̄ are the respective means (expected values) of A and B, and σA and σB are the respective standard deviations of A and B.
Correlation coefficient in terms of covariance: r(A,B) = Cov(A, B) / (σA·σB)
Covariance (Numeric Data)
Positive covariance: if Cov(A,B) > 0, then A and B both tend to be larger than their expected values.
Negative covariance: if Cov(A,B) < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
Independence: if A and B are independent, Cov(A,B) = 0, but the converse is not true:
  Some pairs of random variables may have a covariance of 0 but not be independent. Only under some additional assumptions (e.g. the data follow multivariate normal distributions) does a covariance of 0 imply independence.
Co-Variance: An Example
The computation can be simplified as Cov(A, B) = E(A·B) − Ā·B̄.


Co-Variance: An Example
Suppose two stocks A and B have the following values over one week:
  (2, 5), (3, 8), (5, 10), (4, 11), (6, 14)
Question: if the stocks are affected by the same industry trends, will their prices rise or fall together?
  E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
  E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
  Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
Thus, A and B rise together since Cov(A, B) > 0.
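
The same result can be checked with numpy (an assumed choice); bias=True gives the population covariance used in the slide.

```python
import numpy as np

A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]

cov_AB = np.cov(A, B, bias=True)[0, 1]   # population covariance, dividing by n
print(cov_AB)                            # 4.0, so the prices tend to rise together
```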


Normalization
Min-max normalization: to [new_minA, new_maxA]
  v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
Ex.: let income range from $12,000 to $98,000 and be normalized to [0.0, 1.0]. Then $73,600 is mapped to
  (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
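
A one-function sketch reproducing the income example (plain Python; the helper name is illustrative).

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Min-max normalization: map v from [min_a, max_a] onto [new_min, new_max].
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
```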
Normalization
Z-score normalization (μ: mean, σ: standard deviation):
  v' = (v − μA) / σA
Ex.: let μ = 54,000 and σ = 16,000. Then
  (73,600 − 54,000) / 16,000 = 1.225
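
And the z-score example, in the same style.

```python
def z_score(v, mean, std):
    # Z-score normalization: how many standard deviations v lies from the mean.
    return (v - mean) / std

print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
```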
Z-score Normalization
Simply put, a z-score (also called a standard score) gives you an idea of how far a data point is from the mean.
Technically, it is a measure of how many standard deviations below or above the population mean a raw score is.
A z-score can be placed on a normal distribution curve. Z-scores typically range from −3 standard deviations (far left of the normal distribution curve) up to +3 standard deviations (far right of the curve).
In order to use a z-score, you need to know the mean μ and the standard deviation σ of the population.
Z-score Normalization
Z-scores are a way to compare results to a "normal" population.
Results from tests or surveys have thousands of possible results and units; those results can often seem meaningless.
For example, knowing that someone's weight is 150 pounds might be good information, but if you want to compare it to the "average" person's weight, looking at a vast table of data can be overwhelming (especially if some weights are recorded in kilograms).
A z-score can tell you where that person's weight lies compared to the average population's mean weight.


Self Study
 Following topics are not included in syllabus for
exams/quizzes.
 If you are interested in any of these topics, you can come to

my office to learn/discuss.
Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals.
  If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 (relatively) equi-width intervals.
  If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals.
  If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals.


Example of 3-4-5 Rule
Step 1: from the profit data, Min = −$351, Low (5th percentile) = −$159, High (95th percentile) = $1,838, Max = $4,700.
Step 2: the most significant digit is msd = 1,000, so Low is rounded down to −$1,000 and High is rounded up to $2,000.
Step 3: the range (−$1,000 – $2,000) is partitioned into 3 equi-width intervals: (−$1,000 – 0], (0 – $1,000], ($1,000 – $2,000].
Step 4: adjust for the actual Min and Max. Since Min = −$351 lies well inside −$1,000, the leftmost interval shrinks to (−$400 – 0]; since Max = $4,700 exceeds $2,000, a new interval ($2,000 – $5,000] is added, giving the overall range (−$400 – $5,000).
Each top-level interval is then partitioned recursively: (−$400 – 0] into four $100 intervals, (0 – $1,000] and ($1,000 – $2,000] into five $200 intervals each, and ($2,000 – $5,000] into three $1,000 intervals.
Concept Hierarchy Generation
A concept hierarchy organizes concepts (i.e., attribute values) hierarchically.
Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior).
Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers.
Concept hierarchies can be automatically formed for both numeric and nominal data; for numeric data, use the 3-4-5 rule.
Concept Hierarchy Generation for Nominal Data
Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  street < city < state < country
Specification of only a partial set of attributes
  E.g., only street < city, not others
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set.
The attribute with the most distinct values is placed at the lowest level of the hierarchy.
Exceptions, e.g., weekday, month, quarter, year.
  country: 15 distinct values
  province_or_state: 365 distinct values
  city: 3,567 distinct values
  street: 674,339 distinct values
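
A minimal sketch of the distinct-value heuristic, assuming pandas and a toy location table.

```python
import pandas as pd

locations = pd.DataFrame({
    "country":  ["PK", "PK", "PK", "PK", "US"],
    "province": ["Punjab", "Punjab", "Punjab", "Sindh", "Texas"],
    "city":     ["Lahore", "Lahore", "Multan", "Karachi", "Austin"],
    "street":   ["Mall Rd", "Davis Rd", "Abdali Rd", "Shahrah-e-Faisal", "6th St"],
})

# Fewer distinct values -> higher level of the generated hierarchy.
print(locations.nunique().sort_values())   # country < province < city < street
```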
