DATA (PRE-)PROCESSING

In the previous class,
• We discussed various types of data, with examples
In this class,
• We focus on data pre-processing – “an important milestone of the Data Mining Process”
Data analysis pipeline
• Mining is not the only step in the analysis process
• Preprocessing: real data is noisy, incomplete and inconsistent;
  data cleaning is required to make sense of the data
  • Techniques: sampling, dimensionality reduction, feature selection
• Post-processing: make the results actionable and useful to the user
  • Statistical analysis of importance & visualization
Why Preprocess the Data?

Measures for Data Quality: A Multidimensional View
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some entries modified but others not, …
• Timeliness: timely update?
• Believability: how trustworthy is the data?
• Interpretability: how easily can the data be understood?
Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of quality data
• Required for both OLAP and Data Mining!
Why can Data be Incomplete?
• Attributes of interest are not available (e.g., customer
  information for sales transaction data)
• Data were not considered important at the time of the
  transactions, so they were not recorded!
• Data were not recorded because of misunderstanding or
  malfunctions
• Data may have been recorded and later deleted!
• Missing/unknown values for some data
Attribute Values
Data is described using attribute values.
Attribute values are numbers or symbols assigned to an attribute.
• Distinction between attributes and attribute values
  • The same attribute can be mapped to different attribute values
    Example: height can be measured in feet or meters
  • Different attributes can be mapped to the same set of values
    Example: attribute values for ID and age are integers
  • But the properties of the attribute values can be different
    • ID has no limit, but age has a maximum and minimum value
Types of Attributes
There are different types of attributes:
• Nominal
  • Examples: ID numbers, eye color, zip codes
• Ordinal
  • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
• Interval
  • Examples: calendar dates
• Ratio
  • Examples: length, time, counts
Discrete and Continuous Attributes
• Discrete Attribute
  • Has only a finite or countably infinite set of values
  • Examples: zip codes, counts, or the set of words in a collection of documents
  • Often represented as integer variables

• Continuous Attribute
  • Has real numbers as attribute values
  • Examples: temperature, height, or weight
  • Practically, real values can only be measured and represented using a finite number of digits
Data Preprocessing
Major Tasks in Data Preprocessing
• Data cleaning
  • Fill in missing values, smooth noisy data, identify or remove outliers (outliers = exceptions!),
    and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for
numerical data
Forms of data preprocessing
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g.,
instrument faults, human or computer error, transmission errors
• Data cleaning tasks
  • Fill in missing values
  • Identify outliers and smooth out noisy data
  • Correct inconsistent data
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is
classification) — not effective when the percentage of missing values per
attribute varies considerably.

• Fill in the missing value manually: tedious + infeasible?

• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!

• Use the attribute mean to fill in the missing value

• Use the attribute mean for all samples belonging to the same class to fill in the
missing value: smarter

• Use the most probable value to fill in the missing value: inference-based such
as Bayesian formula or decision tree
How to Handle Missing Data?

Age   Income   Religion    Gender
23    24,200   Muslim      M
39    ?        Christian   F
45    45,390   ?           F

Fill missing values using aggregate functions (e.g., average) or probabilistic estimates
based on the global value distribution.
E.g., put the average income here, or the most probable income given that the person is
39 years old.
E.g., put the most frequent religion here.
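A minimal pandas sketch of these strategies, using the toy table above (the DataFrame itself and the use of Gender as a stand-in class label are illustrative assumptions):

```python
# Minimal sketch: filling missing values with pandas, using the slide's example data.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Age":      [23, 39, 45],
    "Income":   [24200, np.nan, 45390],
    "Religion": ["Muslim", "Christian", np.nan],
    "Gender":   ["M", "F", "F"],
})

# Strategy 1: attribute mean for numerics, most frequent value for categoricals.
filled = df.copy()
filled["Income"] = filled["Income"].fillna(filled["Income"].mean())
filled["Religion"] = filled["Religion"].fillna(filled["Religion"].mode()[0])

# Strategy 2 (smarter): attribute mean within the same class
# (Gender stands in for a class label here, purely for illustration).
by_class = df.copy()
by_class["Income"] = by_class.groupby("Gender")["Income"].transform(lambda s: s.fillna(s.mean()))

print(filled)
print(by_class)
```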
Data Quality

Data has attribute values.
Then,
how good is our data with respect to these attribute values?
Data Quality
• Examples of data quality problems:
  • Noise and outliers
  • Missing values
  • Duplicate data
Data Quality: Noise
• Noise refers to modification of original values
  • Example: distortion of a person’s voice when talking on a poor phone connection
Data Quality: Outliers
• Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
Data Quality: Missing Values
• Reasons for missing values
  • Information is not collected (e.g., people decline to give their age and weight)
  • Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
• Handling missing values
  • Eliminate data objects
  • Estimate missing values
  • Ignore the missing value during analysis
  • Replace with all possible values (weighted by their probabilities)
Data Quality: Duplicate Data
• Data sets may include data objects that are duplicates, or almost duplicates, of one another
  • A major issue when merging data from heterogeneous sources
• Example:
  • The same person with multiple email addresses
• Data cleaning
  • The process of dealing with duplicate-data issues
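A minimal sketch of duplicate handling with pandas; the names and email addresses are invented, and the name-normalization step is only a crude stand-in for real entity resolution:

```python
# Minimal sketch: detecting and dropping duplicates with pandas (toy data).
import pandas as pd

people = pd.DataFrame({
    "name":  ["J. D. Smith", "John Smith", "Alice Lee"],
    "email": ["jd@example.com", "jd@example.com", "alice@example.com"],
})

# Exact duplicates on a key column are easy to drop.
deduped = people.drop_duplicates(subset="email", keep="first")

# A simple normalization step before matching: lower-case names and strip non-letters,
# so formatting differences do not hide duplicates (real entity resolution needs more).
people["name_key"] = people["name"].str.lower().str.replace(r"[^a-z]", "", regex=True)

print(deduped)
print(people)
```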
Data Quality: Handling Noise
• Binning
  • Sort the data and partition it into (equi-depth) bins
  • Smooth by bin means, bin medians, bin boundaries, etc.
• Regression
  • Smooth by fitting a regression function
• Clustering
  • Detect and remove outliers
• Combined computer and human inspection
  • Detect suspicious values automatically and check them by hand
Data Quality: Handling Noise (Binning)
• Equal-width binning
  • Divides the range into N intervals of equal size (width of intervals: W = (B - A)/N)
  • Simple
  • Outliers may dominate the result

• Equal-depth binning
  • Divides the range into N intervals, each containing approximately the same number of records
  • Skewed data is also handled well
Data Quality: Handling Noise (Binning)
Example: sorted price values: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into three (equi-depth) bins
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
• Smoothing by bin means
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
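The same example can be reproduced with a few lines of plain Python; rounding the bin means to integers mirrors the slide's values:

```python
# Minimal sketch reproducing the slide's equi-depth binning and smoothing.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
depth = len(prices) // n_bins          # 4 values per bin (equi-depth)
bins = [prices[i * depth:(i + 1) * depth] for i in range(n_bins)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closer of the two bin boundaries.
by_boundaries = [[min(b[0], b[-1], key=lambda edge: abs(v - edge)) for v in b] for b in bins]

print(bins)           # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```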
Data Quality: Handling Noise (Regression)
• Replace noisy or missing values by predicted values
• Requires a model of attribute dependencies (which may be wrong!)
• Can be used for data smoothing or for handling missing data
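A minimal sketch of regression-based imputation with numpy, assuming a simple linear dependence of income on age (the data pairs are invented):

```python
# Minimal sketch: regression-based smoothing/imputation with numpy (made-up age/income pairs).
import numpy as np

age    = np.array([23, 31, 39, 45, 52, 60], dtype=float)
income = np.array([24200, 31000, np.nan, 45390, 51000, 58000])

mask = ~np.isnan(income)
slope, intercept = np.polyfit(age[mask], income[mask], deg=1)   # fit income ~ age

# Replace the missing (or noisy) value with the model's prediction.
income[~mask] = slope * age[~mask] + intercept
print(income.round(0))
```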
Data Integration
The process of combining multiple sources into a single dataset. The data
integration process is one of the main components of data management.
• Data integration:
  • Combines data from multiple sources into a coherent store
• Schema integration: integrate metadata from different sources
  • Metadata: data about the data (i.e., data descriptors)
• Entity identification problem: identify real-world entities from multiple data sources,
  e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts: for the same real-world entity, attribute
  values from different sources may differ (e.g., "J. D. Smith" and "John Smith" may refer
  to the same person)
  • Possible reasons: different representations, different scales, e.g., metric vs. British units (inches vs. cm)
Handling Redundant Data in Data Integration
• Redundant data occur often when integrating multiple databases
  • The same attribute may have different names in different databases
  • One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant data may be detected by correlation analysis
• Careful integration of data from multiple sources may help reduce/avoid redundancies
  and inconsistencies and improve mining speed and quality
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Attribute/feature construction
• New attributes constructed from the given ones
Normalization: Why normalize?
• Speeds up some learning techniques (e.g., neural networks)
• Helps prevent attributes with large ranges from outweighing ones with small ranges
• Example:
  • income has range 3,000-200,000
  • age has range 10-80
  • gender has domain M/F
Data Transformation
Data has attribute values.
Then,
can we compare these attribute values?
For example, compare the following two pairs of records:
(1) (5.9 ft, 50 Kg)
(2) (4.6 ft, 55 Kg)
vs.
(3) (5.9 ft, 50 Kg)
(4) (5.6 ft, 56 Kg)
We need data transformation to make records comparable across different dimensions (attributes)...
Data Transformation Techniques
• Normalization: scaled to fall within a small, specified range
  • min-max normalization
  • z-score normalization
  • normalization by decimal scaling
• Centralization:
  • Based on fitting a distribution to the data
  • Distance functions between distributions
    • KL distance
  • Mean centering
Data Transformation: Normalization
Example: Data Transformation
- Assume min and max values for height and weight.
- Now apply min-max normalization to both attributes as follows:
(1) (5.9 ft, 50 Kg)
(2) (4.6 ft, 55 Kg)
vs.
(1) (5.9 ft, 50 Kg)
(2) (5.6 ft, 56 Kg)
- Compare your results...
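A minimal numpy sketch of min-max and z-score normalization; since the slide leaves the assumed min/max open, the per-column min and max are taken from the records themselves:

```python
# Minimal sketch of min-max and z-score normalization on the slide's records.
import numpy as np

records = np.array([[5.9, 50.0],
                    [4.6, 55.0],
                    [5.6, 56.0]])          # columns: height (ft), weight (kg)

# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
col_min, col_max = records.min(axis=0), records.max(axis=0)
minmax = (records - col_min) / (col_max - col_min)

# Z-score normalization: v' = (v - mean) / std
zscore = (records - records.mean(axis=0)) / records.std(axis=0)

# Decimal scaling would instead divide each column by 10^j for the smallest j with |v'| < 1.
print(minmax.round(2))
print(zscore.round(2))
```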
Data Transformation: Aggregation
• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
  • Data reduction
    • Reduce the number of attributes or objects
  • Change of scale
    • Cities aggregated into regions, states, countries, etc.
  • More “stable” data
    • Aggregated data tends to have less variability
Data Transformation: Discretization
• Motivation for discretization
  • Some data mining algorithms only accept categorical attributes
  • May improve understandability of patterns
Data Transformation: Discretization
• Task
  • Reduce the number of values for a given continuous attribute by partitioning the range of the attribute into intervals
  • Interval labels replace actual attribute values
• Methods
  • Binning (as explained earlier)
  • Cluster analysis (discussed later)
  • Entropy-based discretization (supervised)
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
  • Divides the range into N intervals of equal size: uniform grid
  • If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
  • The most straightforward, but outliers may dominate the presentation
  • Skewed data is not handled well

• Equal-depth (frequency) partitioning:
  • Divides the range into N intervals, each containing approximately the same number of samples
  • Good data scaling
  • Managing categorical attributes can be tricky
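A minimal numpy sketch of equal-width partitioning using W = (B - A)/N, reusing the price values from the earlier binning example:

```python
# Minimal sketch of equal-width partitioning with numpy.
import numpy as np

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3
A, B = values.min(), values.max()
W = (B - A) / N                                   # interval width: (34 - 4) / 3 = 10
edges = A + W * np.arange(N + 1)                  # [4, 14, 24, 34]

# np.digitize assigns each value to its interval (1..N); clip so the maximum lands in bin N.
bin_ids = np.clip(np.digitize(values, edges), 1, N)
print(edges, bin_ids)
```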
Data Reduction Strategies
Warehouse may store terabytes of data: Complex data analysis/mining
may take a very long time to run on the complete data set
• Data reduction
• Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same)
analytical results
• Data reduction strategies
• Data cube aggregation
• Dimensionality reduction
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation
Techniques of Data Reduction
Techniques or methods of data reduction in data mining include the following.
Dimensionality Reduction
• Reduce the number of random variables or attributes so that the dimensionality of the data set is reduced
• Combine and merge attributes of the data without losing its essential characteristics; this also reduces storage space and computation time
Numerosity Reduction: Reduce the volume of data
The representation of the data is made smaller by reducing its volume, with little or no loss of information.
• Parametric methods
  • Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  • Log-linear models: obtain the value at a point in m-D space as a product over appropriate marginal subspaces
• Non-parametric methods
  • Do not assume models
  • Major families: histograms, clustering, sampling
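A minimal sketch of sampling as a non-parametric numerosity-reduction step; the 10,000-row data set is synthetic:

```python
# Minimal sketch: simple random sampling without replacement to reduce data volume.
import numpy as np

rng = np.random.default_rng(seed=42)
data = rng.normal(loc=50, scale=10, size=(10_000, 3))   # stand-in for the full data set

sample_size = 500
idx = rng.choice(len(data), size=sample_size, replace=False)
sample = data[idx]

# The sample preserves the broad distribution at a fraction of the volume.
print(data.mean(axis=0).round(2), sample.mean(axis=0).round(2))
```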
Data Cube Aggregation
• The lowest level of a data cube
• the aggregated data for an individual entity of interest
• e.g., a customer in a phone calling data warehouse.
• Multiple levels of aggregation in data cubes
• Further reduce the size of data to deal with
• Reference appropriate levels
• Use the smallest representation which is enough to solve the task
• Queries regarding aggregated information should be answered using
data cube, when possible
Data Compression

[Figure: original data can be compressed losslessly (fully recoverable) or lossily (only an approximation of the original data can be reconstructed).]
Histograms
• A popular data reduction technique
• Divide data into buckets and store the average (or sum) for each bucket
• Can be constructed optimally in one dimension using dynamic programming
• Related to quantization problems
[Figure: example histogram of prices, with buckets ranging from 10,000 to 90,000.]
Histogram types
• Equal-width histograms:
• It divides the range into N intervals of equal size
• Equal-depth (frequency) partitioning:
• It divides the range into N intervals, each containing approximately same number of
samples
• V-optimal:
• It considers all histogram types for a given number of buckets and chooses the one
with the least variance.
• MaxDiff:
• After sorting the data to be approximated, it defines the borders of the buckets at
points where the adjacent values have the maximum difference
• Example: split 1, 1, 4, 5, 5, 7, 9, 14, 16, 18, 27, 30, 30, 32 into three buckets
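A minimal sketch of the MaxDiff example above: the two largest adjacent gaps (18→27 and 9→14) become the bucket borders:

```python
# Minimal sketch of MaxDiff bucket borders for the slide's example (3 buckets).
values = sorted([1, 1, 4, 5, 5, 7, 9, 14, 16, 18, 27, 30, 30, 32])
n_buckets = 3

# Adjacent differences; the (n_buckets - 1) largest gaps define the bucket borders.
gaps = [(values[i + 1] - values[i], i) for i in range(len(values) - 1)]
cut_positions = sorted(i for _, i in sorted(gaps, reverse=True)[: n_buckets - 1])

buckets, start = [], 0
for cut in cut_positions + [len(values) - 1]:
    buckets.append(values[start : cut + 1])
    start = cut + 1

print(buckets)   # [[1, 1, 4, 5, 5, 7, 9], [14, 16, 18], [27, 30, 30, 32]]
```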
Clustering
• Partitions the data set into clusters, and models it by one representative from each cluster
• Can be very effective if the data is clustered, but not if the data is “smeared”
• There are many choices of clustering definitions and clustering algorithms
Cluster Analysis
[Figure: salary vs. age scatter plot; the distance between points in the same cluster should be small, while an outlier lies far from any cluster.]
Hierarchical Reduction
• Use multi-resolution structure with different degrees of reduction
• Hierarchical clustering is often performed but tends to define
partitions of data sets rather than “clusters”
• Parametric methods are usually not amenable to hierarchical
representation
• Hierarchical aggregation
• An index tree hierarchically divides a data set into partitions by value range
of some attributes
• Each partition can be considered as a bucket
• Thus an index tree with aggregates stored at each node is a hierarchical
histogram
Discretization
• Three types of attributes:
• Nominal — values from an unordered set
• Ordinal — values from an ordered set
• Continuous — real numbers
• Discretization:
• divide the range of a continuous attribute into intervals
• why?
• Some classification algorithms only accept categorical
attributes.
• Reduce data size by discretization
• Prepare for further analysis
Discretization and Concept hierarchy
• Discretization
• reduce the number of values for a given continuous attribute
by dividing the range of the attribute into intervals. Interval
labels can then be used to replace actual data values.

• Concept hierarchies
• reduce the data by collecting and replacing low level concepts
(such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior).
Discretization and concept hierarchy
generation for numeric data
• Binning/Smoothing

• Histogram analysis

• Clustering analysis

• Entropy-based discretization

• Segmentation by natural partitioning


Entropy-Based Discretization
• Entropy of an interval S1 with m classes:
  Ent(S1) = - Σ_{i=1..m} p_i log2(p_i)
• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T,
  the expected information I(S, T) after partitioning is
  I(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)
• The boundary that maximizes the information gain over all possible boundaries is selected
  as a binary discretization.
• The process is recursively applied to the partitions obtained as long as the information gain
  exceeds a threshold δ, i.e., while Ent(S) - I(S, T) > δ; it stops once the gain falls below δ.
• Experiments show that it may reduce data size and improve classification accuracy.
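A minimal sketch of a single entropy-based split; the ages and class labels are invented, and a full method would recurse on each partition until the gain drops below δ:

```python
# Minimal sketch of one entropy-based split: try every candidate boundary T and keep
# the one giving the lowest weighted entropy I(S, T), i.e., the highest information gain.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    xs, ys = [p[0] for p in pairs], [p[1] for p in pairs]
    best = None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        T = (xs[i] + xs[i - 1]) / 2                     # candidate boundary between two values
        left, right = ys[:i], ys[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if best is None or weighted < best[1]:
            best = (T, weighted)
    return best

ages   = [23, 25, 31, 39, 45, 52, 60, 64]               # invented toy data
labels = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]
T, I_ST = best_split(ages, labels)
print(T, round(entropy(labels) - I_ST, 3))              # chosen boundary and its information gain
```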
Segmentation by natural partitioning
• Users often like to see numerical ranges partitioned into relatively uniform, easy-to-read
  intervals that appear intuitive or “natural”, e.g., [50-60] rather than [51.223-60.812]
• The 3-4-5 rule can be used to segment numerical data into relatively uniform, “natural” intervals:
  * If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition
    the range into 3 equi-width intervals (2-3-2 for 7)
  * If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range
    into 4 equi-width intervals
  * If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range
    into 5 equi-width intervals
• The rule can be applied recursively to the resulting intervals
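A minimal sketch of one level of the 3-4-5 rule as stated above; the full procedure also trims outliers (e.g., to an inner percentile range) before partitioning and then recurses, which is omitted here:

```python
# Minimal sketch of a single top-level 3-4-5 partition (simplified, no outlier trimming).
import math

def three_four_five(low, high):
    msd = 10 ** math.floor(math.log10(high - low))        # unit of the most significant digit
    lo, hi = math.floor(low / msd) * msd, math.ceil(high / msd) * msd
    distinct = round((hi - lo) / msd)                     # distinct values at the msd

    if distinct == 7:                                     # 7 -> 2-3-2 split
        w = (hi - lo) / 7
        return [lo, lo + 2 * w, lo + 5 * w, hi]
    # 3/6/9 -> 3 intervals, 2/4/8 -> 4 intervals, 1/5/10 -> 5 intervals (default 3 otherwise)
    n = {3: 3, 6: 3, 9: 3, 2: 4, 4: 4, 8: 4, 1: 5, 5: 5, 10: 5}.get(distinct, 3)
    w = (hi - lo) / n
    return [lo + i * w for i in range(n + 1)]

print(three_four_five(51.223, 60.812))   # [51.0, 53.0, 55.0, 57.0, 59.0, 61.0]
```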
Python - Data Wrangling
Data wrangling is the process of cleaning and unifying messy and complex data sets
for easy access and analysis.
Working with raw data sucks.

• Data comes in all shapes and sizes – CSV files, PDFs, stone tablets, .jpg…

• Different files have different formatting – spaces instead of NULLs, extra rows

• “Dirty” data – unwanted anomalies – duplicates
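A minimal pandas sketch of such a wrangling pass; the toy DataFrame stands in for a messy CSV export, and the column names are made up:

```python
# Minimal sketch of a typical wrangling pass with pandas (toy "raw" data).
import pandas as pd

raw = pd.DataFrame({
    " Amount ": ["100", " ", "100", "abc"],
    "Region":   ["north", "north", "north", "south"],
})

raw.columns = raw.columns.str.strip().str.lower()                  # fix inconsistent headers
clean = raw.replace(r"^\s*$", pd.NA, regex=True)                   # spaces instead of NULLs
clean = clean.drop_duplicates()                                    # unwanted duplicate rows
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")  # coerce dirty numerics
clean = clean.dropna(subset=["amount"])                            # drop rows left unusable
print(clean)
```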


Principal Component Analysis

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often
used to reduce the dimensionality of large data sets by transforming a large set of variables
into a smaller one that still contains most of the information in the large set.
Principal Component Analysis or Karhunen-Loève (K-L) Method
• Given N data vectors in k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
  • The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
• Each data vector is a linear combination of the c principal component vectors
• Works for numeric data only
• Used when the number of dimensions is large
Principal Component Analysis
[Figure: X1, X2 are the original axes (attributes); Y1, Y2 are the principal components, with Y1 the significant component (high variance).]
Order the principal components by significance and eliminate the weaker ones.
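A minimal numpy sketch of PCA via eigendecomposition of the covariance matrix; the 2-D synthetic data is for illustration only:

```python
# Minimal sketch of PCA: mean-center, eigendecompose the covariance matrix,
# order components by variance, and keep the top-c components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])   # correlated attributes X1, X2

Xc = X - X.mean(axis=0)                        # mean centering
cov = np.cov(Xc, rowvar=False)                 # covariance of the attributes
eigvals, eigvecs = np.linalg.eigh(cov)         # eigh: for symmetric matrices, ascending eigenvalues

order = np.argsort(eigvals)[::-1]              # order components by significance (variance)
components = eigvecs[:, order]

c = 1                                          # keep only the most significant component (Y1)
reduced = Xc @ components[:, :c]
print(eigvals[order].round(3), reduced.shape)  # explained variances and the reduced data
```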
