ARIN-2137
KNOWLEDGE DISCOVERY AND DATA MINING
TOPIC 2: Data Preparation – Integration & Transformation
Where are we?
Introduction to KDD → KDD steps → Data Preparation
KDD steps:
• Understanding dataset & domain
• Data preparation: do data preprocessing
• Run data mining
• Evaluate result
• Utilize knowledge
Chapter Outline (Data Integration)
• What is data integration
• Issues with data integration
• How to integrate data
Data Integration
“combines data from multiple sources into a coherent store”
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
– Are Student_id (IUB’s database) and Student_# (MOE’s database) referring to the same data?
• Detecting and resolving data value conflicts
– for the same real world entity, attribute values from different sources are different
– possible reasons: different representations, different scales,
e.g., metric vs. British units | km vs. miles
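The two issues above (attribute names that differ per source, and values stored in different units) can be sketched as a small mapping step. This is only an illustration: the source labels "iub" and "moe", the distance attributes, and the sample values are hypothetical; only the Student_id / Student_# pair and the km vs. miles conflict come from the slide.

```python
KM_PER_MILE = 1.609344

# Hypothetical schema map: source-specific attribute name -> shared attribute name.
SCHEMA_MAP = {
    "iub": {"Student_id": "student_id", "distance_km": "distance_km"},
    "moe": {"Student_#": "student_id", "distance_miles": "distance_km"},
}


def to_common_schema(record, source):
    """Rename attributes to the shared schema and convert miles to km."""
    unified = {}
    for name, value in record.items():
        target = SCHEMA_MAP[source][name]
        if name == "distance_miles":
            # Resolve the unit conflict: store every distance in km.
            value = round(value * KM_PER_MILE, 3)
        unified[target] = value
    return unified


# Hypothetical sample rows describing the same student in both databases.
iub_row = to_common_schema({"Student_id": 1001, "distance_km": 12.0}, "iub")
moe_row = to_common_schema({"Student_#": 1001, "distance_miles": 7.5}, "moe")
```

After the mapping, both rows share the same attribute names and unit, so they can be compared or merged directly.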
Data Integration
• Redundant data often occur when multiple databases are integrated
– The same attribute may have different names in different
databases
– One attribute may be a “derived” attribute in another table,
e.g., annual revenue
• Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
Data Integration
Record A
PersonID  Age  Street  Salary
100       27   RYK2    2214.50
110       21   LHR2    1999.00
112       25   RYK1    2105.99
115       30   RYK2    5000.00

Record B
Person_#  Age  Street  Salary
116       29   LHR B   4555.00
100       27   RYK 2   2214.50
119       40   LHR 2   4998.00
110       21   LHR 2   1999.00

New Record
ID   Age  Street  Salary
100  27   RYK 2   2214.50
110  21   LHR 2   1999.00
112  25   RYK1    2105.99
115  30   RYK 2   5000.00
116  29   LHR B   4555.00
119  40   LHR 2   4998.00
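A minimal sketch of how Record A and Record B could be combined into the New Record, assuming that PersonID and Person_# identify the same entity and that street codes differing only in spacing ("RYK2" vs. "RYK 2") are the same street. The function name `integrate` is hypothetical, and the spacing rule is an assumption, not something the slide states.

```python
record_a = [
    {"PersonID": 100, "Age": 27, "Street": "RYK2", "Salary": 2214.50},
    {"PersonID": 110, "Age": 21, "Street": "LHR2", "Salary": 1999.00},
    {"PersonID": 112, "Age": 25, "Street": "RYK1", "Salary": 2105.99},
    {"PersonID": 115, "Age": 30, "Street": "RYK2", "Salary": 5000.00},
]
record_b = [
    {"Person_#": 116, "Age": 29, "Street": "LHR B", "Salary": 4555.00},
    {"Person_#": 100, "Age": 27, "Street": "RYK 2", "Salary": 2214.50},
    {"Person_#": 119, "Age": 40, "Street": "LHR 2", "Salary": 4998.00},
    {"Person_#": 110, "Age": 21, "Street": "LHR 2", "Salary": 1999.00},
]


def integrate(*sources):
    """Merge (rows, id_column) pairs into one list keyed by a shared ID."""
    merged = {}
    for rows, id_column in sources:
        for row in rows:
            unified = {
                "ID": row[id_column],
                "Age": row["Age"],
                # Assumption: remove spaces so "RYK2" and "RYK 2" compare equal.
                "Street": row["Street"].replace(" ", ""),
                "Salary": row["Salary"],
            }
            # Keep the first occurrence of each ID; later copies are redundant.
            merged.setdefault(unified["ID"], unified)
    return [merged[key] for key in sorted(merged)]


new_record = integrate((record_a, "PersonID"), (record_b, "Person_#"))
```

Persons 100 and 110 appear in both sources but only once in `new_record`, which is the redundancy the integration step is meant to remove.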
Major Tasks in Data Preprocessing
Identifying Data Sources → Data Preprocessing → Data Mining → Evaluation & Knowledge Presentation
• Data Cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data Integration
– Integration of multiple databases, data cubes, or files
• Data Transformation
– Normalization & aggregation
• Data Reduction
– Obtains reduced representation in volume but produces the same or similar analytical results
• Data Discretization
– Part of data reduction but with particular importance, especially for numerical data
Chapter Outline (Data Transformation)
• What is data transformation
• Techniques to transform data
• Normalization techniques
Data Transformation
:: Transform data into appropriate form for mining
How?
• Smoothing
– remove noise from data (binning, regression, clustering)
• Aggregation
– summarization, data cube construction
• Generalization
– concept hierarchy climbing: low-level data are replaced by higher-level concepts
• Normalization
– attribute data are scaled within a specified range
– min-max, z-score, decimal scaling
• Attribute/feature construction
– new attributes constructed from the given ones
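Smoothing by binning, one of the techniques named above, can be sketched as follows. This is one common variant (equal-depth bins, each value replaced by its bin mean), chosen here for illustration; the function name and the bin size of 2 are assumptions, and the ages are taken from the integration example.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values into equal-depth bins and replace each by its bin mean."""
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    smoothed = [0.0] * len(values)
    for start in range(0, len(order), bin_size):
        bin_indices = order[start:start + bin_size]
        mean = sum(values[i] for i in bin_indices) / len(bin_indices)
        for i in bin_indices:
            # Write the mean back at the value's original position.
            smoothed[i] = mean
    return smoothed


ages = [27, 21, 25, 30, 29, 40]
smoothed_ages = smooth_by_bin_means(ages, bin_size=2)
# Sorted bins: (21, 25), (27, 29), (30, 40) with means 23, 28, 35
# smoothed_ages -> [28.0, 23.0, 23.0, 35.0, 28.0, 35.0]
```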
Data Transformation
• Smoothing, Aggregation
Original data:
PersonID  Age  Street  Salary
100       27   RYK 2   2214.50
110       21   LHR 2   1999.00
112       25   RYK 1   2105.99
115       30   RYK 2   5000.00
116       29   LHR B   4555.00
119       40   LHR 2   4998.00

Transformed data:
PersonID  Age    Street  Salary
100       Age 2  SP      Sal 2
110       Age 1  PUNJAB  Sal 1
112       Age 1  SP      Sal 2
115       Age 2  SP      Sal 4
116       Age 2  PUNJAB  Sal 4
119       Age 3  PUNJAB  Sal 4

Mapping rules:
Age:
– 20 – 25 → Age 1
– 26 – 30 → Age 2
– 31 – 40 → Age 3
Street:
– RYK 2 → SP
– RYK 1 → SP
– LHR 2 → PUNJAB
– LHR B → PUNJAB
Salary:
– < 2000 → Sal 1
– 2000 < x < 3000 → Sal 2
– 3000 < x < 4000 → Sal 3
– > 4000 → Sal 4
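The mapping rules above can be expressed as a small lookup sketch. The function names are hypothetical, and one assumption is needed: the slide does not say where a salary of exactly 2000, 3000 or 4000 belongs, so this sketch places boundary values in the higher band.

```python
STREET_REGION = {"RYK 2": "SP", "RYK 1": "SP", "LHR 2": "PUNJAB", "LHR B": "PUNJAB"}


def age_group(age):
    if 20 <= age <= 25:
        return "Age 1"
    if 26 <= age <= 30:
        return "Age 2"
    if 31 <= age <= 40:
        return "Age 3"
    raise ValueError(f"age {age} is outside the ranges given on the slide")


def salary_band(salary):
    # Assumption: exact boundary values fall into the higher band.
    if salary < 2000:
        return "Sal 1"
    if salary < 3000:
        return "Sal 2"
    if salary < 4000:
        return "Sal 3"
    return "Sal 4"


def generalize(row):
    """Replace low-level values with their higher-level categories."""
    return {
        "PersonID": row["PersonID"],
        "Age": age_group(row["Age"]),
        "Street": STREET_REGION[row["Street"]],
        "Salary": salary_band(row["Salary"]),
    }


first = generalize({"PersonID": 100, "Age": 27, "Street": "RYK 2", "Salary": 2214.50})
# first -> {"PersonID": 100, "Age": "Age 2", "Street": "SP", "Salary": "Sal 2"}
```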
Data Transformation - Normalization
• Re-scale the data into a suitable range.
• WHY?: To increase processing speed and reduce memory allocation.
• HOW?
– Min-Max
– Z-score
– Decimal scaling
Data Normalization (I)
• Min-Max Normalization: linear transformation of the original data to a newly specified range.
$$y' = \frac{y - \min}{\max - \min}\,(\max' - \min') + \min'$$
where y is the actual data value, [min, max] is the original range and [min', max'] is the new range.
• E.g.: if the actual data range is 5–100 and is to be normalized to the 0–1 range, find the normalized value for data = 50:
$$y' = \frac{y - 5}{100 - 5}\,(1 - 0) + 0 = \frac{y - 5}{95}$$
$$y' = \frac{50 - 5}{95} \approx 0.474$$
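The min-max formula can be checked with a few lines of code. This is a minimal sketch; the function and parameter names are illustrative, not part of the slides.

```python
def min_max(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map value from [old_min, old_max] to [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("old_min and old_max must differ")
    scaled = (value - old_min) / (old_max - old_min)
    return scaled * (new_max - new_min) + new_min


# Worked example from the slide: range 5-100 mapped to 0-1, data = 50.
normalized = min_max(50, 5, 100)
# round(normalized, 3) -> 0.474
```

Passing a different target range, for example `min_max(50, 5, 100, -1, 1)`, reuses the same formula with min' = -1 and max' = 1.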
Data Normalization (II)
• Z-Score Normalization: useful when the actual minimum and maximum are unknown, or when outliers dominate the extreme values.
$$Z = \frac{y - \text{mean}}{\text{standard deviation}}$$
E.g.: the mean and standard deviation of the values for attribute “Salary” are $54,000 and $16,000. Find the new value for the Salary value “73,600” using Z-score.
Z = ???
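A short sketch of z-score normalization, which can be used to check the Salary exercise. The function names are illustrative. When the mean and standard deviation are not given, `z_scores` computes them from the data; it uses the population standard deviation (`statistics.pstdev`), which is an assumption, since the slide does not say whether the population or sample form is intended.

```python
import statistics


def z_score(value, mean, std_dev):
    """Return how many standard deviations value lies from the mean."""
    if std_dev == 0:
        raise ValueError("standard deviation must be non-zero")
    return (value - mean) / std_dev


def z_scores(values):
    """Z-score every value of an attribute using its own mean and std dev."""
    mean = statistics.mean(values)
    # Assumption: population standard deviation; use statistics.stdev
    # instead if the sample standard deviation is wanted.
    std_dev = statistics.pstdev(values)
    return [z_score(v, mean, std_dev) for v in values]


# Salary exercise from the slide: mean 54,000, standard deviation 16,000.
salary_z = z_score(73_600, 54_000, 16_000)
print(salary_z)
```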
Data Normalization (III)
• Decimal Scaling: divide the value by 10 to the power n, where n is the number of digits of the maximum absolute value, so that every scaled value has an absolute value below 1.
$$y' = \frac{y}{10^{n}}$$
• E.g.: Suppose the range of attribute X is −500 to 45. The maximum absolute value of X is 500. To normalize by decimal scaling we divide each value by 1,000 (n = 3). In this case, −500 becomes −0.5 while 45 becomes 0.045.
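Decimal scaling can be sketched as below. Rather than counting digits, this version searches for the smallest n that brings the maximum absolute value below 1, which gives the same n for the example on the slide; the function name is illustrative.

```python
def decimal_scale(values):
    """Divide every value by 10**n, with n the smallest integer such that
    the largest absolute scaled value is below 1."""
    max_abs = max(abs(v) for v in values)
    n = 0
    while max_abs / 10 ** n >= 1:
        n += 1
    return [v / 10 ** n for v in values], n


# Example from the slide: attribute X ranges from -500 to 45.
scaled, n = decimal_scale([-500, 45])
# n -> 3, scaled -> [-0.5, 0.045]
```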
TRY!
No.  Solid  Temp    ForceApplied  Class
1    Yes    Low     125           No
2    No     High    100           No
3    No     Low     70            No
4    Yes    High    120           No
5    No     Medium  95            Yes
6    No     High    60            No
7    Yes    Medium  220           No
8    No     Low     85            Yes
9    No     High    75            No
10   No     Low     90            Yes

1. Normalize attribute ‘ForceApplied’ into the range of -1 to 1.
2. Using Z-score normalization, transform ALL data in the 4th column.