
ARIN-2137
KNOWLEDGE DISCOVERY AND DATA MINING

TOPIC 2:
Data Preparation – Integration & Transformation

1
Where are we? Introduction to KDD

KDD steps:
• Understanding the dataset & domain
• Data preparation (do data preprocessing)
• Run data mining
• Evaluate results
• Utilize knowledge

2
Chapter Outline (Data Integration)

• What is data integration
• Issues with data integration
• How to integrate data

3
Data Integration
"Combines data from multiple sources into a coherent store."

• Schema integration
  – Integrate metadata from different sources
  – Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
  – Example: Student_id (IUB's database) vs. Student_# (MOE's database) – are they referring to the same data?

• Detecting and resolving data value conflicts
  – For the same real-world entity, attribute values from different sources differ
  – Possible reasons: different representations, different scales, e.g., metric vs. British units (km vs. miles)

4
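The sketch below is not from the slides; it illustrates resolving one of the value conflicts mentioned above (metric vs. British units) before combining sources. The table and column names (source_a, source_b, dist_km, dist_miles) are hypothetical.

```python
# Hypothetical sketch: resolve a unit conflict (miles vs. km) before integration.
import pandas as pd

source_a = pd.DataFrame({"cust_id": [1, 2], "dist_km": [12.0, 30.5]})
source_b = pd.DataFrame({"cust_id": [3, 4], "dist_miles": [5.0, 8.2]})

# Convert the British unit to the metric unit so both sources use the same scale.
source_b["dist_km"] = source_b["dist_miles"] * 1.609344
source_b = source_b.drop(columns=["dist_miles"])

combined = pd.concat([source_a, source_b], ignore_index=True)
print(combined)
```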
Data Integration
• Redundant data often occur when integrating multiple databases
  – The same attribute may have different names in different databases
  – One attribute may be a "derived" attribute in another table, e.g., annual revenue

• Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies, and improve mining speed and quality

5
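The slide does not prescribe a method for finding redundant attributes; one common check for a numeric "derived" attribute is correlation analysis. The sketch below uses hypothetical column names.

```python
# Sketch: flag possibly redundant numeric attributes via pairwise correlation.
import pandas as pd

df = pd.DataFrame({
    "monthly_revenue": [100.0, 250.0, 400.0, 320.0],
    "annual_revenue":  [1200.0, 3000.0, 4800.0, 3840.0],  # derived: 12 * monthly
    "age":             [27, 21, 25, 30],
})

corr = df.corr()  # Pearson correlation matrix
# A correlation close to +/-1 suggests one attribute may be derivable from the other.
redundant_pairs = [(a, b) for a in corr.columns for b in corr.columns
                   if a < b and abs(corr.loc[a, b]) > 0.95]
print(redundant_pairs)  # [('annual_revenue', 'monthly_revenue')]
```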
Data Integration

Record A
PersonID   Age   Street   Salary
100        27    RYK 2    2214.50
110        21    LHR 2    1999.00
112        25    RYK 1    2105.99
115        30    RYK 2    5000.00

Record B
Person_#   Age   Street   Salary
116        29    LHR B    4555.00
119        40    LHR 2    4998.00

New Record (after integration)
ID     Age   Street   Salary
100    27    RYK 2    2214.50
110    21    LHR 2    1999.00
112    25    RYK 1    2105.99
115    30    RYK 2    5000.00
116    29    LHR B    4555.00
119    40    LHR 2    4998.00

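A minimal pandas sketch of the merge shown above: PersonID and Person_# are identified as the same attribute, renamed to a common key ID, and the two record sets are appended.

```python
# Sketch of the slide's example: integrate Record A and Record B into one table.
import pandas as pd

record_a = pd.DataFrame({
    "PersonID": [100, 110, 112, 115],
    "Age": [27, 21, 25, 30],
    "Street": ["RYK 2", "LHR 2", "RYK 1", "RYK 2"],
    "Salary": [2214.50, 1999.00, 2105.99, 5000.00],
})
record_b = pd.DataFrame({
    "Person_#": [116, 119],
    "Age": [29, 40],
    "Street": ["LHR B", "LHR 2"],
    "Salary": [4555.00, 4998.00],
})

# Entity identification: PersonID and Person_# refer to the same attribute.
new_record = pd.concat(
    [record_a.rename(columns={"PersonID": "ID"}),
     record_b.rename(columns={"Person_#": "ID"})],
    ignore_index=True,
)
print(new_record)
```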

Transformers

7
Major Tasks in Data Preprocessing

Identifyin
Data Data Evaluation &
g knowledge
Preprocessing Mining Presentation
Data
Sources

Data Data Data Data Data


Cleaning Integration Transformati Reduction Discretization
on
Fill in missing Integration Normalizatio Obtains reduced Part of data
values, smooth of multiple n& representation reduction
noisy data, databases, aggregation in volume but but with
identify or remove data cubes, produces the particular
outliers, and or files same or similar importance,
resolve analytical especially 8
inconsistencies results for
numerical
Chapter Outline (Data Transformation)

• What is data transformation
• Techniques to transform data
• Normalization techniques

9
Data Transformation

:: Transform data into an appropriate form for mining

How?
• Smoothing: remove noise from data (binning, regression, clustering) – a bin-means sketch follows this slide
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing – replace low-level data with higher-level concepts
• Normalization: attribute data are scaled to fall within a specified range
  – min-max
  – z-score
  – decimal scaling
• Attribute/feature construction: new attributes constructed from the given ones

10
DATA TRANSFORMATION

• Smoothing, Aggregation

Before:
PersonID   Age   Street   Salary
100        27    RYK 2    2214.50
110        21    LHR 2    1999.00
112        25    RYK 1    2105.99
115        30    RYK 2    5000.00
116        29    LHR B    4555.00
119        40    LHR 2    4998.00

After:
PersonID   Age     Street   Salary
100        Age 2   SP       Sal 2
110        Age 1   PUNJAB   Sal 1
112        Age 1   SP       Sal 2
115        Age 2   SP       Sal 4
116        Age 2   PUNJAB   Sal 4
119        Age 3   PUNJAB   Sal 4

Mapping rules:
Age:     20 – 25 → Age 1;  26 – 30 → Age 2;  31 – 40 → Age 3
Street:  RYK 2 → SP;  RYK 1 → SP;  LHR 2 → PUNJAB;  LHR B → PUNJAB
Salary:  < 2000 → Sal 1;  2000 < x < 3000 → Sal 2;  3000 < x < 4000 → Sal 3;  > 4000 → Sal 4
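A sketch of the aggregation/generalization shown above, reproducing the slide's Age, Street, and Salary mappings with pandas; the exact bin edges are inferred from the rules listed.

```python
# Sketch: generalize the slide's table using binning and a concept mapping.
import pandas as pd

df = pd.DataFrame({
    "PersonID": [100, 110, 112, 115, 116, 119],
    "Age": [27, 21, 25, 30, 29, 40],
    "Street": ["RYK 2", "LHR 2", "RYK 1", "RYK 2", "LHR B", "LHR 2"],
    "Salary": [2214.50, 1999.00, 2105.99, 5000.00, 4555.00, 4998.00],
})

# Age: 20-25 -> Age 1, 26-30 -> Age 2, 31-40 -> Age 3
df["Age"] = pd.cut(df["Age"], bins=[19, 25, 30, 40], labels=["Age 1", "Age 2", "Age 3"])
# Street: generalize to province-level concepts
df["Street"] = df["Street"].map(
    {"RYK 2": "SP", "RYK 1": "SP", "LHR 2": "PUNJAB", "LHR B": "PUNJAB"})
# Salary: < 2000 -> Sal 1, 2000-3000 -> Sal 2, 3000-4000 -> Sal 3, > 4000 -> Sal 4
df["Salary"] = pd.cut(df["Salary"], bins=[0, 2000, 3000, 4000, float("inf")],
                      labels=["Sal 1", "Sal 2", "Sal 3", "Sal 4"])
print(df)
```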
Data Transformation – Normalization

• Re-scale the data into a suitable (appropriate) range.

• WHY?: to increase processing speed and reduce memory allocation.

• HOW?
  – Min-max
  – Z-score
  – Decimal scaling
Data Normalization (I)
• Min-Max Normalization → a linear transformation of the original data to a newly specified range [min', max']:

  y' = (actual data − min) / (max − min) × (max' − min') + min'

• E.g.: if the actual data range 5 – 100 is to be normalized to the 0 – 1 range, find the normalized value for data = 50:

  y' = (actual data − 5) / (100 − 5) × (1 − 0) + 0
  y' = (actual data − 5) / 95 × 1
  y' = (50 − 5) / 95 × 1 ≈ 0.474
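A small sketch of the min-max formula above, reproducing the slide's worked example.

```python
# Min-max normalization: map a value from [data_min, data_max] to [new_min, new_max].
def min_max(value, data_min, data_max, new_min=0.0, new_max=1.0):
    return (value - data_min) / (data_max - data_min) * (new_max - new_min) + new_min

# Slide example: range 5-100 rescaled to 0-1, value 50
print(round(min_max(50, 5, 100), 3))  # 0.474
```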
Data Normalization (II)
• Z-Score Normalization → useful when the minimum and maximum values are unknown or when outliers dominate the extreme values.

  Z = (actual data − mean) / standard deviation

• E.g.: if the mean and standard deviation of the attribute "Salary" are $54,000 and $16,000, find the new value for the Salary value 73,600 using the Z-score.

  Z = ???
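A small sketch of the z-score formula above; applied to the slide's numbers it gives (73,600 − 54,000) / 16,000 = 1.225.

```python
# Z-score normalization: centre on the mean and scale by the standard deviation.
def z_score(value, mean, std):
    return (value - mean) / std

# Slide example: mean = 54,000, std = 16,000, value = 73,600
print(z_score(73_600, 54_000, 16_000))  # 1.225
```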
DATA NORMALIZATION (III)
• Decimal Scaling → divide each value by 10^n, where n is the number of digits of the maximum absolute value:

  y' = y / 10^n

• E.g.: suppose the range of attribute X is −500 to 45. The maximum absolute value of X is 500, so to normalize by decimal scaling we divide each value by 1,000 (n = 3). In this case, −500 becomes −0.5 while 45 becomes 0.045.

15
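A small sketch of decimal scaling as defined above, reproducing the −500 to 45 example.

```python
# Decimal scaling: divide by 10**n, where n is the digit count of the maximum absolute value.
def decimal_scale(values):
    n = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** n for v in values]

# Slide example: X ranges from -500 to 45, max |X| = 500, so divide by 1,000 (n = 3)
print(decimal_scale([-500, 45]))  # [-0.5, 0.045]
```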
TRY !

     Solid   Temp     ForceApplied   Class
1    Yes     Low      125            No
2    No      High     100            No
3    No      Low       70            No
4    Yes     High     120            No
5    No      Medium    95            Yes
6    No      High      60            No
7    Yes     Medium   220            No
8    No      Low       85            Yes
9    No      High      75            No
10   No      Low       90            Yes

1. Normalize the attribute 'ForceApplied' into the range of −1 to 1.

2. Using Z-score normalization, transform ALL data in the 4th column.

16
