CH2 Data Integration - Transformation

Uploaded by

Hunzila Nisar

ARIN-2137
KNOWLEDGE DISCOVERY AND DATA MINING

TOPIC 2: Data Preparation - Integration & Transformation

Where are we? Introduction to KDD

KDD steps:
• Understand the dataset and domain
• Data preparation: do data preprocessing
• Run data mining
• Evaluate the results
• Utilize the knowledge

Chapter Outline (Data Integration)

• What is data integration
• Issues with data integration
• How to integrate data

Data Integration
“combines data from multiple sources into a coherent store”

• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
– Example: do Student_id (IUB’s database) and Student_# (MOE’s database) refer to the same data?

• Detecting and resolving data value conflicts
– for the same real-world entity, attribute values from different sources are different
– possible reasons: different representations or different scales, e.g., metric vs. British units (km vs. miles)
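
The two issues above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the distance attributes, and the shared names `customer_id` and `distance_km` are hypothetical and do not come from the slides. The idea is to map each source's attribute names to one shared schema and to convert values to one unit while doing so.

```python
# Hypothetical rows from two sources describing the same customers.
# Source A keys customers by "cust-id" and stores distances in kilometres;
# source B keys them by "cust-#" and stores distances in miles.
source_a = [{"cust-id": 1, "distance_km": 10.0}]
source_b = [
    {"cust-#": 1, "distance_miles": 6.2},
    {"cust-#": 2, "distance_miles": 3.0},
]

KM_PER_MILE = 1.609344  # definition of the international mile

# Schema integration: map each source's attribute names to one shared schema.
SCHEMA_MAP_A = {"cust-id": "customer_id", "distance_km": "distance_km"}
SCHEMA_MAP_B = {"cust-#": "customer_id", "distance_miles": "distance_km"}


def to_common_schema(row, schema_map, converters=None):
    """Rename attributes and convert values so every source looks the same."""
    converters = converters or {}
    unified = {}
    for name, value in row.items():
        convert = converters.get(name, lambda v: v)  # default: keep the value
        unified[schema_map[name]] = convert(value)
    return unified


unified_a = [to_common_schema(r, SCHEMA_MAP_A) for r in source_a]
unified_b = [
    # Resolve the value conflict: miles are converted to kilometres.
    to_common_schema(r, SCHEMA_MAP_B, {"distance_miles": lambda m: m * KM_PER_MILE})
    for r in source_b
]

print(unified_a)
print(unified_b)
```

After this step both sources use the same key name and the same unit, so records can be compared on `customer_id` directly.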

Data Integration
• Redundant data often occur when multiple databases are integrated
– The same attribute may have different names in different databases
– One attribute may be a “derived” attribute in another table, e.g., annual revenue

• Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality
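
One way to spot a derived attribute is to test whether it can be recomputed from other attributes. The sketch below assumes, purely for illustration, that annual revenue is the sum of four quarterly figures; the rows and the attribute names `q1` to `q4` are hypothetical.

```python
# Hypothetical integrated rows: "annual_revenue" came from one source,
# the quarterly figures from another.
rows = [
    {"q1": 10.0, "q2": 12.0, "q3": 11.0, "q4": 13.0, "annual_revenue": 46.0},
    {"q1": 5.0, "q2": 5.5, "q3": 6.0, "q4": 6.5, "annual_revenue": 23.0},
]


def is_derived_sum(rows, target, sources, tol=1e-9):
    """Return True if `target` equals the sum of `sources` in every row."""
    return all(
        abs(row[target] - sum(row[s] for s in sources)) <= tol for row in rows
    )


def drop_attribute(rows, name):
    """Return copies of the rows without the attribute `name`."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


quarters = ["q1", "q2", "q3", "q4"]
if is_derived_sum(rows, "annual_revenue", quarters):
    # The attribute carries no extra information, so it can be dropped.
    reduced = drop_attribute(rows, "annual_revenue")
else:
    reduced = rows

print(reduced)
```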
Data Integration

Record A
PersonID  Age  Street  Salary
100       27   RYK2    2214.50
110       21   LHR2    1999.00
112       25   RYK1    2105.99
115       30   RYK2    5000.00

Record B
Person_#  Age  Street  Salary
100       27   RYK 2   2214.50
110       21   LHR 2   1999.00
116       29   LHR B   4555.00
119       40   LHR 2   4998.00

New Record
ID   Age  Street  Salary
100  27   RYK 2   2214.50
110  21   LHR 2   1999.00
112  25   RYK1    2105.99
115  30   RYK 2   5000.00
116  29   LHR B   4555.00
119  40   LHR 2   4998.00
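
The integration of Record A and Record B can be sketched as follows. This is a minimal illustration, not the method used to build the slide: it assumes that PersonID and Person_# identify the same entity, that the first record seen for an ID wins, and that every street code should be written with a space (so "RYK1" becomes "RYK 1", which differs from the New Record table for ID 112). The helper names `normalize_street` and `integrate` are made up for this example.

```python
import re

# Rows copied from the Record A and Record B tables.
record_a = [
    {"PersonID": 100, "Age": 27, "Street": "RYK2", "Salary": 2214.50},
    {"PersonID": 110, "Age": 21, "Street": "LHR2", "Salary": 1999.00},
    {"PersonID": 112, "Age": 25, "Street": "RYK1", "Salary": 2105.99},
    {"PersonID": 115, "Age": 30, "Street": "RYK2", "Salary": 5000.00},
]
record_b = [
    {"Person_#": 100, "Age": 27, "Street": "RYK 2", "Salary": 2214.50},
    {"Person_#": 110, "Age": 21, "Street": "LHR 2", "Salary": 1999.00},
    {"Person_#": 116, "Age": 29, "Street": "LHR B", "Salary": 4555.00},
    {"Person_#": 119, "Age": 40, "Street": "LHR 2", "Salary": 4998.00},
]


def normalize_street(street):
    """Write the code as '<letters> <suffix>', e.g. 'RYK2' -> 'RYK 2'."""
    return re.sub(r"^([A-Z]+)\s*(\w+)$", r"\1 \2", street.strip())


def integrate(rows_a, rows_b):
    """Merge both sources into one table keyed by a shared 'ID' attribute."""
    merged = {}
    for key_name, rows in (("PersonID", rows_a), ("Person_#", rows_b)):
        for row in rows:
            person_id = row[key_name]
            # setdefault keeps the first record seen for an ID,
            # so persons present in both sources appear only once.
            merged.setdefault(person_id, {
                "ID": person_id,
                "Age": row["Age"],
                "Street": normalize_street(row["Street"]),
                "Salary": row["Salary"],
            })
    return [merged[k] for k in sorted(merged)]


new_record = integrate(record_a, record_b)
for row in new_record:
    print(row)
```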

Major Tasks in Data Preprocessing

Identifying Data Sources → Data Preprocessing → Data Mining → Evaluation & Knowledge Presentation

• Data Cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data Integration: integration of multiple databases, data cubes, or files
• Data Transformation: normalization & aggregation
• Data Reduction: obtains a reduced representation in volume but produces the same or similar analytical results
• Data Discretization: part of data reduction but with particular importance, especially for numerical data

Chapter Outline (Data Transformation)

• What is data transformation
• Techniques to transform data
• Normalization techniques

Data Transformation

:: Transform data into appropriate form for mining

• Smoothing: remove noise from data (binning, regression, clustering)
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing, low-level data → higher-level concepts
• Normalization: attribute data are scaled within a specified range. How?
– min-max
– z-score
– decimal scaling
• Attribute/feature construction: new attributes constructed from the given ones

DATA TRANSFORMATION

• Smoothing, Aggregation

Before:
PersonID  Age  Street  Salary
100       27   RYK 2   2214.50
110       21   LHR 2   1999.00
112       25   RYK 1   2105.99
115       30   RYK 2   5000.00
116       29   LHR B   4555.00
119       40   LHR 2   4998.00

After:
PersonID  Age    Street  Salary
100       Age 2  SP      Sal 2
110       Age 1  PUNJAB  Sal 1
112       Age 1  SP      Sal 2
115       Age 2  SP      Sal 4
116       Age 2  PUNJAB  Sal 4
119       Age 3  PUNJAB  Sal 4

Mapping rules:
Age:    20 – 25 → Age 1;  26 – 30 → Age 2;  31 – 40 → Age 3
Street: RYK 1 → SP;  RYK 2 → SP;  LHR 2 → PUNJAB;  LHR B → PUNJAB
Salary: x < 2000 → Sal 1;  2000 ≤ x < 3000 → Sal 2;  3000 ≤ x < 4000 → Sal 3;  x ≥ 4000 → Sal 4
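
The mapping rules can be applied programmatically. The sketch below is one possible reading of them: it assumes integer ages and assigns the salary boundary values 2000, 3000 and 4000 to the higher bin. The function names are illustrative only.

```python
# Street codes generalized to a region, as listed in the mapping rules.
STREET_TO_REGION = {
    "RYK 1": "SP",
    "RYK 2": "SP",
    "LHR 2": "PUNJAB",
    "LHR B": "PUNJAB",
}


def generalize_age(age):
    """Map an integer age to its age group."""
    if 20 <= age <= 25:
        return "Age 1"
    if 26 <= age <= 30:
        return "Age 2"
    if 31 <= age <= 40:
        return "Age 3"
    raise ValueError(f"age {age} is outside the defined groups")


def generalize_salary(salary):
    """Map a salary to its band (boundary values go to the higher band)."""
    if salary < 2000:
        return "Sal 1"
    if salary < 3000:
        return "Sal 2"
    if salary < 4000:
        return "Sal 3"
    return "Sal 4"


# (PersonID, Age, Street, Salary) rows from the table before transformation.
people = [
    (100, 27, "RYK 2", 2214.50),
    (110, 21, "LHR 2", 1999.00),
    (112, 25, "RYK 1", 2105.99),
    (115, 30, "RYK 2", 5000.00),
    (116, 29, "LHR B", 4555.00),
    (119, 40, "LHR 2", 4998.00),
]

transformed = [
    (pid, generalize_age(age), STREET_TO_REGION[street], generalize_salary(salary))
    for pid, age, street, salary in people
]
for row in transformed:
    print(row)
```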
Data Transformation - Normalization
• Re-scale the data into a suitable range.

• WHY? To increase processing speed and reduce memory allocation.

• HOW?
– Min-max
– Z-score
– Decimal scaling

Data Normalization (I)
• Min-Max Normalization → linear transformation of the original data to a newly specified range.

\[ y' = \frac{\text{actual data} - \min}{\max - \min} \times (\max' - \min') + \min' \]

• E.g.: if the actual data range is 5 – 100 and is to be normalized to the 0 – 1 range, find the normalized value for data = 50:

\[ y' = \frac{\text{actual data} - 5}{100 - 5} \times (1 - 0) + 0 = \frac{\text{actual data} - 5}{95} \times 1 \]

\[ y' = \frac{50 - 5}{95} \times 1 \approx 0.474 \]
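
The min-max formula translates directly into code. The following is a minimal sketch; the function name and parameter names are chosen for this example, and the guard against a zero-width range is an added assumption that the slide does not discuss.

```python
def min_max_normalize(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map `value` from [old_min, old_max] to [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("old_min and old_max must differ")
    return (value - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


# Worked example from the slide: range 5 to 100, target range 0 to 1, data = 50.
print(round(min_max_normalize(50, 5, 100), 3))  # 0.474
```

Passing a different target range, for example `new_min=-1.0, new_max=1.0`, uses the same formula with max' = 1 and min' = -1.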
Data Normalization (II)
• Z-Score Normalization → useful when the extreme values (minimum and maximum) are unknown or when outliers dominate the extreme values.

\[ Z = \frac{\text{actual data} - \text{mean value}}{\text{standard deviation}} \]

E.g.: if the mean and standard deviation of the values for attribute “Salary” are $54,000 and $16,000, respectively, find the new value for the Salary value “73,600” using Z-score.
= ???
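
The z-score formula can be checked with a short script. This is a sketch under assumptions: the slide does not say whether the population or the sample standard deviation is meant, so the column helper below uses the population version (`statistics.pstdev`); both function names are illustrative.

```python
import statistics


def z_score(value, mean, std_dev):
    """Return (value - mean) / std_dev."""
    if std_dev == 0:
        raise ValueError("standard deviation must be non-zero")
    return (value - mean) / std_dev


def z_score_column(values):
    """Z-score every value of a column using the column's own mean and
    population standard deviation (an assumption, see the note above)."""
    mean = statistics.mean(values)
    std_dev = statistics.pstdev(values)
    return [z_score(v, mean, std_dev) for v in values]


# Salary example from the slide: mean 54,000, standard deviation 16,000.
salary_z = z_score(73_600, 54_000, 16_000)
print(salary_z)
```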
Data Normalization (III)
• Decimal Scaling → divide the value by 10 to the power n, where n is the number of digits of the maximum absolute value.

\[ y' = \frac{y}{10^{n}} \]

• E.g.: Suppose the range of attribute X is −500 to 45. The maximum absolute value of X is 500. To normalize by decimal scaling, we divide each value by 1,000 (n = 3). In this case, −500 becomes −0.5, while 45 becomes 0.045.
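
Decimal scaling as defined above can be sketched like this. The sketch assumes n counts the digits of the integer part of the maximum absolute value and uses n = 0 when that maximum is below 1; the function name is illustrative.

```python
def decimal_scale(values):
    """Divide every value by 10**n, where n is the number of digits in the
    integer part of the largest absolute value. Returns (scaled, n)."""
    max_abs = max(abs(v) for v in values)
    # Assumption: values already below 1 in magnitude are left unscaled.
    n = len(str(int(max_abs))) if max_abs >= 1 else 0
    return [v / 10 ** n for v in values], n


# Example from the slide: X ranges from -500 to 45, so n = 3.
scaled, n = decimal_scale([-500, 45])
print(n, scaled)  # 3 [-0.5, 0.045]
```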

TRY!

    Solid  Temp    ForceApplied  Class
1   Yes    Low     125           No
2   No     High    100           No
3   No     Low     70            No
4   Yes    High    120           No
5   No     Medium  95            Yes
6   No     High    60            No
7   Yes    Medium  220           No
8   No     Low     85            Yes
9   No     High    75            No
10  No     Low     90            Yes

1. Normalize the attribute ‘ForceApplied’ into the range -1 to 1.

2. Using Z-score normalization, transform ALL data in the 4th column.
