DATA INTEGRATION &
DATA DISCRETIZATION
Data Mining Irhamah
Data Integration
Data integration ingests, transforms, and integrates structured and unstructured data and
delivers it to a scalable data warehouse platform, using traditional ETL (Extract,
Transform, Load) tools and methodologies to collect data from various sources into a
single data warehouse:
Data ingestion is the process of obtaining and importing data for immediate use
or storage in a database. To ingest something is to "take something in or absorb
something."
It includes both technical processes and business logic to transform data from disparate
sources into cohesive, meaningful data with quality, governance, and compliance
considerations.
It is the combination of technical and business processes used to combine data from
disparate sources into meaningful and valuable information. A complete data integration
solution delivers trusted data from a variety of sources.
It is the traditional domain of ETL (Extract, Transform and Load), which transforms and cleans the
data as it is extracted from various data sources and loaded into one data store
(data warehouse). For example, converting a single "address" variable into "street
address", "city", "state", and "zip code" fields.
Source: KDnuggets
Data Integration
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales (e.g.,
metric vs. British units, such as weight in kg vs. pounds)
Problems in Data Integration
Different attribute names
Different units: sales in $ vs. sales in Yen
Different scales: Rp, millions of Rp, billions of Rp
Derived attributes: monthly salary vs. annual salary
Problems in Data Integration (2)
The customer with customer-id 150 has 3 children in
relation1 and 4 children in relation2
The annual salary computed from the monthly salary in relation1
does not match the "annual-salary" attribute in relation2
Handling Redundancy in Data Integration
Redundant data often occur when multiple databases are integrated
Object identification: the same attribute or object may have different
names in different databases
Derivable data: one attribute may be a "derived" attribute in another
table, e.g., annual revenue
An attribute (a column or feature of the data set) is called redundant if it can
be derived from any other attribute or set of attributes. Inconsistencies in
attribute or dimension naming can also lead to redundancies in the data
set.
Redundant attributes may be detected by
correlation/covariance analysis or association analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining speed
and quality
Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson's product moment coefficient):

$$ r_{X,Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} $$

where n is the number of tuples, and $\bar{x}$ and $\bar{y}$ are the respective means of X and Y.
If r_{X,Y} > 0, X and Y are positively correlated (X's values increase
as Y's do); the higher the value, the stronger the correlation.
r_{X,Y} = 0: uncorrelated; r_{X,Y} < 0: negatively correlated
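A minimal sketch (not from the slides; the data and names are illustrative) of computing this coefficient to flag a possibly redundant numeric attribute:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation of two numeric attributes."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# annual_salary is exactly 12 * monthly_salary, so r = 1.0:
# a strong hint that one of the two attributes is redundant.
monthly_salary = np.array([3.0, 4.5, 2.8, 6.1, 5.0])
annual_salary = 12 * monthly_salary
print(pearson_r(monthly_salary, annual_salary))  # 1.0
```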
Association Analysis (Categorical Data)
The Chi-Square Test of Independence determines whether there is an
association between categorical variables (i.e., whether the variables are
independent or related). It is a nonparametric test.
Data Requirements
• Two categorical variables.
• Two or more categories (groups) for each variable.
• Independence of observations.
There is no relationship between the subjects in each group.
The categorical variables are not "paired" in any way (e.g. pre-
test/post-test observations).
• Relatively large sample size.
Expected frequencies for each cell are at least 1.
Expected frequencies should be at least 5 for the majority (80%) of the
cells.
Association Analysis (2)
The null hypothesis (H0) and alternative hypothesis (H1) of the Chi-Square
Test of Independence can be expressed in two different but equivalent
ways:
H0: "[Variable 1] is independent of [Variable 2]"
H1: "[Variable 1] is not independent of [Variable 2]"
OR
H0: "[Variable 1] is not associated with [Variable 2]"
H1: "[Variable 1] is associated with [Variable 2]―
Association Analysis (3)
Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)      200 (360)        450
Not like science fiction     50 (210)    1000 (840)       1050
Sum (col.)                  300          1200             1500
χ² (chi-square) calculation (the numbers in parentheses above are the expected
counts, calculated from the marginal totals of the two categories):

$$ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93 $$

This large value shows that like_science_fiction and play_chess are associated
(not independent).
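The same contingency table can be checked with SciPy; this is a sketch, not part of the original slides, and correction=False disables the Yates continuity correction so the result matches the manual calculation above:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250,  200],    # like science fiction
                     [ 50, 1000]])   # does not like science fiction
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~507.93, as in the manual calculation
print(expected)  # [[ 90. 360.] [210. 840.]] -- the counts in parentheses above
print(p)         # far below 0.05, so reject independence
```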
Example of Data Redundancy
We have a data set having three attributes- person_name, is_male, is_female.
is_male is 1 if the corresponding person is a male else it is 0 .
is_female is 1 if the corresponding person is a female else it is 0.
Observe that if a person is not male
(i.e., is_male is 0 for that person_name), then
the person is surely female (since there are only two
values in the output class: male and female). This implies that
the two attributes are perfectly correlated and one
attribute determines the other. Hence, one of these
attributes is redundant, and one of the two
can be dropped without any loss of information.
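A small sketch (the data are made up) showing how this redundancy appears as a perfect correlation in pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "person_name": ["A", "B", "C", "D"],
    "is_male":     [1, 0, 0, 1],
    "is_female":   [0, 1, 1, 0],
})
# Correlation of -1.0: one attribute completely determines the other.
print(df["is_male"].corr(df["is_female"]))
# Drop one of the two columns without losing any information.
df = df.drop(columns=["is_female"])
```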
Discretization
The concept is the same as a histogram:
• Divide the domain of a numeric attribute into intervals.
• Replace the attribute values with labels for those intervals.
Example:
– Dataset (age; salary):
(25; 30,000), (30; 80,000), (27; 50,000), (60; 70,000), (50; 55,000), (28; 25,000)
– Discretized dataset (age, discretizedSalary):
(25, low), (30, high), (27, medium), (60, high), (50, medium), (28, low)
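A sketch reproducing this example with pandas.cut; the bin edges 40,000 and 60,000 are assumed here only to recover the low/medium/high labels shown above:

```python
import pandas as pd

salary = pd.Series([30_000, 80_000, 50_000, 70_000, 55_000, 25_000])
discretized = pd.cut(salary,
                     bins=[0, 40_000, 60_000, float("inf")],
                     labels=["low", "medium", "high"])
print(list(discretized))  # ['low', 'high', 'medium', 'high', 'medium', 'low']
```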
Discretization
Three types of attributes:
Nominal — values from an unordered set, e.g., color, profession
Ordinal — values from an ordered set, e.g., military or academic
rank
Continuous — numeric values, e.g., integers or real numbers
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis
Discretization and Concept Hierarchy
Discretization
Reduce the number of values for a given continuous attribute by dividing the
range of the attribute into intervals
Interval labels can then be used to replace actual data values
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low level concepts
(such as numeric values for age) by higher level concepts (such as young,
middle-aged, or senior)
Discretization and Concept Hierarchy Generation for
Numeric Data
Typical methods: All the methods can be applied recursively
Binning
Histogram analysis
Clustering analysis
Entropy-based discretization: supervised, top-down split
Interval merging by χ² analysis: unsupervised, bottom-up merge
Segmentation by natural partitioning: top-down split, unsupervised
Binning
Bins represent the intervals into which you want to group
the source data (input data). The intervals must be consecutive, non-
overlapping, and usually of equal size.
Binning
Binning or discretization is the process of transforming numerical
variables into categorical counterparts.
An example is to bin values for Age into categories such as 20-39, 40-
59, and 60-79. Numerical variables are usually discretized in
modeling methods based on frequency tables (e.g., decision trees).
Moreover, binning may improve the accuracy of predictive models by
reducing noise or non-linearity. Finally, binning allows easy
identification of outliers and of invalid and missing values of numerical
variables.
Unsupervised Binning
Unsupervised binning methods transform numerical variables into
categorical counterparts but do not use the target (class)
information. Equal Width and Equal Frequency are two
unsupervised binning methods.
1- Equal Width Binning
The algorithm divides the data into k intervals of equal size.
The width of the intervals is
w = (max - min) / k
and the interval boundaries are
min + w, min + 2w, ..., min + (k-1)w
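A minimal equal-width binning sketch (the data and k are illustrative):

```python
import numpy as np

x = np.array([5, 7, 12, 35, 36, 38, 70, 90, 95])
k = 3
w = (x.max() - x.min()) / k              # w = (max - min) / k = 30
edges = x.min() + w * np.arange(k + 1)   # [ 5. 35. 65. 95.]
# Assign each value to a bin index 0 .. k-1 using the interior boundaries.
bins = np.digitize(x, edges[1:-1])
print(edges, bins)                       # bins: [0 0 0 1 1 1 2 2 2]
```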
Unsupervised Binning (2)
2- Equal Frequency Binning
The algorithm divides the data into k groups, each of which
contains approximately the same number of values. For both
methods, the best way to determine k is to look at the
histogram and try different numbers of intervals or groups.
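A minimal equal-frequency binning sketch with pandas.qcut (same illustrative data, k = 3):

```python
import pandas as pd

x = pd.Series([5, 7, 12, 35, 36, 38, 70, 90, 95])
groups = pd.qcut(x, q=3, labels=["low", "medium", "high"])
print(groups.value_counts())  # each group holds roughly the same number of values
```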
Unsupervised Binning (3)
3- Other Methods
Rank: the rank of a number is its size relative to the other values of a
numerical variable. First we sort the list of values, then we assign the
position of each value as its rank. Equal values receive the same rank,
but the presence of duplicate values affects the ranks of subsequent
values (e.g., 1, 2, 3, 3, 5). Rank is a solid binning method with one
major drawback: values can have different ranks in different lists.
Quantiles (median, quartiles, percentiles, ...): quantiles are also
very useful binning methods, but like rank, a value can fall into a
different quantile if the list of values changes.
Math functions: for example, FLOOR(LOG(X)) is an effective
binning method for numerical variables with highly skewed
distributions (e.g., income).
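Short sketches of the three transforms just mentioned (rank, quantiles, FLOOR(LOG(X))); the income values are made up:

```python
import numpy as np
import pandas as pd

income = pd.Series([900, 1200, 1200, 15_000, 52_000, 250_000])
print(income.rank(method="min"))           # ties share a rank: 1, 2, 2, 4, 5, 6
print(pd.qcut(income, q=4, labels=False))  # quartile number (0-3) for each value
print(np.floor(np.log10(income)))          # FLOOR(LOG(X)) for a skewed variable
```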
Supervised Binning
Supervised binning methods transform numerical variables into
categorical counterparts and refer to the target (class) information
when selecting discretization cut points. Entropy-based binning is an
example of a supervised binning method.
The entropy-based method uses a split approach. The entropy (or
information content) is calculated based on the class label.
The goal of the algorithm is to find the split with the maximum
information gain (so that the bins are as pure as possible, i.e., the
majority of the values in a bin have the same class label)
The boundary that minimizes the entropy over all possible boundaries is
selected
The process is recursively applied to partitions obtained until some
stopping criterion is met
Such a boundary may reduce data size and improve classification
accuracy
Example: Discretize the temperature variable using
entropy-based binning algorithm.
Step 1: Calculate "Entropy" for the target.
O-Ring Failure:  Y = 7,  N = 17
E(Failure) = E(7, 17) = E(0.29, 0.71) = -0.29 × log2(0.29) - 0.71 × log2(0.71) = 0.871
Step 2: Calculate "Entropy" for the target given a bin (T is the value used to split S into S1&S2)
O-Ring Failure counts by bin:  Temperature <= 60: Y = 3, N = 0;  Temperature > 60: Y = 4, N = 17
E(Failure, Temperature) = P(<=60) × E(3, 0) + P(>60) × E(4, 17) = 3/24 × 0 + 21/24 × 0.70 = 0.615
Step 3: Calculate "Information Gain" given a bin.
Difference in entropy between the original set (S) and the weighted
split (S1 + S2):
Information Gain(Failure, Temperature) = 0.871 - 0.615 = 0.256
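A sketch reproducing these three steps in Python (the class counts are taken directly from the example above):

```python
import numpy as np

def entropy(counts):
    """Entropy of a class-count vector, e.g. entropy([7, 17])."""
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

e_target = entropy([7, 17])                                      # Step 1: ~0.871
e_split = (3/24) * entropy([3, 0]) + (21/24) * entropy([4, 17])  # Step 2: ~0.615
print(e_target - e_split)                                        # Step 3: gain ~0.256
```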
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2
using boundary T, the information (weighted entropy) after partitioning is

$$ I(S, T) = \frac{|S_1|}{|S|}\,Entropy(S_1) + \frac{|S_2|}{|S|}\,Entropy(S_2) $$

Entropy is calculated based on the class distribution of the samples in the set. Given
m classes, the entropy of S1 is

$$ Entropy(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i) $$

where p_i is the probability of class i in S1
The boundary that minimizes the entropy function over all possible boundaries is
selected as a binary discretization
The process is recursively applied to partitions obtained until some stopping
criterion is met
Such a boundary may reduce data size and improve classification accuracy
Interval Merge by χ² Analysis
Merging-based (bottom-up) vs. splitting-based methods
Merge: Find the best neighboring intervals and merge them to form larger
intervals recursively
ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]
Initially, each distinct value of a numerical attr. A is considered to be one
interval
χ² tests are performed for every pair of adjacent intervals
Adjacent intervals with the lowest χ² values are merged together, since low χ²
values for a pair indicate similar class distributions
This merge process proceeds recursively until a predefined stopping criterion
is met (such as significance level, max-interval, max inconsistency, etc.)
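A minimal ChiMerge-style sketch following the description above (an illustration, not Kerber's original code; the stopping criterion here is simply a maximum number of intervals):

```python
import numpy as np
from itertools import groupby

def chi2_pair(a, b):
    """Chi-square statistic for the 2 x m table formed by two adjacent intervals."""
    table = np.array([a, b], dtype=float)
    expected = table.sum(axis=1, keepdims=True) @ table.sum(axis=0, keepdims=True) / table.sum()
    mask = expected > 0
    return (((table - expected) ** 2)[mask] / expected[mask]).sum()

def chimerge(values, labels, max_intervals=6):
    classes = sorted(set(labels))
    # Initially, each distinct value is its own interval, with per-class counts.
    intervals = []
    for v, grp in groupby(sorted(zip(values, labels)), key=lambda t: t[0]):
        counts = np.zeros(len(classes))
        for _, lab in grp:
            counts[classes.index(lab)] += 1
        intervals.append([v, counts])
    # Repeatedly merge the adjacent pair with the lowest chi-square value.
    while len(intervals) > max_intervals:
        chis = [chi2_pair(intervals[i][1], intervals[i + 1][1])
                for i in range(len(intervals) - 1)]
        i = int(np.argmin(chis))
        intervals[i][1] += intervals[i + 1][1]
        del intervals[i + 1]
    return [iv[0] for iv in intervals]   # lower boundaries of the final intervals
```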
Numeric Concept Hierarchy
A concept hierarchy for a given numerical attribute
defines a discretization of the attribute
Recursively reduce the data by collecting and
replacing low level concepts by higher level
concepts
Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data into
relatively uniform, "natural" intervals.
If an interval covers 3, 6, 7 or 9 distinct values at the most
significant digit, partition the range into 3 equi-width
intervals
If it covers 2, 4, or 8 distinct values at the most significant
digit, partition the range into 4 intervals
If it covers 1, 5, or 10 distinct values at the most significant
digit, partition the range into 5 intervals
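A simplified sketch of one level of this rule (the helper name is assumed; the inputs are range endpoints already rounded at the most significant digit, as in Step 2 of the example that follows):

```python
import math

def three_four_five(low, high):
    # Unit of the most significant digit of the range endpoints.
    msd = 10 ** math.floor(math.log10(max(abs(low), abs(high))))
    # Number of distinct values at the most significant digit.
    distinct = round((high - low) / msd)
    if distinct in (3, 6, 7, 9):
        k = 3
    elif distinct in (2, 4, 8):
        k = 4
    else:                       # 1, 5, or 10 distinct values
        k = 5
    width = (high - low) / k
    return [(low + i * width, low + (i + 1) * width) for i in range(k)]

# (-1000, 2000) covers 3 distinct values at the $1,000 digit -> 3 equi-width intervals.
print(three_four_five(-1000, 2000))  # [(-1000.0, 0.0), (0.0, 1000.0), (1000.0, 2000.0)]
```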
Example of 3-4-5 Rule
Step 1: From the profit data: Min = -$351, Low (i.e., 5th percentile) = -$159, High (i.e., 95th percentile) = $1,838, Max = $4,700
Step 2: msd = 1,000, so Low is rounded down to -$1,000 and High is rounded up to $2,000, giving the range (-$1,000, $2,000)
Step 3: The range covers 3 distinct values at the most significant digit, so it is partitioned into 3 equi-width intervals: (-$1,000, 0], (0, $1,000], ($1,000, $2,000]
Step 4: The boundary intervals are adjusted to the actual Min and Max: since Min = -$351 lies inside (-$1,000, 0], the first interval shrinks to (-$400, 0]; since Max = $4,700 exceeds $2,000, a new interval ($2,000, $5,000] is added. Each top-level interval is then partitioned recursively:
(-$400, 0] into (-$400, -$300], (-$300, -$200], (-$200, -$100], (-$100, 0]
(0, $1,000] into (0, $200], ($200, $400], ($400, $600], ($600, $800], ($800, $1,000]
($1,000, $2,000] into ($1,000, $1,200], ($1,200, $1,400], ($1,400, $1,600], ($1,600, $1,800], ($1,800, $2,000]
($2,000, $5,000] into ($2,000, $3,000], ($3,000, $4,000], ($4,000, $5,000]
Concept Hierarchy Generation for Categorical Data
Specification of a partial/total ordering of attributes explicitly
at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data
grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by the
analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
The attribute with the most distinct values is placed at the
lowest level of the hierarchy
Exceptions, e.g., weekday, month, quarter, year
country 15 distinct values
province_or_state 365 distinct values
city 3567 distinct values
street 674,339 distinct values
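A small pandas sketch of this heuristic (the file name and columns are assumed for illustration):

```python
import pandas as pd

df = pd.read_csv("locations.csv")   # assumed data set with the four attributes below
cols = ["street", "city", "province_or_state", "country"]
# Fewest distinct values -> highest (top) level of the hierarchy.
hierarchy = df[cols].nunique().sort_values()
print(hierarchy)   # e.g. country < province_or_state < city < street
```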
Summary
Data preparation or preprocessing is a big issue for both data
warehousing and data mining
Descriptive data summarization is needed for quality data
preprocessing
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
Many methods have been developed, but data preprocessing is
still an active area of research
Notes
In real-world applications, data preprocessing usually
accounts for about 70% of the workload in a data mining task.
Domain knowledge is usually required to do good data
preprocessing.
To improve the predictive performance of a model:
– Improve the learning algorithm (different algorithms,
different parameters)
Most data mining research focuses here
– Improve data quality through data preprocessing
This deserves more attention!