DATA INTEGRATION &
DATA DISCRETIZATION
Data Mining Irhamah
Data Integration
Data integration ingests, transforms, and integrates structured and unstructured data and
delivers it to a scalable data warehouse platform, using traditional ETL (Extract,
Transform, Load) tools and methodologies to collect data from various sources into a
single data warehouse:
Data ingestion is the process of obtaining and importing data for immediate use
or storage in a database. To ingest something is to "take something in or absorb
something."
It includes both technical processes and business logic to transform data from disparate
sources into cohesive, meaningful data with quality, governance, and compliance
considerations.
It is the combination of technical and business processes used to combine data from
disparate sources into meaningful and valuable information. A complete data integration
solution delivers trusted data from a variety of sources.
It is the traditional domain of ETL (Extract, Transform and Load), which transforms and cleans the
data as it is extracted from various data sources and loaded into one data store
(data warehouse). For example, converting a single "address" variable into "street
address", "city", "state", and "zip code" fields.
Source: KDnuggets
Data Integration
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales (e.g.,
metric vs. British units, such as weight in kg vs. pounds)
Problems in Data Integration
Different attribute names
Different units: sales in $ vs. sales in Yen
Different scales: Rp, millions of Rp, billions of Rp
Derived attributes: monthly salary vs. annual salary
Problems in Data Integration (2)
The customer with customer-id 150 has 3 children in
relation1 and 4 children in relation2
The annual salary computed from the monthly salary in relation1
does not match the "annual-salary" attribute in relation2
Handling Redundancy in Data Integration
Redundant data often occur when multiple databases are integrated
Object identification: the same attribute or object may have different
names in different databases
Derivable data: one attribute may be a "derived" attribute in another
table, e.g., annual revenue
An attribute (a column or feature of the data set) is called redundant if it can
be derived from any other attribute or set of attributes. Inconsistencies in
attribute or dimension naming can also lead to redundancies in the data
set.
Redundant attributes may be detected by
correlation/covariance analysis or association analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining speed
and quality
Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson's product moment coefficient):

$$ r_{X,Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} $$

where n is the number of tuples, and $\bar{x}$ and $\bar{y}$ are the respective means of X and Y.
If r_{X,Y} > 0, X and Y are positively correlated (X's values increase
as Y's do); the higher the value, the stronger the correlation.
r_{X,Y} = 0: uncorrelated; r_{X,Y} < 0: negatively correlated
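A minimal sketch (not from the slides; the data and names are illustrative) of computing this coefficient to flag a possibly redundant numeric attribute:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation of two numeric attributes."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# annual_salary is exactly 12 * monthly_salary, so r = 1.0:
# a strong hint that one of the two attributes is redundant.
monthly_salary = np.array([3.0, 4.5, 2.8, 6.1, 5.0])
annual_salary = 12 * monthly_salary
print(pearson_r(monthly_salary, annual_salary))  # 1.0
```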
Association Analysis (Categorical Data)
The Chi-Square Test of Independence determines whether there is an
association between categorical variables (i.e., whether the variables are
independent or related). It is a nonparametric test.
Data Requirements
• Two categorical variables.
• Two or more categories (groups) for each variable.
• Independence of observations.
There is no relationship between the subjects in each group.
The categorical variables are not "paired" in any way (e.g. pre-
test/post-test observations).
• Relatively large sample size.
Expected frequencies for each cell are at least 1.
Expected frequencies should be at least 5 for the majority (80%) of the
cells.
Association Analysis (2)
The null hypothesis (H0) and alternative hypothesis (H1) of the Chi-Square
Test of Independence can be expressed in two different but equivalent
ways:
H0: "[Variable 1] is independent of [Variable 2]"
H1: "[Variable 1] is not independent of [Variable 2]"
OR
H0: "[Variable 1] is not associated with [Variable 2]"
H1: "[Variable 1] is associated with [Variable 2]―
Association Analysis (3)
Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)      200 (360)        450
Not like science fiction     50 (210)    1000 (840)       1050
Sum (col.)                  300          1200             1500
χ² (chi-square) calculation (the numbers in parentheses above are the expected
counts, calculated from the marginal totals of the two categories):

$$ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93 $$

This large value shows that like_science_fiction and play_chess are associated
(not independent).
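The same contingency table can be checked with SciPy; this is a sketch, not part of the original slides, and correction=False disables the Yates continuity correction so the result matches the manual calculation above:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250,  200],    # like science fiction
                     [ 50, 1000]])   # does not like science fiction
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~507.93, as in the manual calculation
print(expected)  # [[ 90. 360.] [210. 840.]] -- the counts in parentheses above
print(p)         # far below 0.05, so reject independence
```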
Example of Data Redundancy
We have a data set having three attributes- person_name, is_male, is_female.
is_male is 1 if the corresponding person is a male else it is 0 .
is_female is 1 if the corresponding person is a female else it is 0.
Observe that if a person is not male
(i.e., is_male is 0 for that person_name), then
the person is surely female (since there are only two
values in the output class: male and female). This implies that
the two attributes are perfectly correlated and one
attribute determines the other. Hence, one of these
attributes is redundant, and one of the two
can be dropped without any loss of information.
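A small sketch (the data are made up) showing how this redundancy appears as a perfect correlation in pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "person_name": ["A", "B", "C", "D"],
    "is_male":     [1, 0, 0, 1],
    "is_female":   [0, 1, 1, 0],
})
# Correlation of -1.0: one attribute completely determines the other.
print(df["is_male"].corr(df["is_female"]))
# Drop one of the two columns without losing any information.
df = df.drop(columns=["is_female"])
```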
Discretization
The concept is the same as a histogram:
• Divide the domain of a numeric attribute into intervals.
• Replace the attribute values with labels for those intervals.
Example:
– Dataset (age; salary):
(25; 30,000), (30; 80,000), (27; 50,000), (60; 70,000), (50; 55,000), (28; 25,000)
– Discretized dataset (age, discretizedSalary):
(25, low), (30, high), (27, medium), (60, high), (50, medium), (28, low)
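A sketch reproducing this example with pandas.cut; the bin edges 40,000 and 60,000 are assumed here only to recover the low/medium/high labels shown above:

```python
import pandas as pd

salary = pd.Series([30_000, 80_000, 50_000, 70_000, 55_000, 25_000])
discretized = pd.cut(salary,
                     bins=[0, 40_000, 60_000, float("inf")],
                     labels=["low", "medium", "high"])
print(list(discretized))  # ['low', 'high', 'medium', 'high', 'medium', 'low']
```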
Discretization
Three types of attributes:
Nominal — values from an unordered set, e.g., color, profession
Ordinal — values from an ordered set, e.g., military or academic
rank
Continuous — numeric values, e.g., integers or real numbers
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis
Discretization and Concept Hierarchy
Discretization
Reduce the number of values for a given continuous attribute by dividing the
range of the attribute into intervals
Interval labels can then be used to replace actual data values
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low level concepts
(such as numeric values for age) by higher level concepts (such as young,
middle-aged, or senior)
Discretization and Concept Hierarchy Generation for
Numeric Data
Typical methods: All the methods can be applied recursively
Binning
Histogram analysis
Clustering analysis
Entropy-based discretization: supervised, top-down split
Interval merging by χ² analysis: unsupervised, bottom-up merge
Segmentation by natural partitioning: top-down split, unsupervised
Binning
Bins represent the intervals into which you want to group
the source data (input data). The intervals must be consecutive, non-
overlapping, and usually of equal size.
Binning
Binning or discretization is the process of transforming numerical
variables into categorical counterparts.
An example is to bin values for Age into categories such as 20-39, 40-
59, and 60-79. Numerical variables are usually discretized in
modeling methods based on frequency tables (e.g., decision trees).
Moreover, binning may improve the accuracy of predictive models by
reducing noise or non-linearity. Finally, binning allows easy
identification of outliers and of invalid and missing values of numerical
variables.
Unsupervised Binning
Unsupervised binning methods transform numerical variables into
categorical counterparts but do not use the target (class)
information. Equal Width and Equal Frequency are two
unsupervised binning methods.
1- Equal Width Binning
The algorithm divides the data into k intervals of equal size.
The width of the intervals is
w = (max - min) / k
and the interval boundaries are
min + w, min + 2w, ..., min + (k-1)w
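A minimal equal-width binning sketch (the data and k are illustrative):

```python
import numpy as np

x = np.array([5, 7, 12, 35, 36, 38, 70, 90, 95])
k = 3
w = (x.max() - x.min()) / k              # w = (max - min) / k = 30
edges = x.min() + w * np.arange(k + 1)   # [ 5. 35. 65. 95.]
# Assign each value to a bin index 0 .. k-1 using the interior boundaries.
bins = np.digitize(x, edges[1:-1])
print(edges, bins)                       # bins: [0 0 0 1 1 1 2 2 2]
```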
Unsupervised Binning (2)
2- Equal Frequency Binning
The algorithm divides the data into k groups, each of which
contains approximately the same number of values. For both
methods, the best way to determine k is to look at the
histogram and try different numbers of intervals or groups.
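A minimal equal-frequency binning sketch with pandas.qcut (same illustrative data, k = 3):

```python
import pandas as pd

x = pd.Series([5, 7, 12, 35, 36, 38, 70, 90, 95])
groups = pd.qcut(x, q=3, labels=["low", "medium", "high"])
print(groups.value_counts())  # each group holds roughly the same number of values
```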
Unsupervised Binning (3)
3- Other Methods
Rank: the rank of a number is its size relative to the other values of a
numerical variable. First we sort the list of values, then we assign the
position of each value as its rank. Equal values receive the same rank,
but the presence of duplicate values affects the ranks of subsequent
values (e.g., 1, 2, 3, 3, 5). Rank is a solid binning method with one
major drawback: values can have different ranks in different lists.
Quantiles (median, quartiles, percentiles, ...): quantiles are also
very useful binning methods, but like rank, a value can fall into a
different quantile if the list of values changes.
Math functions: for example, FLOOR(LOG(X)) is an effective
binning method for numerical variables with highly skewed
distributions (e.g., income).
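Short sketches of the three transforms just mentioned (rank, quantiles, FLOOR(LOG(X))); the income values are made up:

```python
import numpy as np
import pandas as pd

income = pd.Series([900, 1200, 1200, 15_000, 52_000, 250_000])
print(income.rank(method="min"))           # ties share a rank: 1, 2, 2, 4, 5, 6
print(pd.qcut(income, q=4, labels=False))  # quartile number (0-3) for each value
print(np.floor(np.log10(income)))          # FLOOR(LOG(X)) for a skewed variable
```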
Supervised Binning
Supervised binning methods transform numerical variables into
categorical counterparts and refer to the target (class) information
when selecting discretization cut points. Entropy-based binning is an
example of a supervised binning method.
The entropy-based method uses a split approach. The entropy (or
information content) is calculated based on the class label.
The goal of the algorithm is to find the split with the maximum
information gain (so that the bins are as pure as possible, i.e., the
majority of the values in a bin have the same class label)
The boundary that minimizes the entropy over all possible boundaries is
selected
The process is recursively applied to partitions obtained until some
stopping criterion is met
Such a boundary may reduce data size and improve classification
accuracy
Example: Discretize the temperature variable using
entropy-based binning algorithm.
Step 1: Calculate "Entropy" for the target.
O-Ring Failure:  Y = 7,  N = 17
E(Failure) = E(7, 17) = E(0.29, 0.71) = -0.29 × log2(0.29) - 0.71 × log2(0.71) = 0.871
Step 2: Calculate "Entropy" for the target given a bin (T is the value used to split S into S1&S2)
O-Ring Failure counts by bin:  Temperature <= 60: Y = 3, N = 0;  Temperature > 60: Y = 4, N = 17
E(Failure, Temperature) = P(<=60) × E(3, 0) + P(>60) × E(4, 17) = 3/24 × 0 + 21/24 × 0.70 = 0.615
Step 3: Calculate "Information Gain" given a bin.
Difference in entropy between the original set (S) and the weighted
split (S1 + S2):
Information Gain(Failure, Temperature) = 0.871 - 0.615 = 0.256
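A sketch reproducing these three steps in Python (the class counts are taken directly from the example above):

```python
import numpy as np

def entropy(counts):
    """Entropy of a class-count vector, e.g. entropy([7, 17])."""
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

e_target = entropy([7, 17])                                      # Step 1: ~0.871
e_split = (3/24) * entropy([3, 0]) + (21/24) * entropy([4, 17])  # Step 2: ~0.615
print(e_target - e_split)                                        # Step 3: gain ~0.256
```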
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2
using boundary T, the information (weighted entropy) after partitioning is

$$ I(S, T) = \frac{|S_1|}{|S|}\,Entropy(S_1) + \frac{|S_2|}{|S|}\,Entropy(S_2) $$

Entropy is calculated based on the class distribution of the samples in the set. Given
m classes, the entropy of S1 is

$$ Entropy(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i) $$

where p_i is the probability of class i in S1
The boundary that minimizes the entropy function over all possible boundaries is
selected as a binary discretization
The process is recursively applied to partitions obtained until some stopping
criterion is met
Such a boundary may reduce data size and improve classification accuracy
Interval Merge by χ² Analysis
Merging-based (bottom-up) vs. splitting-based methods
Merge: Find the best neighboring intervals and merge them to form larger
intervals recursively
ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]
Initially, each distinct value of a numerical attr. A is considered to be one
interval
χ² tests are performed for every pair of adjacent intervals
Adjacent intervals with the lowest χ² values are merged together, since low χ²
values for a pair indicate similar class distributions
This merge process proceeds recursively until a predefined stopping criterion
is met (such as significance level, max-interval, max inconsistency, etc.)
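A minimal ChiMerge-style sketch following the description above (an illustration, not Kerber's original code; the stopping criterion here is simply a maximum number of intervals):

```python
import numpy as np
from itertools import groupby

def chi2_pair(a, b):
    """Chi-square statistic for the 2 x m table formed by two adjacent intervals."""
    table = np.array([a, b], dtype=float)
    expected = table.sum(axis=1, keepdims=True) @ table.sum(axis=0, keepdims=True) / table.sum()
    mask = expected > 0
    return (((table - expected) ** 2)[mask] / expected[mask]).sum()

def chimerge(values, labels, max_intervals=6):
    classes = sorted(set(labels))
    # Initially, each distinct value is its own interval, with per-class counts.
    intervals = []
    for v, grp in groupby(sorted(zip(values, labels)), key=lambda t: t[0]):
        counts = np.zeros(len(classes))
        for _, lab in grp:
            counts[classes.index(lab)] += 1
        intervals.append([v, counts])
    # Repeatedly merge the adjacent pair with the lowest chi-square value.
    while len(intervals) > max_intervals:
        chis = [chi2_pair(intervals[i][1], intervals[i + 1][1])
                for i in range(len(intervals) - 1)]
        i = int(np.argmin(chis))
        intervals[i][1] += intervals[i + 1][1]
        del intervals[i + 1]
    return [iv[0] for iv in intervals]   # lower boundaries of the final intervals
```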
Numeric Concept Hierarchy
A concept hierarchy for a given numerical attribute
defines a discretization of the attribute
Recursively reduce the data by collecting and
replacing low level concepts by higher level
concepts
Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data into
relatively uniform, "natural" intervals.
If an interval covers 3, 6, 7 or 9 distinct values at the most
significant digit, partition the range into 3 equi-width
intervals
If it covers 2, 4, or 8 distinct values at the most significant
digit, partition the range into 4 intervals
If it covers 1, 5, or 10 distinct values at the most significant
digit, partition the range into 5 intervals
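A simplified sketch of one level of this rule (the helper name is assumed; the inputs are range endpoints already rounded at the most significant digit, as in Step 2 of the example that follows):

```python
import math

def three_four_five(low, high):
    # Unit of the most significant digit of the range endpoints.
    msd = 10 ** math.floor(math.log10(max(abs(low), abs(high))))
    # Number of distinct values at the most significant digit.
    distinct = round((high - low) / msd)
    if distinct in (3, 6, 7, 9):
        k = 3
    elif distinct in (2, 4, 8):
        k = 4
    else:                       # 1, 5, or 10 distinct values
        k = 5
    width = (high - low) / k
    return [(low + i * width, low + (i + 1) * width) for i in range(k)]

# (-1000, 2000) covers 3 distinct values at the $1,000 digit -> 3 equi-width intervals.
print(three_four_five(-1000, 2000))  # [(-1000.0, 0.0), (0.0, 1000.0), (1000.0, 2000.0)]
```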
Example of 3-4-5 Rule
Step 1: From the profit data: Min = -$351, Low (i.e., 5th percentile) = -$159, High (i.e., 95th percentile) = $1,838, Max = $4,700
Step 2: msd = 1,000, so Low is rounded down to -$1,000 and High is rounded up to $2,000, giving the range (-$1,000, $2,000)
Step 3: The range covers 3 distinct values at the most significant digit, so it is partitioned into 3 equi-width intervals: (-$1,000, 0], (0, $1,000], ($1,000, $2,000]
Step 4: The boundary intervals are adjusted to the actual Min and Max: since Min = -$351 lies inside (-$1,000, 0], the first interval shrinks to (-$400, 0]; since Max = $4,700 exceeds $2,000, a new interval ($2,000, $5,000] is added. Each top-level interval is then partitioned recursively:
(-$400, 0] into (-$400, -$300], (-$300, -$200], (-$200, -$100], (-$100, 0]
(0, $1,000] into (0, $200], ($200, $400], ($400, $600], ($600, $800], ($800, $1,000]
($1,000, $2,000] into ($1,000, $1,200], ($1,200, $1,400], ($1,400, $1,600], ($1,600, $1,800], ($1,800, $2,000]
($2,000, $5,000] into ($2,000, $3,000], ($3,000, $4,000], ($4,000, $5,000]
Concept Hierarchy Generation for Categorical Data
Specification of a partial/total ordering of attributes explicitly
at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data
grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by the
analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
The attribute with the most distinct values is placed at the
lowest level of the hierarchy
Exceptions, e.g., weekday, month, quarter, year
country 15 distinct values
province_or_state 365 distinct values
city 3567 distinct values
street 674,339 distinct values
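A small pandas sketch of this heuristic (the file name and columns are assumed for illustration):

```python
import pandas as pd

df = pd.read_csv("locations.csv")   # assumed data set with the four attributes below
cols = ["street", "city", "province_or_state", "country"]
# Fewest distinct values -> highest (top) level of the hierarchy.
hierarchy = df[cols].nunique().sort_values()
print(hierarchy)   # e.g. country < province_or_state < city < street
```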
Summary
Data preparation or preprocessing is a big issue for both data
warehousing and data mining
Descriptive data summarization is needed for quality data
preprocessing
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
Many methods have been developed, but data preprocessing is
still an active area of research
Notes
In real-world applications, data preprocessing usually
accounts for about 70% of the workload in a data mining task.
Domain knowledge is usually required to do good data
preprocessing.
To improve the predictive performance of a model:
– Improve the learning algorithm (different algorithms,
different parameters)
Most data mining research focuses here
– Improve data quality through data preprocessing
This deserves more attention!