Data Warehousing and Data Mining
Data Preprocessing
 Content
 Why Data Preprocessing?
 Descriptive data summarization
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Why Data Preprocessing?
 Data in the real world is dirty
  incomplete: lacking attribute values, lacking certain attributes of
   interest, or containing only aggregate data
    E.g., Occupation=“ ” (missing data)
 noisy: containing errors or outliers that deviate from what is expected.
    E.g., Salary=“−10” (an error)
  inconsistent: containing discrepancies in codes or names, e.g.
    Age=“42”, Birthday=“03/07/2010”
    Was rating “1, 2, 3”, now rating “A, B, C”
    discrepancy between duplicate records
 No quality data, no quality mining results!
  Quality decisions must be based on quality data
  Data warehouse needs consistent integration of quality data
 Data Quality Measures
 Well-accepted multidimensional data quality measures are the following:
  Accuracy (no errors, no outliers)
      Reasons for inaccurate data: faulty devices, human error during
       entry, users submitting incorrect data (e.g., Jan 1 for birthday),
       etc.
    Completeness (no missing values)
    Consistency (no inconsistent values and attributes)
    Timeliness (appropriateness)
    Believability (acceptability)
    Interpretability (easy to understand)
 Descriptive data summarization
 A descriptive summary of the data can be generated with the help of
  measures of central tendency and measures of dispersion of the data
 Measures of central tendency include
    Mean
    Median
    Mode
    Mid-Range
 Measures of dispersion include
    range
    The five number summary (based on Quartiles)
    Interquartile range (IQR)
    Standard deviation
 Mean
 The mean is the sum of the values, divided by the
  total number of values.
 Appropriate for data distributed normally
 Mean is the most important quantity for describing a
  dataset, but it is sensitive to extreme values of an
  attribute (e.g., outliers)
 E.g. Find the mean: 20, 26, 40, 36, 23, 42, 35, 24, 30
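A minimal Python sketch of this example (values taken from the slide):

    values = [20, 26, 40, 36, 23, 42, 35, 24, 30]
    mean = sum(values) / len(values)   # 276 / 9 ≈ 30.7
    print(mean)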
 Median
 The median is the halfway point in a data set. Before
  you can find this point, the data must be arranged in
  order. When the data set is ordered, it is called a data
  array.
 E.g. The number of rooms in the seven hotels in
  downtown Pittsburgh is 713, 300, 618, 595, 311, 401,
  and 292. Find the median.
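A minimal Python sketch, using the standard statistics module:

    import statistics
    rooms = [713, 300, 618, 595, 311, 401, 292]
    # the ordered data array is 292, 300, 311, 401, 595, 618, 713; the middle value is 401
    print(statistics.median(rooms))    # 401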
 Mode
 The mode is the value that occurs most often in the
  data set. It is sometimes said to be the most typical
  case.
 A data set that has only one value that occurs with the
  greatest frequency is said to be unimodal.
 E.g. Find the mode of the signing bonuses of eight NFL
  players for a specific year. The bonuses in millions of
  dollars are 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10
 Since $10 million occurred 3 times—a frequency larger
  than any other number—the mode is $10 million.
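A minimal Python sketch of this example:

    import statistics
    bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
    print(statistics.mode(bonuses))    # 10 occurs three times, so the mode is 10 (million)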
 Mid-Range
 The midrange is a rough estimate of the middle. It is
  found by adding the lowest and highest values in the
  data set and dividing by 2. It is a very rough estimate
  of the average and can be affected by one extremely
  high or low value.
 E.g. In the last two winter seasons, the city of
  Brownsville, Minnesota, reported these numbers of
  water-line breaks per month. Find the midrange: 2, 3,
  6, 8, 4, 1
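A minimal Python sketch of this example:

    breaks = [2, 3, 6, 8, 4, 1]
    midrange = (min(breaks) + max(breaks)) / 2    # (1 + 8) / 2 = 4.5
    print(midrange)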
 Range
 The range is the highest value minus the lowest
  value. The symbol R is used for the range.
  R = highest value - lowest value
 Standard deviation
 The variance is the average of the squares of the distances of each
  value from the mean. The formula for the population variance is

      σ² = Σ(X − μ)² / N

 where
   X  an individual value
   μ  the population mean
   N  the population size
 The standard deviation σ is the square root of the variance.
   Example
 Find the variance and standard deviation for brand B. The
  months were: 35, 45, 30, 35, 40, 25
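A minimal Python sketch of the population variance and standard deviation for these values:

    months = [35, 45, 30, 35, 40, 25]
    n = len(months)
    mu = sum(months) / n                                # population mean = 35
    variance = sum((x - mu) ** 2 for x in months) / n   # 250 / 6 ≈ 41.7
    std_dev = variance ** 0.5                           # ≈ 6.5
    print(variance, std_dev)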
 Quartiles
 Quartiles divide the distribution into four groups,
  separated by Q1, Q2, Q3.
 Finding Data Values Corresponding to Q1, Q2, and
  Q3
  Step 1 Arrange the data in order from lowest to highest.
  Step 2 Find the median of the data values. This is the value
   for Q2.
  Step 3 Find the median of the data values that fall below Q2.
   This is the value for Q1.
  Step 4 Find the median of the data values that fall above Q2.
   This is the value for Q3.
        Quartiles: Example
 Find Q1, Q2, and Q3 for the data set 15, 13, 6, 5, 12, 50, 22, 18.
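A minimal Python sketch that follows the median-split procedure above:

    def median(vals):
        vals = sorted(vals)
        n, mid = len(vals), len(vals) // 2
        return vals[mid] if n % 2 else (vals[mid - 1] + vals[mid]) / 2

    data = sorted([15, 13, 6, 5, 12, 50, 22, 18])   # 5, 6, 12, 13, 15, 18, 22, 50
    q2 = median(data)                               # 14
    q1 = median(data[:len(data) // 2])              # median of the values below Q2 -> 9
    q3 = median(data[(len(data) + 1) // 2:])        # median of the values above Q2 -> 20
    print(q1, q2, q3)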
 IQR and Outliers
 Interquartile range (IQR) is defined as the difference
  between Q3 and Q1, i.e., IQR = Q3 − Q1
   It is used to identify outliers: extremely high or extremely low
    data values when compared with the rest of the data values.
 Check the data set for any data value that is greater
  than Q3 + 1.5·IQR or below Q1 − 1.5·IQR
 For the previous example data
     Q3 + 1.5·IQR = 36.5 and
     Q1 − 1.5·IQR = −7.5
 50 is outside this interval; hence, it can be considered
  an outlier.
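A minimal Python sketch of this outlier check, reusing the quartiles computed above:

    data = [5, 6, 12, 13, 15, 18, 22, 50]
    q1, q3 = 9, 20
    iqr = q3 - q1                                  # 11
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # -7.5 and 36.5
    print([x for x in data if x < low or x > high])   # [50] is flagged as an outlier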
 Major Tasks in Data Preprocessing
 Data cleaning
   Fill in missing values, smooth noisy data, identify or remove
    outliers, and resolve inconsistencies
 Data integration
   Integration of multiple databases, data cubes, or files
 Data reduction
   Dimensionality reduction
   Numerosity reduction
   Data compression
 Data transformation and data discretization
   Normalization
   Concept hierarchy generation
 Data Cleaning: Missing Data
 Causes for missing data
   equipment malfunction
   inconsistent with other recorded data and thus deleted
   data not entered due to lack of understanding
   certain data may not be considered important at the time of entry and
    hence left blank
   history or changes of the data were not registered
 Missing data may need to be inferred.
       Missing Data Example
Name          SSN           Address                 Phone #        Date         Acct Total
John Doe      111-22-3333   1 Main St, Bedford, Ma  111-222-3333   2/12/1999    2200.12
John W. Doe                 Bedford, Ma                            7/15/2000    12000.54
John Doe      111-22-3333                                          8/22/2001    2000.33
James Smith   222-33-4444   2 Oak St, Boston, Ma    222-333-4444   12/22/2002   15333.22
Jim Smith     222-33-4444   2 Oak St, Boston, Ma    222-333-4444                12333.66
Jim Smith     222-33-4444   2 Oak St, Boston, Ma    222-333-4444
  How to Handle Missing Data?
 Ignore the tuple: not effective when the percentage of
  missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
   a global constant : e.g., “unknown”, a new class?!
   the attribute mean
   the attribute mean for all samples belonging to the same class:
    smarter
   the most probable value: inference-based such as Bayesian
    formula or decision tree
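A minimal sketch of the mean-based fill-in options, assuming pandas is available and using a small hypothetical table (column names are illustrative):

    import pandas as pd

    df = pd.DataFrame({"cls": ["A", "A", "B", "B"],
                       "salary": [2200.0, None, 15333.0, 12333.0]})

    mean_fill = df["salary"].fillna(df["salary"].mean())          # attribute mean
    class_fill = df.groupby("cls")["salary"].transform(
        lambda s: s.fillna(s.mean()))                             # mean within the same class
    print(mean_fill.tolist(), class_fill.tolist())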
 Data Cleaning: Noisy Data
 Noise: random error or variance in a measured
  variable
 Incorrect attribute values may be due to
    faulty data collection instruments (e.g. OCR)
    data entry problems
    data transmission problems
    technology limitation
    inconsistency in naming convention
Data Cleaning: How to catch Noisy Data
  Manually check all data : tedious + infeasible?
  Sort data by frequency
   ‘green’ is more frequent than ‘rgeen’
   Works well for categorical data
  Use, say, numerical constraints to catch corrupt data
   Weight can’t be negative
   People can’t have more than 2 parents
   Salary can’t be less than Birr 300
  Use statistical techniques to catch corrupt data
   Check for outliers (the case of the 8 meters man)
   Check for correlated outliers using n-gram (“pregnant male”)
       People can be male
       People can be pregnant
       People can’t be male AND pregnant
 How to Handle Noisy Data?
 Binning
  first sort data and partition into bins
  Choose the number of bins (N) and do binning
    The bins can be equal-depth or equal-width
  then one can smooth by bin means, smooth by bin median,
   smooth by bin boundaries, etc.
 Regression
  smooth by fitting the data into regression functions
 Clustering
  detect and remove outliers
 Combined computer and human inspection
  detect suspicious values and check by human
 Binning
 Equal-width (distance) partitioning:
  It divides the range into N intervals of equal size: uniform grid
  if A and B are the lowest and highest values of the attribute,
   the width of intervals will be: W = (B-A)/N.
  The most straightforward
  But outliers may dominate presentation
  Skewed data is not handled well.
 Equal-depth (frequency) partitioning:
  It divides the range into N intervals, each containing
   approximately the same number of samples
  Good data scaling
  Managing categorical attributes can be tricky.
 Equal-width Example
 Given the data set (say 24, 21, 28, 8, 4, 26, 34, 21,
  29, 15, 9, 25)
  Determine the number of bins N (say 3)
  Sort the data as 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
  Determine the range R = Max – Min = 30
  Divide the range into N intervals of equal width, where the i-th bin is
   [Xi-1, Xi), X0 = Min, XN = Max, and Xi = Xi-1 + R/N (here R/N = 10)
  Hence X0 = 4, X1 = 14, X2 = 24, and X3 = 34
  Therefore:
      Bin 1 = 4,8,9
      Bin 2 = 15, 21, 21
      Bin3 = 24, 25, 26, 28, 29, 34
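A minimal Python sketch of this equal-width procedure:

    data = sorted([24, 21, 28, 8, 4, 26, 34, 21, 29, 15, 9, 25])
    N = 3
    lo, hi = min(data), max(data)
    width = (hi - lo) / N                        # (34 - 4) / 3 = 10
    bins = [[] for _ in range(N)]
    for x in data:
        i = min(int((x - lo) // width), N - 1)   # the last bin is closed on the right
        bins[i].append(x)
    print(bins)   # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]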
Equal-depth partitioning Example
 It divides the range into N intervals, each containing
  approximately the same number of samples
 Given the data set (say 24, 21, 28, 8, 4, 26, 34, 21,
  29, 15, 9, 25)
    Determine the number of bins : N (say 3)
    Determine the number of data elements F(F=12)
    Sort the data as 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
    Place F/N (12/3 = 4) element in order into the different bins
    Therefore:
      Bin 1 = 4,8,9 ,15
      Bin 2 = 21, 21,24, 25
      Bin3 = 26, 28, 29, 34
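A minimal Python sketch of equal-depth partitioning:

    data = sorted([24, 21, 28, 8, 4, 26, 34, 21, 29, 15, 9, 25])
    N = 3
    depth = len(data) // N                                    # 12 / 3 = 4 elements per bin
    bins = [data[i * depth:(i + 1) * depth] for i in range(N)]
    print(bins)   # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]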
 Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
 Partition into (equal-frequency) bins:
   - Bin 1: 4, 8, 15
    - Bin 2: 21, 21, 24
    - Bin 3: 25, 28, 34
 Smoothing by bin means:
    - Bin 1: 9, 9, 9
    - Bin 2: 22, 22, 22
    - Bin 3: 29, 29, 29
 Smoothing by bin boundaries(Each bin value is replaced by the closest
  boundary value) :
    - Bin 1: 4, 4, 15
    - Bin 2: 21, 21, 24
    - Bin 3: 25, 25, 34
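A minimal Python sketch of smoothing by bin means and by bin boundaries for these bins:

    bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

    # every value becomes its bin's mean
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
    # every value snaps to the nearer of its bin's minimum or maximum
    by_bounds = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b] for b in bins]

    print(by_means)    # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
    print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]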
 Activity
 Suppose a group of 12 sales price records has been
  sorted as follows: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92,
  204, 215
 Partition them into three bins by each of the following
  methods:
  (a) equal-frequency (equidepth) partitioning
  (b) equal-width partitioning
 Handling Noisy Data by Regression
 Smooth by fitting the data into regression functions
 Finding a fitting function for a variable using its relation with
  another variable (or variables):
 In this way, the missing value of the first variable can be predicted
  from the fitting function
   [Figure: regression line y = F(x) = x + 1; the dependent variable y is predicted from x (e.g., y1 from x1)]
Example: Clustering
  Data Cleaning as a Process
 Data discrepancy detection
   Use metadata (e.g., domain, range, dependency, distribution)
   Check field overloading
   Check uniqueness rule, consecutive rule and null rule
   Use commercial tools
     Data scrubbing: use simple domain knowledge (e.g., postal code,
      spell-check) to detect errors and make corrections
      Data auditing: analyzing data to discover rules and relationships in order to
       detect violators (e.g., correlation and clustering to find outliers)
 Data migration and integration
   Data migration tools: allow transformations to be specified
   ETL (Extraction/Transformation/Loading) tools: allow users to
    specify transformations through a graphical user interface
 Integration of the two processes
    Iterative and interactive (e.g., Potter’s Wheel)
 Data Integration
 Data integration:
  Combines data from multiple sources into a coherent store
 Because of the use of different sources, data that is fine on
  its own may become problematic when we want to integrate
  it.
 Some of the issues are:
  Different formats and structures
  Data at different levels
  Conflicting and redundant data
 Careful integration of the data from multiple sources may
  help reduce/avoid redundancies and inconsistencies and
  improve mining speed and quality
 Data Integration: Formats
 Not everyone uses the same format.
   Schema integration: e.g., A.cust-id ≡ B.cust-#
      Integrate metadata from different sources
 Dates are especially problematic:
    12/19/97
    19/12/97
    19/12/1997
    19-12-97
    Dec 19, 1997
    19 December 1997
    19th Dec. 1997
 Are you frequently writing money as:
  Birr 200, Br. 200, 200 Birr, …
     Data Integration: different structure
Source 1:   ID      Name                         City          State
            1234    Ministry of Transportation   Addis Ababa   AA

Source 2:   ID      Name                         City          State
            GCR34   Ministry of Finance          Addis Ababa   AA

Source 3:   Name                        ID       City          State
            Office of Foreign Affairs   GCR34    Addis Ababa   AA
 Data Integration: Data that Moves
 Be careful of taking snapshots of a moving target
 Example: Let’s say you want to store the price of a
  shoe in France, and the price of a shoe in Italy. Can
  we use same currency (say, US$) or country’s
  currency?
  You can’t store it all in the same currency (say, US$) because
   the exchange rate changes
  Price in foreign currency stays the same
  Must keep the data in foreign currency and use the current
   exchange rate to convert
 The same needs to be done for ‘Age’
  It is better to store ‘Date of Birth’ than ‘Age’
Data at different level of detail than needed
    If it is at a finer level of detail, you can sometimes bin it
      Example
         I need age ranges of 20-30, 30-40, 40-50, etc.
         Imported data contains birth date
         No problem! Divide data into appropriate categories
    Sometimes you cannot bin it
      Example
         I need age ranges 20-30, 30-40, 40-50 etc.
         Data is of age ranges 25-35, 35-45, etc.
         What to do?
           Ignore age ranges because you aren’t sure
           Make educated guess based on imported data (e.g., assume that # people
            of age 25-35 are average # of people of age 20-30 & 30-40)
Data Integration: Conflicting Data
 Detecting and resolving data value conflicts
   For the same real world entity, attribute values from different sources are
    different
   Possible reasons: different representations, different scales, e.g., metric
    vs. British units
       weight measurement: kg or pound
       height measurement: meter or inch
 Information source #1 says that Hussen lives in Dire Dawa
   Information source #2 says that Hussen lives in Harar
 What to do?
     Use both (He lives in both places)
     Use the most recently updated piece of information
     Use the “most trusted” information
     Flag row to be investigated further by hand
     Use neither (We’d rather be incomplete than wrong)
 Data Integration: Avoiding the Redundancy Issue
 Redundant data occur often during integration of
  multiple databases
  The same attribute may have different names in different
   databases
  One attribute may be a “derived” attribute in another table,
   e.g., annual revenue from monthly revenue
 Redundant data may be detected by correlation analysis and
  covariance analysis
Correlation Analysis (Numeric Data)
   Correlation coefficient:

      r(A,B) = Σ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σA σB) = (Σ aᵢbᵢ − n·Ā·B̄) / ((n − 1) σA σB)

   where n is the number of tuples, Ā and B̄ are the respective means of A and B,
    σA and σB are the respective standard deviations of A and B, and Σ aᵢbᵢ is the
    sum of the AB cross-product.
   If r(A,B) > 0, A and B are positively correlated (A’s values increase as B’s do).
    The higher the value, the stronger the correlation.
   r(A,B) = 0: independent; r(A,B) < 0: negatively correlated
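A minimal Python sketch of the simplified formula above, applied to two illustrative attribute lists:

    def correlation(a, b):
        n = len(a)
        mean_a, mean_b = sum(a) / n, sum(b) / n
        sd_a = (sum((x - mean_a) ** 2 for x in a) / (n - 1)) ** 0.5   # sample standard deviation
        sd_b = (sum((y - mean_b) ** 2 for y in b) / (n - 1)) ** 0.5
        cross = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
        return cross / ((n - 1) * sd_a * sd_b)

    print(correlation([2, 3, 5, 4, 6], [5, 8, 10, 11, 14]))   # ≈ 0.94: strongly positive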
  Covariance
 Covariance
   Covariance is similar to correlation:

      Cov(p, q) = Σ (pᵢ − p̄)(qᵢ − q̄) / n

   where n is the number of tuples, and p̄ and q̄ are the respective means of p and q
   It can be simplified in computation as:

      Cov(p, q) = (Σ pᵢqᵢ) / n − p̄·q̄

 Positive covariance: If Cov(p,q) > 0, then p and q both tend to be directly related.
 Negative covariance: If Cov(p,q) < 0, then p and q are inversely related.
 Independence: Cov(p,q) = 0
 Example
 Suppose two stocks A and B have the following values
 in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
 Question:     If the stocks are affected by the same
 industry trends, will their prices rise or fall together?
  E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
  E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
  Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
 Thus, A and B rise together since Cov(A, B) > 0.
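A minimal Python sketch of this worked example:

    A = [2, 3, 5, 4, 6]
    B = [5, 8, 10, 11, 14]
    n = len(A)
    mean_a, mean_b = sum(A) / n, sum(B) / n                        # 4 and 9.6
    cov = sum(a * b for a, b in zip(A, B)) / n - mean_a * mean_b   # 212/5 - 38.4 ≈ 4
    print(cov)   # positive, so the two stocks tend to rise together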
   Data Reduction Strategies
 Data reduction: Obtain a reduced representation of the data set that is
  much smaller in volume but yet produces the same (or almost the same)
  analytical results
 Why data reduction? — A database/data warehouse may store terabytes
  of data. Complex data analysis may take a very long time to run on the
  complete data set.
 Data reduction strategies
  Dimensionality reduction, e.g., remove unimportant attributes
      Wavelet transforms
      Principal Components Analysis (PCA)
      Attribute subset Selection
  Numerosity reduction (some simply call it: Data Reduction)
      Regression and Log-Linear Models
      Histograms, clustering, sampling
      Data cube aggregation
  Data compression
Data Reduction: Dimensionality Reduction
     Curse of dimensionality
      When dimensionality increases, data becomes increasingly
       sparse
     Dimensionality reduction
      Help eliminate irrelevant features and reduce noise
      Reduce time and space required in data mining
      Allow easier visualization
  Attribute Subset Selection
 Redundant attributes
   Duplicate much or all of the information contained in one or more other
    attributes
   E.g., purchase price of a product and the amount of sales tax paid
 Irrelevant attributes
   Contain no information that is useful for the data mining task at hand
   E.g., students' ID is often irrelevant to the task of predicting students'
    GPA
   Problem of irrelevant attributes: causing confusion for the mining
    algorithm employed
     Consequence: poor quality patterns, can slow down the mining process, etc.
 The “best” (and “worst”) attributes are typically determined using
  tests of statistical significance
Heuristic Search in Attribute Selection
     There are 2^d possible attribute combinations of d attributes
    Typical heuristic attribute selection methods:
       step-wise forward selection
       step-wise backward elimination
       combining forward selection and backward elimination
       decision-tree induction algorithm
    Step-wise forward selection
       Start with empty set
       The best single-feature is picked first
       Then next best feature will be selected conditioned by the first, ...
        Stop when the selected feature set closely represents the entire
         feature set
   Heuristic Search in Attribute Selection (cont’d)
 Step-wise backward elimination
  Start with all the feature set elements
  The feature which is most irrelevant will be discarded first
  Then next most irrelevant feature will be discarded and repeated, ...
   Stop when removing the next candidate attribute would affect the
    patterns significantly
 Combining forward selection and backward elimination
   At each step, the procedure selects the best feature and removes the
    most irrelevant one
 Decision-tree induction algorithm
   This algorithm generates a decision tree using some of the attributes
   The attributes used in building the decision tree are taken as the
    attributes that most closely represent the entire attribute set
Heuristic Search in Attribute Selection (Example)
      Numerosity reduction: Histogram Analysis
 Divide data into buckets and store the average (or sum) for each bucket
 Partitioning rules:
   Equal-width: equal bucket range (e.g., a width of $10)
   Equal-frequency (or equal-depth)
 [Figure: equal-width histogram of price values, with buckets from 10,000 to 100,000 and counts up to 40]
Numerosity reduction: Clustering
 Partition data set into clusters based on similarity, and
  store cluster representation (e.g., centroid and
  diameter) only
 Can be very effective if data is clustered but not if
  data is “smeared”
 There are many choices of clustering definitions and
  clustering algorithms
 Numerosity reduction: Sampling
 Obtaining a small sample s to represent the whole
  data set N
 Allow a mining algorithm to run in complexity that is
  potentially sub-linear to the size of the data
 Key principle: Choose a representative subset of the
  data using suitable sampling technique
      Types of Sampling
 Simple random sampling
  There is an equal probability of selecting any particular item
 Sampling without replacement
  Once an object is selected, it is removed from the population
 Sampling with replacement
  A selected object is not removed from the population
 Stratified sampling:
  Partition the data set, and draw samples from each partition
   (proportionally, i.e., approximately the same percentage of the data)
  Used in conjunction with skewed data
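A minimal Python sketch of these schemes, using the standard random module on a toy population (the strata are hypothetical):

    import random

    data = list(range(1, 101))                        # a toy population of 100 records
    srswor = random.sample(data, 10)                  # simple random sample WITHOUT replacement
    srswr = [random.choice(data) for _ in range(10)]  # simple random sample WITH replacement

    strata = {"young": list(range(1, 81)), "senior": list(range(81, 101))}
    stratified = [x for part in strata.values()
                  for x in random.sample(part, max(1, len(part) // 10))]   # ~10% per stratum
    print(srswor, srswr, stratified)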
Sampling: With or without Replacement
   [Figure: raw data reduced by simple random sampling without replacement (SRSWOR) and with replacement (SRSWR)]
Sampling: Cluster or Stratified Sampling
   [Figure: raw data partitioned into a cluster/stratified sample]
 Data reduction strategies: Data Cube Aggregation
 Aggregating data into a data cube and using it for the data mining task reduces
  the data set size significantly
 For example, one can aggregate sales amounts recorded per quarter into the total
  sales amount per year
 Multiple levels of aggregation in data cubes further reduce the size of
  data to deal with
 One should select appropriate levels of aggregation
 Use the most reduced representation which is sufficient to solve the
  task
  Data Transformation
 A function that maps the entire set of values of a given attribute to a
  new set of replacement values such that each old value can be
  identified with one of the new values
 Methods
  Smoothing: Remove noise from data
  Attribute/feature construction
     New attributes constructed from the given ones
  Aggregation: Summarization, data cube construction
  Normalization: Scaled to fall within a smaller, specified range
     min-max normalization
     z-score normalization
     normalization by decimal scaling
  Discretization: Concept hierarchy climbing
           Normalization
 The measurement unit used can affect the data analysis; e.g., changing
  from kg to pound may lead to very different results.
   expressing an attribute in smaller units will lead to a larger range for that
    attribute, and thus tend to give such an attribute greater effect or “weight”
 Normalization avoids dependence on the choice of measurement
  units, data to fall within common ranges such as [-1, 1] or [0.0, 1.0]
 Min-max normalization: to [new_minA, new_maxA]

      v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
   Min-max normalization preserves the relationships among the original data
    values. It will encounter an “out-of-bounds” error if a future input case for
    normalization falls outside of the original data range for A.
   E.g. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0].
    Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
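A minimal Python sketch of min-max normalization applied to the income example:

    def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
        return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

    print(min_max(73600, 12000, 98000))   # ≈ 0.716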
   Normalization(cont’d)
 Z-score normalization (μ: mean, σ: standard deviation):

      v' = (v − μ) / σ

   useful when the actual minimum and maximum of attribute A are unknown
   E.g. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to
    (73,600 − 54,000) / 16,000 = 1.225
 Normalization by decimal scaling:

      v' = v / 10^j   where j is the smallest integer such that Max(|v'|) < 1

   E.g. for recorded values ranging from −673 to 672, divide each value by 1,000 so
    that −673 normalizes to −0.673 and 672 to 0.672
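A minimal Python sketch of both formulas, reproducing the two examples above:

    def z_score(v, mu, sigma):
        return (v - mu) / sigma

    def decimal_scaling(values):
        j = 0
        while max(abs(v) for v in values) / (10 ** j) >= 1:   # smallest j with max |v'| < 1
            j += 1
        return [v / 10 ** j for v in values]

    print(z_score(73600, 54000, 16000))   # 1.225
    print(decimal_scaling([-673, 672]))   # [-0.673, 0.672]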
 Discretization
 Data discretization refers to transforming a data set, which is usually
  continuous, into discrete interval values
 Three types of attributes:
  Nominal — finite number of possible values, no ordering among
   values. E.g. Marital status( Single, married, widowed, and
   divorced)
  Ordinal — values from an ordered set. E.g. Size(big, med, small)
  Continuous — real numbers
 Discretization:
    divide the range of a continuous attribute into intervals
    Some classification algorithms only accept categorical attributes.
    Reduce data size by discretization
    Prepare for further analysis
 Data Discretization Methods
 Typical methods: All the methods can be applied
  recursively
  Binning
    Top-down split, unsupervised
  Histogram analysis
    Top-down split, unsupervised
  Clustering analysis (unsupervised, top-down split or bottom-up
   merge)
  Decision-tree analysis (supervised, top-down split)
   Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
            Concept Hierarchy Generation
 Concept hierarchy organizes concepts (i.e., attribute values) hierarchically
  and is usually associated with each dimension in a data warehouse
    Concept hierarchy formation: Recursively reduce the data by collecting and
     replacing low-level concepts (such as numeric values for age) by higher-level
     concepts (such as youth, adult, or senior)
 Concept hierarchies can be explicitly specified by domain experts and/or data
  warehouse designers
 [Figure: location hierarchy: country, region or state, city, sub-city, kebele]
• Concept hierarchy can be automatically formed by the analysis of the number of
  distinct values, e.g., for a set of attributes: {Kebele, city, state, country}
    For numeric data, use discretization methods.
                Assignment(Due Date: June 12)
Review 5+ literature sources (books and articles) & write a report (overview, significance, steps
involved, applications, review of 2+ related local and international research works, and
concluding remarks) and present it in class.
   1-7: meaning, why, its tasks & functions, steps followed, comparison, pros and cons, applications
   8-12: problem statement, methodology, results, findings, recommendation
1.    Data Warehouses, Data Mining and Business Intelligence
2.    Predictive Modeling
3.    Data Mining Models (like CRISP, Hybrid, & other models)
4.    Text Mining
5.    Web Mining
6.    Sentiment/opinion mining
7.    Log Mining
8.    Knowledge Mining
9.    Multimedia Data Mining
10.   Spatial Mining
11.   Review studies related to ‘Application of data mining in Finance’
12.   Review studies related to ‘Application of data mining in Insurance’
13.   Review studies related to ‘Application of data mining in Health’
14.    Review studies related to ‘Application of data mining in Agriculture’