
Motivation

Data Preprocessing: tasks applied to obtain quality data prior to the use of knowledge extraction algorithms.

[Figure: the knowledge discovery process: Selection → Target data → Preprocessing → Processed data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
Objectives

• To understand the different problems to solve in the process of data preprocessing.
• To know the problems that arise when integrating data from different sources, and the sets of techniques to solve them.
• To know the problems related to cleaning data and mitigating imperfect data, together with some techniques to solve them.
• To understand the necessity of applying data transformation techniques.
• To know the data reduction techniques and the necessity of their application.
Data Preprocessing

1. Introduction. Data Preprocessing


2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
Bibliography:
S. García, J. Luengo, F. Herrera
Data Preprocessing in Data Mining
Springer, January 2015
Data Preprocessing in Data Mining

1. Introduction. Data Preprocessing


2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
INTRODUCTION
D. Pyle, 1999, p. 90:

“The fundamental purpose of data preparation
is to manipulate and transform raw data so
that the information content enfolded in the
data set can be exposed, or made more easily
accessible.”

Dorian Pyle
Data Preparation for Data
Mining Morgan Kaufmann
Publishers, 1999
Data Preprocessing
Importance of Data Preprocessing

1. Real data can be dirty and can lead to the
extraction of useless patterns/rules.

This is mainly due to:

• Incomplete data: lacking attribute values, …
• Noisy data: containing errors or outliers
• Inconsistent data (including discrepancies)
Data Preprocessing
Importance of Data Preprocessing

2. Data preprocessing can generate a smaller data set
than the original, which allows us to improve the
efficiency of the Data Mining process.

This includes Data Reduction techniques:
feature selection, sampling or instance selection,
discretization.
Data Preprocessing
Importance of Data Preprocessing

3. No quality data, no quality mining results!

Data preprocessing techniques generate “quality data”,
leading us to obtain “quality patterns/rules”.

Quality decisions must be based on
quality data!
Data Preprocessing

Data preprocessing consumes a very
significant part of the total time in a
data mining process.
Data Preprocessing
What is included in data preprocessing?

Real databases usually contain noisy data, missing data, and
inconsistent data, …

Major Tasks in Data Preprocessing

1. Data integration. Fusion of multiple sources into a Data
Warehouse.
2. Data cleaning. Removal of noise and inconsistencies.
3. Missing values imputation.
4. Data transformation.
5. Data reduction.
Data Preprocessing in Data Mining

1. Introduction. Data Preprocessing


2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
Integration, Cleaning and Transformation

Data Integration

• Obtain data from different information sources.
• Address problems of encoding and representation.
• Integrate data from different tables to produce
homogeneous information, ...

[Figure: Database 1 and Database 2 feed, via extraction and aggregation, into a Data Warehouse server]
Data Integration
Examples
• Different scales: salary in dollars versus euros (€)
• Derived attributes: monthly salary versus annual salary

item   Salary/month        item   Salary/year
1      5000                6      50,000
2      2400                7      100,000
3      3000                8      40,000
Data Cleaning

Objectives:
• Fix inconsistencies
• Fill/impute missing values
• Smooth noisy data
• Identify or remove outliers …

Some Data Mining algorithms have their own methods to
deal with incomplete or noisy data. But in general, these
methods are not very robust. It is usual to perform
data cleaning prior to applying them.
Bibliography:
W. Kim, B. Choi, E.-D. Hong, S.-K. Kim
A taxonomy of dirty data.
Data Mining and Knowledge Discovery 7, 81-99, 2003.

Data Cleaning

Data cleaning: Example


Original Data
000000000130.06.19971979-10-3080145722 #000310 111000301.01.000100000000004
0000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.0000
00000000000. 000000000000000.000000000000000.0000000...…
000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.00
0000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.0000
00000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000
000000000.000000000000000.000000000000000.000000000000000.00 0000000000300.00 0000000000300.00

Clean Data

0000000001,199706,1979.833,8014,5722 , ,#000310 ….
,111,03,000101,0,04,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,0300.00

Data Cleaning

Data Cleaning: Inconsistent data

Age=“42”
Birth Date=“03/07/1997”

Data transformation
Objective: to transform the data into the form best
suited to the application of Data Mining algorithms.

Some typical operations:

• Aggregation. E.g., summing all the monthly sales into a
single attribute called annual sales, …
• Data generalization. Obtaining higher-level data from
the currently available values, by using concept hierarchies.
  streets → cities
  numerical age → {young, adult, middle-aged, old}
• Normalization: change the range to [-1,1] or [0,1].
• Linear, quadratic, polynomial transformations, …
Bibliography:
T. Y. Lin. Attribute Transformation for Data Mining I: Theoretical
Explorations. International Journal of Intelligent Systems 17, 213-222, 2002.
Normalization

Objective: convert the values of an attribute to a better
range.

Useful for some techniques such as Neural Networks or
distance-based methods (k-Nearest Neighbors, …).
Some normalization techniques:

• Z-score normalization:
  $v' = \dfrac{v - \mu_A}{\sigma_A}$
  where $\mu_A$ and $\sigma_A$ are the mean and standard deviation of attribute $A$.

• Min-max normalization: performs a linear transformation of the
original data from $[min_A, max_A]$ to $[newmin_A, newmax_A]$:
  $v' = \dfrac{v - min_A}{max_A - min_A}\,(newmax_A - newmin_A) + newmin_A$

The relationships among the original data values are maintained.
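As a small illustration, here is a minimal NumPy sketch of both normalizations applied to a hypothetical attribute (the values and the target range are made up for the example):

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # hypothetical attribute values

# Min-max normalization: linear mapping from [min_A, max_A] to [new_min, new_max].
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: center on the mean and scale by the standard deviation.
v_zscore = (v - v.mean()) / v.std()

print(v_minmax)   # values rescaled into [0, 1]
print(v_zscore)   # values with zero mean and unit variance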
Data Preprocessing in Data Mining

1. Introduction. Data Preprocessing


2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
Imperfect data

Missing values

The following options can be used, although some of
them may skew the data:
• Ignore the tuple. Usually done when the variable to
classify (the class label) is missing.
• Use a global constant as the replacement, e.g.
“unknown”, “?”, …
• Fill in the value using the mean of the rest of the
tuples.
• Fill in the value using the mean of the rest of the
tuples belonging to the same class.
• Impute the most probable value. For this, some
inference technique can be used, e.g. Bayesian methods or
decision trees.

A small code sketch of the mean-based options follows.
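A minimal pandas sketch of the constant, mean and per-class mean options (the column names and values are hypothetical):

import numpy as np
import pandas as pd

# Toy data with missing salaries (hypothetical attribute and class names).
df = pd.DataFrame({
    "salary": [5000, np.nan, 3000, np.nan, 2400, 4100],
    "class":  ["A",  "A",    "B",  "B",    "A",  "B"],
})

# Global constant replacement (e.g. a sentinel such as -1 or "unknown").
filled_const = df["salary"].fillna(-1)

# Fill with the mean of the rest of the tuples.
filled_mean = df["salary"].fillna(df["salary"].mean())

# Fill with the mean of the tuples belonging to the same class.
filled_class_mean = df.groupby("class")["salary"].transform(
    lambda s: s.fillna(s.mean()))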
Missing values

15 imputation methods are implemented in KEEL:
http://www.keel.es/
Missing values

Bibliography:
WEBSITE: http://sci2s.ugr.es/MVDM/
J. Luengo, S. García, F. Herrera, A Study on the Use of Imputation Methods for
Experimentation with Radial Basis Function Network Classifiers Handling
Missing Attribute Values: The good synergy between RBFs and EventCovering
method. Neural Networks, doi:10.1016/j.neunet.2009.11.014, 23(3) (2010) 406-418.

S. García, F. Herrera, On the choice of the best imputation methods for missing
values considering three groups of classification methods. Knowledge and
Information Systems 32:1 (2012) 77-108, doi:10.1007/s10115-011-0424-2

Noise cleaning
Types of examples

Fig. 5.2 The three types of examples considered in this book: safe examples (labeled as
s), borderline examples (labeled as b) and noisy examples (labeled as n). The
continuous line shows the decision boundary between the two classes

Noise cleaning

Fig. 5.1 Examples of the interaction between classes: a) small
disjuncts and b) overlapping between classes
Noise cleaning

Use of noise filtering techniques in classification

The three noise filters described next, which are among the best
known, use a voting scheme to determine which cases have to be
removed from the training set:

• Ensemble Filter (EF)
• Cross-Validated Committees Filter (CVCF)
• Iterative-Partitioning Filter (IPF)
Ensemble Filter (EF)
• C.E. Brodley, M.A. Friedl. Identifying Mislabeled Training Data. Journal of Artificial Intelligence Research 11
(1999) 131‐167.
• Different learning algorithms (C4.5, 1‐NN and LDA) are used to create classifiers on several subsets of the
training data; these classifiers serve as noise filters for the training set.
• Two main steps:
1. For each learning algorithm, a k‐fold cross‐validation is used to tag each training example as correct
(prediction = training data label) or mislabeled (prediction ≠ training data label).
2. A voting scheme is used to identify the final set of noisy examples.
• Consensus voting: it removes an example if it is misclassified by all the classifiers.
• Majority voting: it removes an instance if it is misclassified by more than half of the classifiers.

[Figure: each of the m classifiers tags every training example as correct or mislabeled; a voting scheme
(consensus or majority) over these m classifications identifies the noisy examples]
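A possible sketch of EF with scikit-learn, assuming X and y are NumPy arrays; CART (DecisionTreeClassifier) stands in for C4.5, which scikit-learn does not provide:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def ensemble_filter(X, y, k=5, scheme="consensus"):
    """Return the indices of the examples flagged as noisy by EF."""
    y = np.asarray(y)
    classifiers = [DecisionTreeClassifier(),            # CART, standing in for C4.5
                   KNeighborsClassifier(n_neighbors=1),
                   LinearDiscriminantAnalysis()]
    votes = np.zeros(len(y), dtype=int)
    for clf in classifiers:
        pred = cross_val_predict(clf, X, y, cv=k)       # k-fold tagging of every example
        votes += (pred != y)                            # one vote = "mislabeled"
    if scheme == "consensus":
        noisy = votes == len(classifiers)               # misclassified by all classifiers
    else:                                               # majority voting
        noisy = votes > len(classifiers) / 2
    return np.where(noisy)[0]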
Cross‐Validated Committees Filter (CVCF)
• S. Verbaeten, A.V. Assche. Ensemble methods for noise elimination in
classification problems. 4th International Workshop on Multiple Classifier Systems
(MCS 2003). LNCS 2709, Springer 2003, Guilford (UK, 2003) 317‐325.

• CVCF is similar to EF, with two main differences:

1. The same learning algorithm (C4.5) is used to create classifiers on several
subsets of the training data.

   The authors of CVCF place special emphasis on using ensembles of decision
trees such as C4.5 because they work well as a filter for noisy data.

2. Each classifier built with the k‐fold cross‐validation is used to tag ALL the
training examples (not only the test fold) as correct (prediction = training data
label) or mislabeled (prediction ≠ training data label).
Iterative Partitioning Filter (IPF)
• T.M. Khoshgoftaar, P. Rebours. Improving software quality prediction by noise filtering
techniques. Journal of Computer Science and Technology 22 (2007) 387‐396.
• IPF removes noisy data in multiple iterations using CVCF until a stopping criterion is reached.
• The iterative process stops if, for a number of consecutive iterations, the number of noisy
examples in each iteration is less than a percentage of the size of the training dataset.

[Figure: IPF loop: the current training data is passed through the CVCF filter, the examples it flags are removed,
and the process repeats until the stop criterion is met; the accumulated flagged examples form the final set of
noisy examples]
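A sketch of IPF's outer loop only, with the CVCF step abstracted as a callable base_filter(X, y) that returns the indices it flags; the parameter names p and s are illustrative choices, not taken from the original paper:

import numpy as np

def iterative_partitioning_filter(X, y, base_filter, p=0.01, s=3):
    """Repeatedly apply a CVCF-style filter and remove the flagged examples,
    stopping once the number of flagged examples has been below
    p * |original training set| for s consecutive iterations."""
    X_cur, y_cur = np.asarray(X), np.asarray(y)
    n_original = len(y_cur)
    quiet_rounds = 0
    while quiet_rounds < s:
        noisy_idx = base_filter(X_cur, y_cur)            # indices flagged as noisy
        quiet_rounds = quiet_rounds + 1 if len(noisy_idx) < p * n_original else 0
        keep = np.setdiff1d(np.arange(len(y_cur)), noisy_idx)
        X_cur, y_cur = X_cur[keep], y_cur[keep]          # drop the flagged examples
    return X_cur, y_cur                                  # cleaned training data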


Noise cleaning

http://www.keel.es/
Data Preprocessing in Data Mining

1. Introduction. Data Preprocessing


2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
Data Reduction

Feature Selection

The problem of Feature Subset Selection (FSS) consists of
finding a subset of the attributes/features/variables of the
data set that optimizes the probability of success in the
subsequent data mining task.
Feature Selection

Var. 1. Var. 5 Var. 13

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
A 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
B 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
C 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
D 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
E 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0
F 1 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0
Feature Selection

The problem of Feature Subset Selection (FSS) consists of
finding a subset of the attributes/features/variables of the
data set that optimizes the probability of success in the
subsequent data mining task.

Why is feature selection necessary?

More attributes do not mean more success in the data


mining process.
Working with less attributes reduces the complexity of the
problem and the running time.
With less attributes, the generalization capability increases.
The values for certain attributes may be difficult and costly
to obtain.
Feature Selection

The outcome of FS would be:

• Less data → algorithms can learn more quickly
• Higher accuracy → the algorithm generalizes better
• Simpler results → easier to understand

Feature extraction and feature construction are
extensions of FS.
Feature Selection

Fig. 7.1 Search space for FS (from the complete set of features to the empty set of features)
Feature Selection

It can be considered as a search problem


{}

{1} {2} {3} {4}

{1,2} {1,3} {2,3} {1,4} {2,4} {3,4}

{1,2,3} {1,2,4} {1,3,4} {2,3,4}

{1,2,3,4}

Feature Selection

Process

[Figure: FS process: a subset generation step (SG) produces a candidate feature subset from the target data, an
evaluation function (EC) scores it, and the loop repeats until the stop criterion is met, returning the selected
subset]
Feature Selection

Goal functions: there are two different approaches

• Filter. The goal function evaluates the subsets based on
the information they contain. Measures of class
separability, statistical dependences, information theory, …
are used as the goal function.

• Wrapper. The goal function consists of applying the
same learning technique that will be used later over the
data resulting from the selection of the features. The
returned value is usually the accuracy rate of the
constructed classifier.
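As an illustration of the wrapper approach, the sketch below uses scikit-learn's SequentialFeatureSelector (assumed available, version 0.24 or later) with a k-NN classifier as both the goal function and the final learner; the data set and parameter values are arbitrary:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Wrapper: each candidate subset is scored by the cross-validated accuracy
# of the same learner that will be used afterwards.
knn = KNeighborsClassifier(n_neighbors=5)
sfs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="forward", scoring="accuracy", cv=5)
sfs.fit(X, y)
print(sfs.get_support())        # boolean mask of the selected features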
Feature Selection

Process

Fig. 7.2 A filter model for FS


Feature Selection

Filtering measures

• Separability measures. They estimate the separability among
classes: Euclidean, Mahalanobis, …
E.g., in a two-class problem, an FS process based on this kind of measure
determines that X is better than Y if X induces a greater difference than
Y between the two class-conditional probabilities.

• Correlation. Good subsets are those highly correlated with the class
variable:

  $f(X_1, \ldots, X_M) = \dfrac{\sum_{i=1}^{M} \rho_{ic}}{\sum_{i=1}^{M} \sum_{j=i+1}^{M} \rho_{ij}}$

where $\rho_{ic}$ is the correlation coefficient between the variable $X_i$ and the
label $c$ of the class ($C$), and $\rho_{ij}$ is the correlation coefficient between $X_i$
and $X_j$.
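A minimal NumPy sketch of this merit for a candidate subset, assuming numeric features and a numeric (or label-encoded) class; the helper name is ours:

import numpy as np

def correlation_merit(X, y):
    """Ratio between the feature-class correlations and the pairwise
    feature-feature correlations of the columns of X (absolute values)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    M = X.shape[1]
    rho_ic = sum(abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(M))
    rho_ij = sum(abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                 for i in range(M) for j in range(i + 1, M))
    return rho_ic / (rho_ij + 1e-12)    # small constant avoids division by zero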
Feature Selection

Information theory based measures

Correlation can only estimate linear dependences. A more powerful
measure is the mutual information $I(X_{1,\ldots,M}; C)$:

  $f(X_{1,\ldots,M}) = I(X_{1,\ldots,M}; C) = H(C) - H(C \mid X_{1,\ldots,M})
   = \sum_{c=1}^{C} \int_{X_{1,\ldots,M}} P(X_{1,\ldots,M}, \omega_c) \log \dfrac{P(X_{1,\ldots,M}, \omega_c)}{P(X_{1,\ldots,M})\,P(\omega_c)} \, dx$

where $H$ denotes the entropy and $\omega_c$ the $c$-th label of the class $C$.
Mutual information measures how much the uncertainty about the class $C$
decreases when the values of the vector $X_{1,\ldots,M}$ are known.
Due to the complexity of computing $I$, it is usual to use heuristic
rules, such as

  $f(X_{1,\ldots,M}) = \sum_{i=1}^{M} I(X_i; C) - \beta \sum_{i=1}^{M} \sum_{j=i+1}^{M} I(X_i; X_j)$

with $\beta = 0.5$, for example.
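A sketch of this heuristic merit using scikit-learn's mutual_info_score, assuming the features have already been discretized (the function name is ours):

import numpy as np
from itertools import combinations
from sklearn.metrics import mutual_info_score

def mi_merit(X_disc, y, beta=0.5):
    """Sum of I(Xi; C) minus beta times the sum of pairwise I(Xi; Xj),
    for the discrete feature matrix X_disc and class labels y."""
    X_disc = np.asarray(X_disc)
    M = X_disc.shape[1]
    relevance = sum(mutual_info_score(X_disc[:, i], y) for i in range(M))
    redundancy = sum(mutual_info_score(X_disc[:, i], X_disc[:, j])
                     for i, j in combinations(range(M), 2))
    return relevance - beta * redundancy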
Feature Selection
Consistency measures

The three previous groups of measures try to find the
features that can predict the class as well as possible,
better than the remaining ones.
• This approach cannot distinguish between two attributes that
are equally appropriate, and it does not detect redundant features.

Consistency measures try to find a minimum number of
features that are able to separate the classes in the same
way that the original data set does.
Feature Selection

Process

Fig. 7.2 A wrapper model for FS


Feature Selection

Process

Fig. 7.2 A filter model for FS


Feature Selection

Advantages

Wrappers:
• Accuracy: generally, they are more accurate than filters,
due to the interaction between the classifier used in the goal
function and the training data set.
• Generalization capability: they are able to avoid
overfitting due to the validation techniques employed.

Filters:
• Fast: they usually compute frequencies, which is much quicker
than training a classifier.
• Generality: since they evaluate intrinsic properties of the
data and not their interaction with a classifier, they can be
used in any problem.
Feature Selection
Drawbacks

Wrappers:
• Very costly: each evaluation requires learning and
validating a model. This is prohibitive for complex classifiers.
• Ad-hoc solutions: the solutions are biased towards the
classifier used.

Filters:
• Tendency to include many variables: normally, this is due to
monotone features in the goal function used.
  • The user should set the threshold to stop.
Feature Selection

Categories

1. According to the evaluation: filter, wrapper
2. According to class availability: supervised, unsupervised
3. According to the search: complete O(2^N), heuristic O(N^2), random
4. According to the outcome: ranking, subset of features
Feature Selection

Algorithms for obtaining a subset of features

They return a subset of attributes optimized according
to an evaluation criterion.

Input: x attributes, U evaluation criterion

Subset = {}
Repeat
  Sk = generateSubset(x)
  if improvement(Subset, Sk, U)
    Subset = Sk
Until StopCriterion()

Output: Subset, the most relevant attributes
Feature Selection

Ranking algorithms
They return a list of attributes sorted by an evaluation
criterion (a runnable sketch follows below).

Input: x attributes, U evaluation criterion

List = {}
For each attribute xi, i ∈ {1,...,N}
  vi = compute(xi, U)
  insert xi into List according to vi

Output: List, most relevant attributes first
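A runnable Python version of the ranking scheme above, where the criterion U is passed in as a per-attribute scoring function; the example criterion, absolute correlation with the class, is just one possible choice:

import numpy as np

def rank_features(X, y, criterion):
    """Score every attribute with the evaluation criterion U and return the
    attribute indices sorted from most to least relevant, plus their scores."""
    X = np.asarray(X)
    scores = np.array([criterion(X[:, i], y) for i in range(X.shape[1])])
    ranking = np.argsort(scores)[::-1]
    return ranking, scores

# Example usage with a simple filter criterion:
# ranking, scores = rank_features(X, y,
#                                 lambda x, c: abs(np.corrcoef(x, c)[0, 1]))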
Feature Selection

Ranking algorithms

Attributes A1 A2 A3 A4 A5 A6 A7 A8 A9
Ranking A5 A7 A4 A3 A1 A8 A6 A2 A9

A5 A7 A4 A3 A1 A8 (6 attributes)
Feature Selection

Some relevant algorithms:

Focus algorithm. Consistency measure for forward search


Mutual Information based Feature Selection (MIFS).
mRMR: Minimum Redundancy Maximum Relevance
Las Vegas Filter (LVF)
Las Vegas Wrapper (LVW)
Relief Algorithm
Instance Selection

Instance selection tries to choose the examples which are
relevant to an application, achieving the maximum
performance. The outcome of IS would be:

• Less data → algorithms learn more quickly
• Higher accuracy → the algorithm generalizes better
• Simpler results → easier to understand

IS has as an extension the generation of instances
(prototype generation).
Instance Selection

Different size examples

8000 points 2000 points 500 points


Instance Selection

Sampling

Raw data
Instance Selection

Sampling
Raw Data Simple reduction
Instance Selection

Training Prototype
Data Set Selection Instances
(TR) Algorithm Selected (S)

Test Instance-based
Data Set Classifier
(TS)

Fig. 8.1 PS process


Instance Selection

Prototype Selection (instance-based learning)

Properties:

Direction of the search: Incremental, decremental,


batch, hybrid or fixed.

Selection type: Condensation, Edition, Hybrid.

Evaluation type: Filter or wrapper.


Instance Selection

A pair of classical algorithms:

Classical algorithm of condensation: Condensed Nearest Neighbor (CNN)


Incremental
It only inserts the misclassified instances in the new subsets.
Dependant on the order of presentation.
It only retains borderline examples.
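A compact sketch of CNN with NumPy and scikit-learn's 1-NN classifier; the random seed only fixes which instance of each class seeds the store:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cnn_selection(X, y, seed=0):
    """Return the indices of the instances retained by Condensed NN."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    # Start the store with one random instance of each class.
    store = [int(rng.choice(np.where(y == c)[0])) for c in np.unique(y)]
    changed = True
    while changed:                      # repeat passes until nothing is added
        changed = False
        knn = KNeighborsClassifier(n_neighbors=1).fit(X[store], y[store])
        for i in range(len(y)):
            if i in store:
                continue
            if knn.predict(X[i:i + 1])[0] != y[i]:
                store.append(i)         # insert only the misclassified instance
                knn = KNeighborsClassifier(n_neighbors=1).fit(X[store], y[store])
                changed = True
    return np.array(sorted(store))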
Instance Selection

A pair of classical algorithms:

Classical algorithm for edition: Edited Nearest Neighbor (ENN)
• Batch.
• It removes those instances which are wrongly classified by a k-nearest
neighbor scheme (k = 3, 5 or 9).
• It “smooths” the borders between classes, while retaining the rest of the points.
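ENN can be sketched with a leave-one-out k-NN prediction: every instance is classified by its k nearest neighbours among the remaining instances, and the misclassified ones are removed (scikit-learn assumed available):

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

def enn_selection(X, y, k=3):
    """Return the indices of the instances retained by Edited NN."""
    y = np.asarray(y)
    pred = cross_val_predict(KNeighborsClassifier(n_neighbors=k), X, y,
                             cv=LeaveOneOut())
    return np.where(pred == y)[0]       # keep only correctly classified instances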
Instance Selection

Graphical illustrations:

Banana data set with 5,300 instances and two classes. Subsets
obtained with CNN and AllKNN (iterative application of ENN with
k = 3, 5 and 7).
Instance Selection

Graphical illustrations:

RMHC is an adaptive sampling technique based on local search with
a fixed final rate of retention.
DROP3 is the best-known hybrid technique, widely used with kNN.
SSMA is an evolutionary approach based on memetic algorithms.
Instance Selection

Training Set Selection


Example Instance Selection
and Decision Tree modeling

Kdd Cup’99. Strata Number: 100


             No. Rules   % Reduction   %Ac Trn   %Ac Test
C4.5         252         n/a           99.97%    99.94%
Cnn Strat    83          81.61%        98.48%    96.43%
Drop1 Strat  3           99.97%        38.63%    34.97%
Drop2 Strat  82          76.66%        81.40%    76.58%
Drop3 Strat  49          56.74%        77.02%    75.38%
Ib2 Strat    48          82.01%        95.81%    95.05%
Ib3 Strat    74          78.92%        99.13%    96.77%
Icf Strat    68          23.62%        99.98%    99.53%
CHC Strat    9           99.68%        98.97%    97.53%
Bibliography: J.R. Cano, F. Herrera, M. Lozano, Evolutionary Stratified Training Set Selection for Extracting Classification
Rules with Trade-off Precision-Interpretability. Data and Knowledge Engineering 60 (2007) 90-108,
doi:10.1016/j.datak.2006.01.008.
Instance Selection

WEBSITE:
http://sci2s.ugr.es/pr/index.php
Bibliography:
S. García, J. Derrac, J.R. Cano and F. Herrera,
Prototype Selection for Nearest Neighbor Classification: Taxonomy and Empirical Study.
IEEE Transactions on Pattern Analysis and Machine Intelligence 34:3 (2012) 417-435 doi:
10.1109/TPAMI.2011.142
S. García, J. Luengo, F. Herrera. Data Preprocessing in Data Mining, Springer, 15, 2015

Source Codes (Java):


Discretization

Discrete values are very useful in Data Mining.

They represent more concise information, they are
easier to understand and closer to a knowledge-level
representation.

Discretization is focused on the transformation of
continuous values, which have an ordering among them, into
nominal/categorical values without ordering. It is also a
quantification of numerical attributes.

Nominal values lie within a finite domain, so discretization is
also considered a data reduction technique.
Discretization

Divide the range of numerical (continuous or not) attributes
into intervals.
Store the labels of the intervals.
Discretization is crucial for association rules and for some
classification algorithms, which only accept discrete data.

Age              5 6 6 9 … 15 | 16 16 17 20 … 24 | 25 41 50 65 … 67
Owner of a car   0 0 0 0 … 0  | 1  0  1  1  … 0  | 1  1  1  1  … 1

AGE ∈ [5,15]      AGE ∈ [16,24]      AGE ∈ [25,67]

Discretization

Stages in the discretization process


Discretization

Discretization has been developed along several lines
according to different necessities:

• Supervised vs. unsupervised: whether or not they
consider the objective (class) attribute.
• Dynamic vs. static: whether discretization is performed
simultaneously with the building of the model or not.
• Local vs. global: whether they consider a subset of the
instances or all of them.
• Top-down vs. bottom-up: whether they start with an
empty list of cut points (adding new ones) or with all the
possible cut points (merging them).
• Direct vs. incremental: whether they make all decisions
at once or one by one.
Discretization

Unsupervised algorithms:
• Equal width
• Equal frequency
• Clustering …

Supervised algorithms:
• Entropy based [Fayyad & Irani 93 and others]
  [Fayyad & Irani 93] U.M. Fayyad and K.B. Irani. Multi-interval discretization of continuous-valued attributes
  for classification learning. Proc. 13th Int. Joint Conf. AI (IJCAI-93), 1022-1027. Chambery, France, Aug./Sep. 1993.
• Chi-square [Kerber 92]
  [Kerber 92] R. Kerber. ChiMerge: Discretization of numeric attributes. Proc. 10th Nat. Conf. AAAI, 123-128, 1992.
• … (many other proposals)

Bibliography: S. García, J. Luengo, José A. Sáez, V. López, F. Herrera, A Survey of Discretization


Techniques: Taxonomy and Empirical Analysis in Supervised Learning.
IEEE Transactions on Knowledge and Data Engineering 25:4 (2013) 734-750, doi: 10.1109/TKDE.2012.35.
Discretization

Example discretization: equal width

Temperature:
64 65 68 69 70 71 72 72 75 75 80 81 83 85

Interval   [64,67) [67,70) [70,73) [73,76) [76,79) [79,82) [82,85]
Count      2       2       4       2       0       2       2
Discretization

Example discretization: equal frequency

Temperature:
64 65 68 69 70 71 72 72 75 75 80 81 83 85

Interval   [64..69] [70..72] [73..81] [83..85]
Count      4        4        4        2

Equal frequency (height) = 4, except for the last
bin. A short code sketch of both discretizers follows.
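A NumPy sketch of both discretizers on the temperature values above (equal width with 7 bins, equal frequency with 4 bins; with quantile-based cut points the counts are only approximately equal):

import numpy as np

temperature = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])

# Equal width: split [min, max] into n_bins intervals of the same length.
n_bins = 7
edges_w = np.linspace(temperature.min(), temperature.max(), n_bins + 1)
labels_w = np.digitize(temperature, edges_w[1:-1])     # bin index for each value

# Equal frequency: cut points at the quantiles, so bins hold similar counts.
n_bins_f = 4
edges_f = np.quantile(temperature, np.linspace(0, 1, n_bins_f + 1))
labels_f = np.digitize(temperature, edges_f[1:-1])

print(np.bincount(labels_w))    # instances per equal-width bin
print(np.bincount(labels_f))    # instances per equal-frequency bin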
Discretization

Which discretizer is the best?

As usual, it will depend on the application, user
requirements, etc.

Ways to evaluate:
• Total number of intervals
• Number of inconsistencies
• Predictive accuracy rate of classifiers
Data Preprocessing in Data Mining

1. Introduction. Data Preprocessing


2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
Final Remarks

Data preprocessing is a necessity when we
work with real applications.

[Figure: Raw data → Data Preprocessing → Patterns Extraction → Interpretability of results → Knowledge]
Final Remarks
Advantage: Data preprocessing allows us to apply
Learning/Data Mining algorithms more easily and quickly,
obtaining higher-quality models/patterns in terms of accuracy
and/or interpretability.
Final Remarks

Advantage: Data preprocessing allows us to apply Learning/Data Mining
algorithms more easily and quickly, obtaining higher-quality models/patterns
in terms of accuracy and/or interpretability.

A drawback: Data preprocessing is not a structured area
with a specific methodology for understanding the suitability of
preprocessing algorithms when facing a new problem.
Every problem can need a different preprocessing process,
using different tools.
The design of automatic processes combining the different
stages/techniques is one of the data mining challenges.
Final Remarks

KEEL software for Data Mining (knowledge extraction based


on evolutionary learning) includes a data preprocessing
module (feature selection, missing data imputation, instance
selection, discretization, …)

http://www.keel.es/
Final Remarks

Summary
Data preprocessing is a big issue for data mining.
Data preprocessing includes:
  Data preparation: cleaning, handling imperfect data,
  transformation, …
  Data reduction and data transformation.
A lot of methods have been developed, but this is still an active
area of research.
The cooperation between data mining algorithms and
data preparation methods is an interesting/active area.
Bibliography

Dorian Pyle
Data Preparation for Data Mining
Morgan Kaufmann, March 15, 1999

“Good data preparation is
key to produce valid and
reliable models”

S. García, J. Luengo, F. Herrera
Data Preprocessing in Data Mining
Springer, 2015
Thanks!!!

Data Preprocessing
