
Motivation

Data Preprocessing: tasks applied to obtain quality data prior to the use of knowledge extraction algorithms.

[Figure: the knowledge discovery process: Selection → Target data → Preprocessing → Processed data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
Objectives

• To understand the different problems to solve in the process of data preprocessing.
• To know the problems that arise when integrating data from different sources, and the sets of techniques to solve them.
• To know the problems related to cleaning data and mitigating imperfect data, together with some techniques to solve them.
• To understand the necessity of applying data transformation techniques.
• To know the data reduction techniques and the necessity of their application.
Data Preprocessing

1. Introduction. Data Preprocessing


2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
Bibliography:
S. García, J. Luengo, F. Herrera
Data Preprocessing in Data Mining
Springer, January 2015
Data Preprocessing in Data Mining

1. Introduction. Data Preprocessing


2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
INTRODUCTION
D. Pyle, 1999, p. 90:

“The fundamental purpose of data preparation
is to manipulate and transform raw data so
that the information content enfolded in the
data set can be exposed, or made more easily
accessible.”

Dorian Pyle
Data Preparation for Data
Mining Morgan Kaufmann
Publishers, 1999
Data Preprocessing
Importance of Data Preprocessing

1. Real data can be dirty and can lead to the
extraction of useless patterns/rules.

This is mainly due to:

• Incomplete data: lacking attribute values, …
• Noisy data: containing errors or outliers
• Inconsistent data (including discrepancies)
Data Preprocessing
Importance of Data Preprocessing

2. Data preprocessing can generate a smaller data set
than the original, which allows us to improve the
efficiency of the Data Mining process.

This includes Data Reduction techniques:
feature selection, sampling or instance selection,
discretization.
Data Preprocessing
Importance of Data Preprocessing

3. No quality data, no quality mining results!

Data preprocessing techniques generate “quality data”,
leading us to obtain “quality patterns/rules”.

Quality decisions must be based on
quality data!
Data Preprocessing

Data preprocessing consumes a very
significant part of the total time in a
data mining process.
Data Preprocessing
What is included in data preprocessing?

Real databases usually contain noisy data, missing data, and
inconsistent data, …

Major Tasks in Data Preprocessing

1. Data integration. Fusion of multiple sources into a Data
Warehouse.
2. Data cleaning. Removal of noise and inconsistencies.
3. Missing values imputation.
4. Data transformation.
5. Data reduction.
Data Preprocessing in Data Mining

1. Introduction. Data Preprocessing


2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
Integration, Cleaning and Transformation

Data Integration

• Obtain data from different information sources.
• Address problems of encoding and representation.
• Integrate data from different tables to produce
homogeneous information, ...

[Figure: Database 1 and Database 2 feed, via extraction and aggregation, into a Data Warehouse server]
Data Integration
Examples
• Different scales: salary in dollars versus euros (€)
• Derived attributes: monthly salary versus annual salary

item   Salary/month        item   Salary/year
1      5000                6      50,000
2      2400                7      100,000
3      3000                8      40,000
Data Cleaning

Objectives:
• Fix inconsistencies
• Fill/impute missing values
• Smooth noisy data
• Identify or remove outliers …

Some Data Mining algorithms have their own methods to
deal with incomplete or noisy data. But in general, these
methods are not very robust. It is usual to perform
data cleaning prior to applying them.
Bibliography:
W. Kim, B. Choi, E.-D. Hong, S.-K. Kim
A taxonomy of dirty data.
Data Mining and Knowledge Discovery 7, 81-99, 2003.

Data Cleaning

Data cleaning: Example


Original Data
000000000130.06.19971979-10-3080145722 #000310 111000301.01.000100000000004
0000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.0000
00000000000. 000000000000000.000000000000000.0000000...…
000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.00
0000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.0000
00000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000
000000000.000000000000000.000000000000000.000000000000000.00 0000000000300.00 0000000000300.00

Clean Data

0000000001,199706,1979.833,8014,5722 , ,#000310 ….
,111,03,000101,0,04,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,0300.00

Data Cleaning

Data Cleaning: Inconsistent data

Age=“42”
Birth Date=“03/07/1997”

Data transformation
Objective: to transform the data into the form best
suited to the application of Data Mining algorithms.

Some typical operations:

• Aggregation. E.g., summing all the monthly sales into a
single attribute called annual sales, …
• Data generalization. Obtaining higher-level data from
the currently available values, by using concept hierarchies.
  streets → cities
  numerical age → {young, adult, middle-aged, old}
• Normalization: change the range to [-1,1] or [0,1].
• Linear, quadratic, polynomial transformations, …
Bibliography:
T. Y. Lin. Attribute Transformation for Data Mining I: Theoretical
Explorations. International Journal of Intelligent Systems 17, 213-222, 2002.
Normalization

Objective: convert the values of an attribute to a better
range.

Useful for some techniques such as Neural Networks or
distance-based methods (k-Nearest Neighbors, …).
Some normalization techniques:

• Z-score normalization:
  $v' = \dfrac{v - \mu_A}{\sigma_A}$
  where $\mu_A$ and $\sigma_A$ are the mean and standard deviation of attribute $A$.

• Min-max normalization: performs a linear transformation of the
original data from $[min_A, max_A]$ to $[newmin_A, newmax_A]$:
  $v' = \dfrac{v - min_A}{max_A - min_A}\,(newmax_A - newmin_A) + newmin_A$

The relationships among the original data values are maintained.
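As a small illustration, here is a minimal NumPy sketch of both normalizations applied to a hypothetical attribute (the values and the target range are made up for the example):

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # hypothetical attribute values

# Min-max normalization: linear mapping from [min_A, max_A] to [new_min, new_max].
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: center on the mean and scale by the standard deviation.
v_zscore = (v - v.mean()) / v.std()

print(v_minmax)   # values rescaled into [0, 1]
print(v_zscore)   # values with zero mean and unit variance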
Data Preprocessing in Data Mining

1. Introduction. Data Preprocessing


2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
Imperfect data

Missing values

The following options can be used, although some of
them may skew the data:
• Ignore the tuple. Usually done when the variable to
classify (the class label) is missing.
• Use a global constant as the replacement, e.g.
“unknown”, “?”, …
• Fill in the value using the mean of the rest of the
tuples.
• Fill in the value using the mean of the rest of the
tuples belonging to the same class.
• Impute the most probable value. For this, some
inference technique can be used, e.g. Bayesian methods or
decision trees.

A small code sketch of the mean-based options follows.
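A minimal pandas sketch of the constant, mean and per-class mean options (the column names and values are hypothetical):

import numpy as np
import pandas as pd

# Toy data with missing salaries (hypothetical attribute and class names).
df = pd.DataFrame({
    "salary": [5000, np.nan, 3000, np.nan, 2400, 4100],
    "class":  ["A",  "A",    "B",  "B",    "A",  "B"],
})

# Global constant replacement (e.g. a sentinel such as -1 or "unknown").
filled_const = df["salary"].fillna(-1)

# Fill with the mean of the rest of the tuples.
filled_mean = df["salary"].fillna(df["salary"].mean())

# Fill with the mean of the tuples belonging to the same class.
filled_class_mean = df.groupby("class")["salary"].transform(
    lambda s: s.fillna(s.mean()))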
Missing values

15 imputation methods are implemented in KEEL:
http://www.keel.es/
Missing values

Bibliography:
WEBSITE: http://sci2s.ugr.es/MVDM/
J. Luengo, S. García, F. Herrera, A Study on the Use of Imputation Methods for
Experimentation with Radial Basis Function Network Classifiers Handling
Missing Attribute Values: The good synergy between RBFs and EventCovering
method. Neural Networks, doi:10.1016/j.neunet.2009.11.014, 23(3) (2010) 406-418.

S. García, F. Herrera, On the choice of the best imputation methods for missing
values considering three groups of classification methods. Knowledge and
Information Systems 32:1 (2012) 77-108, doi:10.1007/s10115-011-0424-2

Noise cleaning
Types of examples

Fig. 5.2 The three types of examples considered in this book: safe examples (labeled as
s), borderline examples (labeled as b) and noisy examples (labeled as n). The
continuous line shows the decision boundary between the two classes

Noise cleaning

Fig. 5.1 Examples of the interaction between classes: a) small
disjuncts and b) overlapping between classes
Noise cleaning

Use of noise filtering techniques in classification

The three noise filters described next, which are among the best
known, use a voting scheme to determine which cases have to be
removed from the training set:

• Ensemble Filter (EF)
• Cross-Validated Committees Filter (CVCF)
• Iterative-Partitioning Filter (IPF)
Ensemble Filter (EF)
• C.E. Brodley, M.A. Friedl. Identifying Mislabeled Training Data. Journal of Artificial Intelligence Research 11
(1999) 131‐167.
• Different learning algorithms (C4.5, 1‐NN and LDA) are used to create classifiers on several subsets of the
training data; these classifiers serve as noise filters for the training set.
• Two main steps:
1. For each learning algorithm, a k‐fold cross‐validation is used to tag each training example as correct
(prediction = training data label) or mislabeled (prediction ≠ training data label).
2. A voting scheme is used to identify the final set of noisy examples.
• Consensus voting: it removes an example if it is misclassified by all the classifiers.
• Majority voting: it removes an instance if it is misclassified by more than half of the classifiers.

[Figure: each of the m classifiers tags every training example as correct or mislabeled; a voting scheme
(consensus or majority) over these m classifications identifies the noisy examples]
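A possible sketch of EF with scikit-learn, assuming X and y are NumPy arrays; CART (DecisionTreeClassifier) stands in for C4.5, which scikit-learn does not provide:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def ensemble_filter(X, y, k=5, scheme="consensus"):
    """Return the indices of the examples flagged as noisy by EF."""
    y = np.asarray(y)
    classifiers = [DecisionTreeClassifier(),            # CART, standing in for C4.5
                   KNeighborsClassifier(n_neighbors=1),
                   LinearDiscriminantAnalysis()]
    votes = np.zeros(len(y), dtype=int)
    for clf in classifiers:
        pred = cross_val_predict(clf, X, y, cv=k)       # k-fold tagging of every example
        votes += (pred != y)                            # one vote = "mislabeled"
    if scheme == "consensus":
        noisy = votes == len(classifiers)               # misclassified by all classifiers
    else:                                               # majority voting
        noisy = votes > len(classifiers) / 2
    return np.where(noisy)[0]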
Cross‐Validated Committees Filter (CVCF)
• S. Verbaeten, A.V. Assche. Ensemble methods for noise elimination in
classification problems. 4th International Workshop on Multiple Classifier Systems
(MCS 2003). LNCS 2709, Springer 2003, Guilford (UK, 2003) 317‐325.

• CVCF is similar to EF, with two main differences:

1. The same learning algorithm (C4.5) is used to create classifiers on several
subsets of the training data.

   The authors of CVCF place special emphasis on using ensembles of decision
trees such as C4.5 because they work well as a filter for noisy data.

2. Each classifier built with the k‐fold cross‐validation is used to tag ALL the
training examples (not only the test fold) as correct (prediction = training data
label) or mislabeled (prediction ≠ training data label).
Iterative Partitioning Filter (IPF)
• T.M. Khoshgoftaar, P. Rebours. Improving software quality prediction by noise filtering
techniques. Journal of Computer Science and Technology 22 (2007) 387‐396.
• IPF removes noisy data in multiple iterations using CVCF until a stopping criterion is reached.
• The iterative process stops if, for a number of consecutive iterations, the number of noisy
examples in each iteration is less than a percentage of the size of the training dataset.

[Figure: IPF loop: the current training data is passed through the CVCF filter, the examples it flags are removed,
and the process repeats until the stop criterion is met; the accumulated flagged examples form the final set of
noisy examples]
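A sketch of IPF's outer loop only, with the CVCF step abstracted as a callable base_filter(X, y) that returns the indices it flags; the parameter names p and s are illustrative choices, not taken from the original paper:

import numpy as np

def iterative_partitioning_filter(X, y, base_filter, p=0.01, s=3):
    """Repeatedly apply a CVCF-style filter and remove the flagged examples,
    stopping once the number of flagged examples has been below
    p * |original training set| for s consecutive iterations."""
    X_cur, y_cur = np.asarray(X), np.asarray(y)
    n_original = len(y_cur)
    quiet_rounds = 0
    while quiet_rounds < s:
        noisy_idx = base_filter(X_cur, y_cur)            # indices flagged as noisy
        quiet_rounds = quiet_rounds + 1 if len(noisy_idx) < p * n_original else 0
        keep = np.setdiff1d(np.arange(len(y_cur)), noisy_idx)
        X_cur, y_cur = X_cur[keep], y_cur[keep]          # drop the flagged examples
    return X_cur, y_cur                                  # cleaned training data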


Noise cleaning

http://www.keel.es/
Data Preprocessing in Data Mining

1. Introduction. Data Preprocessing


2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
Data Reduction

Feature Selection

The problem of Feature Subset Selection (FSS) consists of
finding a subset of the attributes/features/variables of the
data set that optimizes the probability of success in the
subsequent data mining task.
Feature Selection

Var. 1. Var. 5 Var. 13

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
A 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
B 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
C 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
D 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
E 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0
F 1 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0
Feature Selection

The problem of Feature Subset Selection (FSS) consists of
finding a subset of the attributes/features/variables of the
data set that optimizes the probability of success in the
subsequent data mining task.

Why is feature selection necessary?

More attributes do not mean more success in the data


mining process.
Working with less attributes reduces the complexity of the
problem and the running time.
With less attributes, the generalization capability increases.
The values for certain attributes may be difficult and costly
to obtain.
Feature Selection

The outcome of FS would be:

• Less data → algorithms can learn more quickly
• Higher accuracy → the algorithm generalizes better
• Simpler results → easier to understand

Feature extraction and feature construction are
extensions of FS.
Feature Selection

Fig. 7.1 Search space for FS (from the complete set of features to the empty set of features)
Feature Selection

It can be considered as a search problem


{}

{1} {2} {3} {4}

{1,2} {1,3} {2,3} {1,4} {2,4} {3,4}

{1,2,3} {1,2,4} {1,3,4} {2,3,4}

{1,2,3,4}

Feature Selection

Process

[Figure: FS process: a subset generation step (SG) produces a candidate feature subset from the target data, an
evaluation function (EC) scores it, and the loop repeats until the stop criterion is met, returning the selected
subset]
Feature Selection

Goal functions: there are two different approaches

• Filter. The goal function evaluates the subsets based on
the information they contain. Measures of class
separability, statistical dependences, information theory, …
are used as the goal function.

• Wrapper. The goal function consists of applying the
same learning technique that will be used later over the
data resulting from the selection of the features. The
returned value is usually the accuracy rate of the
constructed classifier.
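As an illustration of the wrapper approach, the sketch below uses scikit-learn's SequentialFeatureSelector (assumed available, version 0.24 or later) with a k-NN classifier as both the goal function and the final learner; the data set and parameter values are arbitrary:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Wrapper: each candidate subset is scored by the cross-validated accuracy
# of the same learner that will be used afterwards.
knn = KNeighborsClassifier(n_neighbors=5)
sfs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="forward", scoring="accuracy", cv=5)
sfs.fit(X, y)
print(sfs.get_support())        # boolean mask of the selected features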
Feature Selection

Process

Fig. 7.2 A filter model for FS


Feature Selection

Filtering measures

• Separability measures. They estimate the separability among
classes: Euclidean, Mahalanobis, …
E.g., in a two-class problem, an FS process based on this kind of measure
determines that X is better than Y if X induces a greater difference than
Y between the two class-conditional probabilities.

• Correlation. Good subsets are those highly correlated with the class
variable:

  $f(X_1, \ldots, X_M) = \dfrac{\sum_{i=1}^{M} \rho_{ic}}{\sum_{i=1}^{M} \sum_{j=i+1}^{M} \rho_{ij}}$

where $\rho_{ic}$ is the correlation coefficient between the variable $X_i$ and the
label $c$ of the class ($C$), and $\rho_{ij}$ is the correlation coefficient between $X_i$
and $X_j$.
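A minimal NumPy sketch of this merit for a candidate subset, assuming numeric features and a numeric (or label-encoded) class; the helper name is ours:

import numpy as np

def correlation_merit(X, y):
    """Ratio between the feature-class correlations and the pairwise
    feature-feature correlations of the columns of X (absolute values)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    M = X.shape[1]
    rho_ic = sum(abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(M))
    rho_ij = sum(abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                 for i in range(M) for j in range(i + 1, M))
    return rho_ic / (rho_ij + 1e-12)    # small constant avoids division by zero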
Feature Selection

Information theory based measures

Correlation can only estimate linear dependences. A more powerful
measure is the mutual information $I(X_{1,\ldots,M}; C)$:

  $f(X_{1,\ldots,M}) = I(X_{1,\ldots,M}; C) = H(C) - H(C \mid X_{1,\ldots,M})
   = \sum_{c=1}^{C} \int_{X_{1,\ldots,M}} P(X_{1,\ldots,M}, \omega_c) \log \dfrac{P(X_{1,\ldots,M}, \omega_c)}{P(X_{1,\ldots,M})\,P(\omega_c)} \, dx$

where $H$ denotes the entropy and $\omega_c$ the $c$-th label of the class $C$.
Mutual information measures how much the uncertainty about the class $C$
decreases when the values of the vector $X_{1,\ldots,M}$ are known.
Due to the complexity of computing $I$, it is usual to use heuristic
rules, such as

  $f(X_{1,\ldots,M}) = \sum_{i=1}^{M} I(X_i; C) - \beta \sum_{i=1}^{M} \sum_{j=i+1}^{M} I(X_i; X_j)$

with $\beta = 0.5$, for example.
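A sketch of this heuristic merit using scikit-learn's mutual_info_score, assuming the features have already been discretized (the function name is ours):

import numpy as np
from itertools import combinations
from sklearn.metrics import mutual_info_score

def mi_merit(X_disc, y, beta=0.5):
    """Sum of I(Xi; C) minus beta times the sum of pairwise I(Xi; Xj),
    for the discrete feature matrix X_disc and class labels y."""
    X_disc = np.asarray(X_disc)
    M = X_disc.shape[1]
    relevance = sum(mutual_info_score(X_disc[:, i], y) for i in range(M))
    redundancy = sum(mutual_info_score(X_disc[:, i], X_disc[:, j])
                     for i, j in combinations(range(M), 2))
    return relevance - beta * redundancy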
Feature Selection
Consistency measures

The three previous groups of measures try to find the
features that can predict the class as well as possible,
better than the remaining ones.
• This approach cannot distinguish between two attributes that
are equally appropriate, and it does not detect redundant features.

Consistency measures try to find a minimum number of
features that are able to separate the classes in the same
way that the original data set does.
Feature Selection

Process

Fig. 7.2 A wrapper model for FS


Feature Selection

Process

Fig. 7.2 A filter model for FS


Feature Selection

Advantages

Wrappers:
• Accuracy: generally, they are more accurate than filters,
due to the interaction between the classifier used in the goal
function and the training data set.
• Generalization capability: they are able to avoid
overfitting due to the validation techniques employed.

Filters:
• Fast: they usually compute frequencies, which is much quicker
than training a classifier.
• Generality: since they evaluate intrinsic properties of the
data and not their interaction with a classifier, they can be
used in any problem.
Feature Selection
Drawbacks

Wrappers:
• Very costly: each evaluation requires learning and
validating a model. This is prohibitive for complex classifiers.
• Ad-hoc solutions: the solutions are biased towards the
classifier used.

Filters:
• Tendency to include many variables: normally, this is due to
monotone features in the goal function used.
  • The user should set the threshold to stop.
Feature Selection

Categories

1. According to the evaluation: filter, wrapper
2. According to class availability: supervised, unsupervised
3. According to the search: complete O(2^N), heuristic O(N^2), random
4. According to the outcome: ranking, subset of features
Feature Selection

Algorithms for obtaining a subset of features

They return a subset of attributes optimized according
to an evaluation criterion.

Input: x attributes, U evaluation criterion

Subset = {}
Repeat
  Sk = generateSubset(x)
  if improvement(Subset, Sk, U)
    Subset = Sk
Until StopCriterion()

Output: Subset, the most relevant attributes
Feature Selection

Ranking algorithms
They return a list of attributes sorted by an evaluation
criterion (a runnable sketch follows below).

Input: x attributes, U evaluation criterion

List = {}
For each attribute xi, i ∈ {1,...,N}
  vi = compute(xi, U)
  insert xi into List according to vi

Output: List, most relevant attributes first
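A runnable Python version of the ranking scheme above, where the criterion U is passed in as a per-attribute scoring function; the example criterion, absolute correlation with the class, is just one possible choice:

import numpy as np

def rank_features(X, y, criterion):
    """Score every attribute with the evaluation criterion U and return the
    attribute indices sorted from most to least relevant, plus their scores."""
    X = np.asarray(X)
    scores = np.array([criterion(X[:, i], y) for i in range(X.shape[1])])
    ranking = np.argsort(scores)[::-1]
    return ranking, scores

# Example usage with a simple filter criterion:
# ranking, scores = rank_features(X, y,
#                                 lambda x, c: abs(np.corrcoef(x, c)[0, 1]))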
Feature Selection

Ranking algorithms

Attributes A1 A2 A3 A4 A5 A6 A7 A8 A9
Ranking A5 A7 A4 A3 A1 A8 A6 A2 A9

A5 A7 A4 A3 A1 A8 (6 attributes)
Feature Selection

Some relevant algorithms:

Focus algorithm. Consistency measure for forward search


Mutual Information based Feature Selection (MIFS).
mRMR: Minimum Redundancy Maximum Relevance
Las Vegas Filter (LVF)
Las Vegas Wrapper (LVW)
Relief Algorithm
Instance Selection

Instance selection tries to choose the examples which are
relevant to an application, achieving the maximum
performance. The outcome of IS would be:

• Less data → algorithms learn more quickly
• Higher accuracy → the algorithm generalizes better
• Simpler results → easier to understand

IS has as an extension the generation of instances
(prototype generation).
Instance Selection

Different size examples

8000 points 2000 points 500 points


Instance Selection

Sampling

Raw data
Instance Selection

Sampling
Raw Data Simple reduction
Instance Selection

Training Prototype
Data Set Selection Instances
(TR) Algorithm Selected (S)

Test Instance-based
Data Set Classifier
(TS)

Fig. 8.1 PS process


Instance Selection

Prototype Selection (instance-based learning)

Properties:

Direction of the search: Incremental, decremental,


batch, hybrid or fixed.

Selection type: Condensation, Edition, Hybrid.

Evaluation type: Filter or wrapper.


Instance Selection

A pair of classical algorithms:

Classical algorithm of condensation: Condensed Nearest Neighbor (CNN)


Incremental
It only inserts the misclassified instances in the new subsets.
Dependant on the order of presentation.
It only retains borderline examples.
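A compact sketch of CNN with NumPy and scikit-learn's 1-NN classifier; the random seed only fixes which instance of each class seeds the store:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cnn_selection(X, y, seed=0):
    """Return the indices of the instances retained by Condensed NN."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    # Start the store with one random instance of each class.
    store = [int(rng.choice(np.where(y == c)[0])) for c in np.unique(y)]
    changed = True
    while changed:                      # repeat passes until nothing is added
        changed = False
        knn = KNeighborsClassifier(n_neighbors=1).fit(X[store], y[store])
        for i in range(len(y)):
            if i in store:
                continue
            if knn.predict(X[i:i + 1])[0] != y[i]:
                store.append(i)         # insert only the misclassified instance
                knn = KNeighborsClassifier(n_neighbors=1).fit(X[store], y[store])
                changed = True
    return np.array(sorted(store))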
Instance Selection

A pair of classical algorithms:

Classical algorithm for edition: Edited Nearest Neighbor (ENN)
• Batch.
• It removes those instances which are wrongly classified by a k-nearest
neighbor scheme (k = 3, 5 or 9).
• It “smooths” the borders between classes, while retaining the rest of the points.
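ENN can be sketched with a leave-one-out k-NN prediction: every instance is classified by its k nearest neighbours among the remaining instances, and the misclassified ones are removed (scikit-learn assumed available):

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

def enn_selection(X, y, k=3):
    """Return the indices of the instances retained by Edited NN."""
    y = np.asarray(y)
    pred = cross_val_predict(KNeighborsClassifier(n_neighbors=k), X, y,
                             cv=LeaveOneOut())
    return np.where(pred == y)[0]       # keep only correctly classified instances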
Instance Selection

Graphical illustrations:

Banana data set with 5,300 instances and two classes. Subsets
obtained with CNN and AllKNN (iterative application of ENN with
k = 3, 5 and 7).
Instance Selection

Graphical illustrations:

RMHC is an adaptive sampling technique based on local search with
a fixed final rate of retention.
DROP3 is the best-known hybrid technique, widely used with kNN.
SSMA is an evolutionary approach based on memetic algorithms.
Instance Selection

Training Set Selection


Example Instance Selection
and Decision Tree modeling

Kdd Cup’99. Strata Number: 100


             No. Rules   % Reduction   %Ac Trn   %Ac Test
C4.5         252         n/a           99.97%    99.94%
Cnn Strat    83          81.61%        98.48%    96.43%
Drop1 Strat  3           99.97%        38.63%    34.97%
Drop2 Strat  82          76.66%        81.40%    76.58%
Drop3 Strat  49          56.74%        77.02%    75.38%
Ib2 Strat    48          82.01%        95.81%    95.05%
Ib3 Strat    74          78.92%        99.13%    96.77%
Icf Strat    68          23.62%        99.98%    99.53%
CHC Strat    9           99.68%        98.97%    97.53%
Bibliography: J.R. Cano, F. Herrera, M. Lozano, Evolutionary Stratified Training Set Selection for Extracting Classification
Rules with Trade-off Precision-Interpretability. Data and Knowledge Engineering 60 (2007) 90-108,
doi:10.1016/j.datak.2006.01.008.
Instance Selection

WEBSITE:
http://sci2s.ugr.es/pr/index.php
Bibliography:
S. García, J. Derrac, J.R. Cano and F. Herrera,
Prototype Selection for Nearest Neighbor Classification: Taxonomy and Empirical Study.
IEEE Transactions on Pattern Analysis and Machine Intelligence 34:3 (2012) 417-435 doi:
10.1109/TPAMI.2011.142
S. García, J. Luengo, F. Herrera. Data Preprocessing in Data Mining, Springer, 15, 2015

Source Codes (Java):


Discretization

Discrete values are very useful in Data Mining.

They represent more concise information, they are
easier to understand and closer to a knowledge-level
representation.

Discretization is focused on the transformation of
continuous values, which have an ordering among them, into
nominal/categorical values without ordering. It is also a
quantification of numerical attributes.

Nominal values lie within a finite domain, so discretization is
also considered a data reduction technique.
Discretization

Divide the range of numerical (continuous or not) attributes
into intervals.
Store the labels of the intervals.
Discretization is crucial for association rules and for some
classification algorithms, which only accept discrete data.

Age              5 6 6 9 … 15 | 16 16 17 20 … 24 | 25 41 50 65 … 67
Owner of a car   0 0 0 0 … 0  | 1  0  1  1  … 0  | 1  1  1  1  … 1

AGE ∈ [5,15]      AGE ∈ [16,24]      AGE ∈ [25,67]

Discretization

Stages in the discretization process


Discretization

Discretization has been developed along several lines
according to different necessities:

• Supervised vs. unsupervised: whether or not they
consider the objective (class) attribute.
• Dynamic vs. static: whether discretization is performed
simultaneously with the building of the model or not.
• Local vs. global: whether they consider a subset of the
instances or all of them.
• Top-down vs. bottom-up: whether they start with an
empty list of cut points (adding new ones) or with all the
possible cut points (merging them).
• Direct vs. incremental: whether they make all decisions
at once or one by one.
Discretization

Unsupervised algorithms:
• Equal width
• Equal frequency
• Clustering …

Supervised algorithms:
• Entropy based [Fayyad & Irani 93 and others]
  [Fayyad & Irani 93] U.M. Fayyad and K.B. Irani. Multi-interval discretization of continuous-valued attributes
  for classification learning. Proc. 13th Int. Joint Conf. AI (IJCAI-93), 1022-1027. Chambery, France, Aug./Sep. 1993.
• Chi-square [Kerber 92]
  [Kerber 92] R. Kerber. ChiMerge: Discretization of numeric attributes. Proc. 10th Nat. Conf. AAAI, 123-128, 1992.
• … (many other proposals)

Bibliography: S. García, J. Luengo, José A. Sáez, V. López, F. Herrera, A Survey of Discretization


Techniques: Taxonomy and Empirical Analysis in Supervised Learning.
IEEE Transactions on Knowledge and Data Engineering 25:4 (2013) 734-750, doi: 10.1109/TKDE.2012.35.
Discretization

Example discretization: equal width

Temperature:
64 65 68 69 70 71 72 72 75 75 80 81 83 85

Interval   [64,67) [67,70) [70,73) [73,76) [76,79) [79,82) [82,85]
Count      2       2       4       2       0       2       2
Discretization

Example discretization: equal frequency

Temperature:
64 65 68 69 70 71 72 72 75 75 80 81 83 85

Interval   [64..69] [70..72] [73..81] [83..85]
Count      4        4        4        2

Equal frequency (height) = 4, except for the last
bin. A short code sketch of both discretizers follows.
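A NumPy sketch of both discretizers on the temperature values above (equal width with 7 bins, equal frequency with 4 bins; with quantile-based cut points the counts are only approximately equal):

import numpy as np

temperature = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])

# Equal width: split [min, max] into n_bins intervals of the same length.
n_bins = 7
edges_w = np.linspace(temperature.min(), temperature.max(), n_bins + 1)
labels_w = np.digitize(temperature, edges_w[1:-1])     # bin index for each value

# Equal frequency: cut points at the quantiles, so bins hold similar counts.
n_bins_f = 4
edges_f = np.quantile(temperature, np.linspace(0, 1, n_bins_f + 1))
labels_f = np.digitize(temperature, edges_f[1:-1])

print(np.bincount(labels_w))    # instances per equal-width bin
print(np.bincount(labels_f))    # instances per equal-frequency bin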
Discretization

Which discretizer is the best?

As usual, it will depend on the application, user
requirements, etc.

Ways to evaluate:
• Total number of intervals
• Number of inconsistencies
• Predictive accuracy rate of classifiers
Data Preprocessing in Data Mining

1. Introduction. Data Preprocessing


2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
Final Remarks

Data preprocessing is a necessity when we
work with real applications.

[Figure: Raw data → Data Preprocessing → Patterns Extraction → Interpretability of results → Knowledge]
Final Remarks
Advantage: Data preprocessing allows us to apply
Learning/Data Mining algorithms more easily and quickly,
obtaining higher-quality models/patterns in terms of accuracy
and/or interpretability.
Final Remarks

Advantage: Data preprocessing allows us to apply Learning/Data Mining
algorithms more easily and quickly, obtaining higher-quality models/patterns
in terms of accuracy and/or interpretability.

A drawback: Data preprocessing is not a structured area
with a specific methodology for understanding the suitability of
preprocessing algorithms when facing a new problem.
Every problem can need a different preprocessing process,
using different tools.
The design of automatic processes combining the different
stages/techniques is one of the data mining challenges.
Final Remarks

KEEL software for Data Mining (knowledge extraction based


on evolutionary learning) includes a data preprocessing
module (feature selection, missing data imputation, instance
selection, discretization, …)

http://www.keel.es/
Final Remarks

Summary
Data preprocessing is a big issue for data mining.
Data preprocessing includes:
  Data preparation: cleaning, handling imperfect data,
  transformation, …
  Data reduction and data transformation.
A lot of methods have been developed, but this is still an active
area of research.
The cooperation between data mining algorithms and
data preparation methods is an interesting/active area.
Bibliography

Dorian Pyle
Data Preparation for Data Mining
Morgan Kaufmann, March 15, 1999

“Good data preparation is
key to produce valid and
reliable models”

S. García, J. Luengo, F. Herrera
Data Preprocessing in Data Mining
Springer, 2015
Thanks!!!

Data Preprocessing
