Motivation
Data Preprocessing: tasks to obtain quality data prior to the use of knowledge extraction algorithms.
[Diagram of the KDD process: data → Selection → target data → Preprocessing → processed data → Data Mining → patterns → Interpretation/Evaluation → knowledge]
Objectives
To understand the different problems to solve in the process of data preprocessing.
To know the problems of data integration from different sources and the sets of techniques to solve them.
To know the problems related to cleaning data and mitigating imperfect data, together with some techniques to solve them.
To understand the necessity of applying data transformation techniques.
To know the data reduction techniques and the necessity of their application.
Data Preprocessing
1. Introduction. Data Preprocessing
2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
Bibliography:
S. García, J. Luengo, F. Herrera
Data Preprocessing in Data Mining
Springer, January 2015
Data Preprocessing in Data Mining
1. Introduction. Data Preprocessing
2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
INTRODUCTION
D. Pyle, 1999, p. 90:
“The fundamental purpose of data preparation is to manipulate and transform raw data so that the information content enfolded in the data set can be exposed, or made more easily accessible.”
Dorian Pyle
Data Preparation for Data
Mining Morgan Kaufmann
Publishers, 1999
Data Preprocessing
Importance of Data Preprocessing
1. Real data can be dirty and can lead to the extraction of useless patterns/rules.
This is mainly due to:
Incomplete data: lacking attribute values, …
Data with noise: containing errors or outliers
Inconsistent data (including discrepancies)
Data Preprocessing
Importance of Data Preprocessing
2. Data preprocessing can generate a smaller data set
than the original, which allows us to improve the
efficiency in the Data Mining process.
This is achieved through data reduction techniques: feature selection, sampling or instance selection, and discretization.
Data Preprocessing
Importance of Data Preprocessing
3. No quality data, no quality mining results!
Data preprocessing techniques generate “quality data”,
driving us to obtain “quality patterns/rules”.
Quality decisions must be based on
quality data!
Data Preprocessing
Data preprocessing takes up a very significant part of the total time of a data mining process.
Data Preprocessing
What is included in data preprocessing?
Real databases usually contain noisy data, missing data, and
inconsistent data, …
Major Tasks in Data Preprocessing
1. Data integration. Fusion of multiple sources into a Data Warehouse.
2. Data cleaning. Removal of noise and inconsistencies.
3. Missing values imputation.
4. Data Transformation.
5. Data reduction.
Data Preprocessing in Data Mining
1. Introduction. Data Preprocessing
2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
Integration, Cleaning and Transformation
Data Integration
Obtain data from different information sources.
Address problems of codification and representation.
Integrate data from different tables to produce
homogeneous information, ...
[Diagram: Database 1 and Database 2 feed a Data Warehouse server through extraction and aggregation]
Data Integration
Examples
Different scales: Salary in dollars versus euros (€)
Derived attributes: monthly salary versus annual salary

Item   Salary/month        Item   Salary (annual)
1      5,000               6      50,000
2      2,400               7      100,000
3      3,000               8      40,000
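A minimal pandas sketch of this kind of harmonization step before merging sources; the table contents, column names and the fixed exchange rate are made up purely for illustration:

```python
# Hedged sketch: bring two hypothetical salary tables to the same unit before integration.
import pandas as pd

usd = pd.DataFrame({"item": [1, 2, 3], "salary_month_usd": [5000, 2400, 3000]})
eur = pd.DataFrame({"item": [6, 7, 8], "salary_year_eur": [50000, 100000, 40000]})

EUR_PER_USD = 0.9  # assumed fixed rate, only for the example

# Same unit (annual salary in euros) for both sources, then concatenate.
usd["salary_year_eur"] = usd["salary_month_usd"] * 12 * EUR_PER_USD
unified = pd.concat([usd[["item", "salary_year_eur"]],
                     eur[["item", "salary_year_eur"]]], ignore_index=True)
print(unified)
```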
Data Cleaning
Objectives:
• Fix inconsistencies
• Fill in / impute missing values
• Smooth noisy data
• Identify or remove outliers …
Some Data Mining algorithms have their own methods to deal with incomplete or noisy data, but in general these methods are not very robust. It is usual to perform data cleaning prior to their application.
Bibliography:
W. Kim, B. Choi, E.-D. Hong, S.-K. Kim
A taxonomy of dirty data.
Data Mining and Knowledge Discovery 7, 81-99, 2003.
Data Cleaning
Data cleaning: Example
Original Data
000000000130.06.19971979-10-3080145722 #000310 111000301.01.000100000000004
0000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.0000
00000000000. 000000000000000.000000000000000.0000000...…
000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.00
0000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.0000
00000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000
000000000.000000000000000.000000000000000.000000000000000.00 0000000000300.00 0000000000300.00
Clean Data
0000000001,199706,1979.833,8014,5722 , ,#000310 ….
,111,03,000101,0,04,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,0300.00
Data Cleaning
Data Cleaning: Inconsistent data
Age=“42”
Birth Date=“03/07/1997”
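A small hedged sketch of how such an age/birth-date inconsistency could be flagged automatically; the column names and the reference date are assumptions for illustration only:

```python
# Hedged sketch: flag records whose stored age disagrees with the birth date.
import pandas as pd

df = pd.DataFrame({"age": [42, 25],
                   "birth_date": ["1997-07-03", "1999-01-20"]})
df["birth_date"] = pd.to_datetime(df["birth_date"])

reference = pd.Timestamp("2015-01-01")              # assumed snapshot date
derived_age = (reference - df["birth_date"]).dt.days // 365
df["inconsistent"] = (df["age"] - derived_age).abs() > 1
print(df)
```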
Data transformation
Objective: to transform the data into the form best suited to the application of Data Mining algorithms.
Some typical operations:
• Aggregation. E.g., summing all the monthly sales into a single attribute called annual sales, …
• Data generalization. Obtaining higher-level data from the currently available data by using concept hierarchies.
  streets → cities
  Numerical age → {young, adult, middle-aged, old}
• Normalization: change the range to [-1,1] or [0,1].
• Linear, quadratic, polynomial transformations, …
Bibliography:
T. Y. Lin. Attribute Transformation for Data Mining I: Theoretical
Explorations. International Journal of Intelligent Systems 17, 213-222, 2002.
Normalization
Objective: convert the values of an attribute to a better
range.
Useful for some techniques such as Neural Networks or distance-based methods (k-Nearest Neighbors, …).
Some normalization techniques:
Z-score normalization:
    $v' = \dfrac{v - \bar{A}}{\sigma_A}$
where $\bar{A}$ and $\sigma_A$ are the mean and the standard deviation of attribute A.
Min-max normalization: performs a linear transformation of the original data from $[min_A, max_A]$ to $[newmin_A, newmax_A]$:
    $v' = \dfrac{v - min_A}{max_A - min_A}\,(newmax_A - newmin_A) + newmin_A$
The relationships among the original data are maintained.
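A minimal NumPy sketch of the two normalizations above, on made-up values:

```python
# Hedged sketch: z-score and min-max normalization of a toy attribute.
import numpy as np

v = np.array([5000.0, 2400.0, 3000.0, 10000.0])

# z-score: v' = (v - mean) / std
z = (v - v.mean()) / v.std()

# min-max to [new_min, new_max], here [0, 1]
new_min, new_max = 0.0, 1.0
mm = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

print(z, mm, sep="\n")
```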
Data Preprocessing in Data Mining
1. Introduction. Data Preprocessing
2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
Imperfect data
Missing values
The following choices could be used, although some of them may skew the data (a small sketch of several of them follows this list):
Ignore the tuple. This is usually done when the variable to classify (the class) has no value.
Use a global constant for the replacement, e.g. “unknown”, “?”, …
Fill in the value with the mean/deviation of the rest of the tuples.
Fill in the value with the mean/deviation of the rest of the tuples belonging to the same class.
Impute with the most probable value. For this, some inference technique could be used, e.g., Bayesian methods or decision trees.
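A minimal pandas sketch of three of the options above (global constant, overall mean, per-class mean) on a toy data set; the column names and values are illustrative:

```python
# Hedged sketch: three simple imputation strategies for a numeric attribute.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [2400.0, np.nan, 3000.0, np.nan, 5000.0],
                   "class":  ["A", "A", "B", "B", "B"]})

constant = df["income"].fillna(-1)                    # global constant replacement
mean_all = df["income"].fillna(df["income"].mean())   # mean of the rest of the tuples
mean_cls = df.groupby("class")["income"].transform(   # mean of tuples of the same class
    lambda s: s.fillna(s.mean()))
print(constant, mean_all, mean_cls, sep="\n")
```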
Missing values
[KEEL screenshot: 15 imputation methods, http://www.keel.es/]
Missing values
Bibliography:
WEBSITE: http://sci2s.ugr.es/MVDM/
J. Luengo, S. García, F. Herrera, A Study on the Use of Imputation Methods for
Experimentation with Radial Basis Function Network Classifiers Handling
Missing Attribute Values: The good synergy between RBFs and EventCovering
method. Neural Networks, doi:10.1016/j.neunet.2009.11.014, 23(3) (2010) 406-418.
S. García, F. Herrera, On the choice of the best imputation methods for missing
values considering three groups of classification methods. Knowledge and
Information Systems 32:1 (2012) 77-108, doi:10.1007/s10115-011-0424-2
Noise cleaning
Types of examples
Fig. 5.2 The three types of examples considered in this book: safe examples (labeled as
s), borderline examples (labeled as b) and noisy examples (labeled as n). The
continuous line shows the decision boundary between the two classes
Noise cleaning
Fig. 5.1 Examples of the interaction between classes: a) small
disjuncts and b) overlapping between classes
Noise cleaning
Use of noise filtering techniques in classification
The three noise filters described next, which are among the best known, use a voting scheme to determine which examples have to be removed from the training set:
Ensemble Filter (EF)
Cross-Validated Committees Filter
Iterative-Partitioning Filter
Ensemble Filter (EF)
• C.E. Brodley, M.A. Friedl. Identifying Mislabeled Training Data. Journal of Artificial Intelligence Research 11
(1999) 131‐167.
• Different learning algorithms (C4.5, 1-NN and LDA) are used to create classifiers on several subsets of the training data, which serve as noise filters for the training set.
• Two main steps:
1. For each learning algorithm, a k‐fold cross‐validation is used to tag each training example as correct
(prediction = training data label) or mislabeled (prediction ≠ training data label).
2. A voting scheme is used to identify the final set of noisy examples.
• Consensus voting: it removes an example if it is misclassified by all the classifiers.
• Majority voting: it removes an instance if it is misclassified by more than half of the classifiers.
[Diagram of the Ensemble Filter (EF): the training data is given to classifiers #1 … #m; each one tags every example as correct or mislabeled; a voting scheme (consensus or majority) produces the final set of noisy examples]
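A hedged sketch of the EF voting idea using scikit-learn stand-ins (a decision tree in place of C4.5, 1-NN and LDA); this is not the authors' original code, only an illustration of the two steps:

```python
# Hedged sketch: voting-based noise identification in the spirit of the Ensemble Filter.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
filters = [DecisionTreeClassifier(random_state=0),   # stand-in for C4.5
           KNeighborsClassifier(n_neighbors=1),
           LinearDiscriminantAnalysis()]

# Step 1: tag each training example as mislabeled (True) per classifier via k-fold CV.
votes = np.array([cross_val_predict(clf, X, y, cv=10) != y for clf in filters])

# Step 2: voting scheme over the classifiers' tags.
consensus = votes.all(axis=0)                 # removed if ALL classifiers misclassify it
majority = votes.sum(axis=0) > len(filters) / 2
X_clean, y_clean = X[~majority], y[~majority]
print(majority.sum(), "examples flagged as noisy by majority voting")
```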
Cross‐Validated Committees Filter (CVCF)
• S. Verbaeten, A.V. Assche. Ensemble methods for noise elimination in
classification problems. 4th International Workshop on Multiple Classifier Systems
(MCS 2003). LNCS 2709, Springer 2003, Guilford (UK, 2003) 317‐325.
• CVCF is similar to EF, with two main differences:
1. The same learning algorithm (C4.5) is used to create classifiers in several
subsets of the training data.
The authors of CVCF place special emphasis on using ensembles of decision
trees such as C4.5 because they work well as a filter for noisy data.
2. Each classifier built with the k‐fold cross‐validation is used to tag ALL the
training examples (not only the test set) as correct (prediction = training data
label) or mislabeled (prediction ≠ training data label).
Iterative Partitioning Filter (IPF)
• T.M. Khoshgoftaar, P. Rebours. Improving software quality prediction by noise filtering
techniques. Journal of Computer Science and Technology 22 (2007) 387‐396.
• IPF removes noisy data in multiple iterations using CVCF until a stopping criterion is reached.
• The iterative process stops when, for a number of consecutive iterations, the number of noisy examples identified in each iteration is less than a percentage of the size of the training dataset (see the sketch after the diagram below).
[Flowchart: the current training data is passed through the CVCF filter; the noisy examples it identifies are removed; if the stop condition does not hold, the process repeats on the reduced data; otherwise the final set of noisy examples is returned]
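A hedged sketch of the IPF stopping rule only; `cvcf_filter` is a hypothetical stand-in for one CVCF pass that returns the indices of the examples flagged as noisy:

```python
# Hedged sketch: the iterative loop and stopping criterion of IPF (one possible reading).
import numpy as np

def iterative_partitioning_filter(X, y, cvcf_filter, p=0.01, k=3):
    """Repeat CVCF until, for k consecutive iterations, the noise removed
    is below p times the original training set size."""
    threshold = p * len(y)
    consecutive = 0
    while consecutive < k:
        noisy = cvcf_filter(X, y)               # indices flagged as noisy this pass
        consecutive = consecutive + 1 if len(noisy) < threshold else 0
        keep = np.setdiff1d(np.arange(len(y)), noisy)
        X, y = X[keep], y[keep]                 # remove the flagged examples and repeat
    return X, y
```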
Noise cleaning
http://www.keel.es/
Data Preprocessing in Data Mining
1. Introduction. Data Preprocessing
2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
Data Reduction
Feature Selection
The problem of Feature Subset Selection (FSS) consists of
finding a subset of the attributes/features/variables of the
data set that optimizes the probability of success in the
subsequent data mining tasks.
Feature Selection
(Illustration: from the 16 variables below, the selected subset is Var. 1, Var. 5 and Var. 13)

      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
  A   0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1
  B   0  0  0  0  1  1  1  1  0  0  0  0  1  1  1  1
  C   0  0  1  1  0  0  1  1  0  0  1  1  0  0  1  1
  D   0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1
  E   0  1  0  0  0  1  1  0  1  1  0  0  0  0  1  0
  F   1  1  1  0  1  1  0  0  1  0  1  0  0  1  0  0
Feature Selection
The problem of Feature Subset Selection (FSS) consists of
finding a subset of the attributes/features/variables of the
data set that optimizes the probability of success in the
subsequent data mining tasks.
Why is feature selection necessary?
More attributes do not mean more success in the data
mining process.
Working with fewer attributes reduces the complexity of the problem and the running time.
With fewer attributes, the generalization capability increases.
The values for certain attributes may be difficult and costly
to obtain.
Feature Selection
The outcome of FS would be:
Less data → algorithms can learn more quickly
Higher accuracy → the algorithm generalizes better
Simpler results → easier to understand
The extraction and construction of attributes are extensions of FS.
Feature Selection
Fig. 7.1 Search space for FS: from the complete set of features to the empty set of features
Feature Selection
It can be considered as a search problem
{}
{1} {2} {3} {4}
{1,2} {1,3} {2,3} {1,4} {2,4} {3,4}
{1,2,3} {1,2,4} {1,3,4} {2,3,4}
{1,2,3,4}
Feature Selection
Process
[Diagram: the target data feeds the subset generation step (SG); each candidate feature subset is scored by an evaluation function (EC); if the stop criterion is not met, generation continues; otherwise the selected subset is returned]
Feature Selection
Goal functions: There are two different approaches
Filter. The goal function evaluates the subsets based on the information they contain. Measures of class separability, statistical dependencies, information theory, … are used as the goal function.
Wrapper. The goal function consists of applying the same learning technique that will be used later on the data resulting from the selection of the features. The value returned is usually the accuracy rate of the constructed classifier.
Feature Selection
Process
Fig. 7.2 A filter model for FS
Feature Selection
Filtering measures
Separability measures. They estimate the separability among classes: Euclidean, Mahalanobis, …
E.g., in a two-class problem, an FS process based on this kind of measure determines that X is better than Y if X induces a greater difference than Y between the conditional probabilities of the two classes.
Correlation. Good subsets will be those correlated with the class variable:
$f(X_1, \dots, X_M) = \dfrac{\sum_{i=1}^{M} \rho_{ic}}{\sum_{i=1}^{M} \sum_{j=i+1}^{M} \rho_{ij}}$

where $\rho_{ic}$ is the coefficient of correlation between the variable $X_i$ and the label c of the class (C), and $\rho_{ij}$ is the correlation coefficient between $X_i$ and $X_j$.
Feature Selection
Information theory based measures
Correlation can only estimate linear dependencies. A more powerful method is the mutual information $I(X_{1,\dots,M}; C)$:

$f(X_{1,\dots,M}) = I(X_{1,\dots,M}; C) = H(C) - H(C \mid X_{1,\dots,M}) = \sum_{c=1}^{C} \int_{X_{1,\dots,M}} P(X_{1 \dots M}, \omega_c) \log \dfrac{P(X_{1 \dots M}, \omega_c)}{P(X_{1 \dots M})\, P(\omega_c)}\, dx$

where H represents the entropy and $\omega_c$ the c-th label of the class C.
Mutual information measures the amount of uncertainty about the class C that is removed when the values of the vector $X_{1 \dots M}$ are known.
Due to the complexity of computing I, it is usual to use heuristic rules such as

$f(X_{1 \dots M}) = \sum_{i=1}^{M} I(X_i; C) - \beta \sum_{i=1}^{M} \sum_{j=i+1}^{M} I(X_i; X_j)$

with β = 0.5, for example.
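A hedged sketch of a greedy, MIFS-style selection driven by the heuristic above, estimating the mutual information terms with scikit-learn; β, the data set and the number of selected features are illustrative choices, not part of the original slides:

```python
# Hedged sketch: greedy forward selection scoring relevance minus beta * redundancy.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = load_wine(return_X_y=True)
beta, n_select = 0.5, 5

relevance = mutual_info_classif(X, y, random_state=0)       # I(Xi; C)
selected, remaining = [], list(range(X.shape[1]))
for _ in range(n_select):
    def score(i):
        # redundancy: sum of I(Xi; Xj) over the already selected features
        redundancy = sum(mutual_info_regression(X[:, [j]], X[:, i],
                                                random_state=0)[0]
                         for j in selected)
        return relevance[i] - beta * redundancy
    best = max(remaining, key=score)
    selected.append(best)
    remaining.remove(best)
print("Selected features:", selected)
```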
Feature Selection
Consistency measures
The three previous groups of measures try to find those features that can predict the class better than the remaining ones.
• This approach cannot distinguish between two attributes that are equally appropriate, and it does not detect redundant features.
Consistency measures try to find a minimum number of
features that are able to separate the classes in the same
way that the original data set does.
Feature Selection
Process
Fig. 7.2 A wrapper model for FS
Feature Selection
Advantages
Wrappers:
Accuracy: generally, they are more accurate than filters, due to the interaction between the classifier used in the goal function and the training data set.
Generalization capability: they have the capacity to avoid overfitting thanks to the validation techniques employed.
Filters:
Fast: they usually compute frequencies, which is much quicker than training a classifier.
Generality: since they evaluate intrinsic properties of the data rather than their interaction with a classifier, they can be used in any problem.
Feature Selection
Drawbacks
Wrappers:
Very costly: each evaluation requires learning and validating a model, which is prohibitive for complex classifiers.
Ad-hoc solutions: the solutions are biased toward the classifier used.
Filters:
Tendency to include many variables: normally this is due to the presence of monotone features in the goal function used.
• The user has to set the threshold at which to stop.
Feature Selection
Categories
1. According to the evaluation:  filter, wrapper
2. According to class availability:  supervised, unsupervised
3. According to the search:  complete O(2^N), heuristic O(N^2), random
4. According to the outcome:  ranking, subset of features
Feature Selection
Algorithms for obtaining a subset of features
They return a subset of attributes optimized according to an evaluation criterion.
Input: x attributes, U evaluation criterion
Subset = {}
Repeat
    Sk = generateSubset(x)
    if improvement(Subset, Sk, U) then Subset = Sk
Until stopCriterion()
Output: Subset, the most relevant attributes
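A hedged Python sketch of the pseudocode above, using random subset generation and a cross-validated 1-NN as the (wrapper-style) evaluation criterion U; the data set, the classifier and the fixed number of trials used as stop criterion are illustrative assumptions:

```python
# Hedged sketch: random subset generation scored by a wrapper evaluation criterion.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
rng = np.random.default_rng(0)

def evaluate(subset):          # U: mean CV accuracy of 1-NN on the chosen columns
    return cross_val_score(KNeighborsClassifier(1), X[:, subset], y, cv=5).mean()

best_subset, best_score = None, -np.inf
for _ in range(50):            # stop criterion: fixed number of generated subsets
    sk = np.flatnonzero(rng.random(X.shape[1]) < 0.5)   # generateSubset(x)
    if sk.size and (score := evaluate(sk)) > best_score:  # improvement(Subset, Sk, U)
        best_subset, best_score = sk, score
print(best_subset, round(best_score, 3))
```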
Feature Selection
Ranking algorithms
They return a list of attributes sorted by an evaluation
criterion.
Input: x attributes, U evaluation criterion
List = {}
For each attribute xi, i ∈ {1,...,N}
    vi = compute(xi, U)
    insert xi into List according to vi
Output: List, most relevant attributes first
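A hedged Python sketch of the ranking scheme above, using the ANOVA F-score from scikit-learn as a stand-in for the evaluation criterion U; the data set is illustrative:

```python
# Hedged sketch: score every attribute independently and return them sorted, best first.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import f_classif

X, y = load_wine(return_X_y=True)
scores = f_classif(X, y)[0]            # vi = compute(xi, U), here U = ANOVA F-score
ranking = np.argsort(scores)[::-1]     # most relevant attributes first
print(ranking)
```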
Feature Selection
Ranking algorithms
Attributes A1 A2 A3 A4 A5 A6 A7 A8 A9
Ranking A5 A7 A4 A3 A1 A8 A6 A2 A9
A5 A7 A4 A3 A1 A8 (6 attributes)
Feature Selection
Some relevant algorithms:
Focus algorithm. Consistency measure for forward search
Mutual Information based Feature Selection (MIFS).
mRMR: Minimum Redundancy Maximum Relevance
Las Vegas Filter (LVF)
Las Vegas Wrapper (LVW)
Relief Algorithm
Instance Selection
Instance selection tries to choose the examples which are relevant to an application, achieving the maximum performance. The outcome of IS would be:
Less data → algorithms learn more quickly
Higher accuracy → the algorithm generalizes better
Simpler results → easier to understand
The generation of new instances (prototype generation) is an extension of IS.
Instance Selection
Different size examples
[Illustration: the same data set with 8,000 points, 2,000 points and 500 points]
Instance Selection
Sampling
[Illustration: the raw data and a simple reduction of it obtained by sampling]
Instance Selection
Fig. 8.1 PS process: a prototype selection algorithm is applied to the training data set (TR) to obtain the selected instances (S), which an instance-based classifier then uses to classify the test data set (TS)
Instance Selection
Prototype Selection (instance-based learning)
Properties:
Direction of the search: Incremental, decremental,
batch, hybrid or fixed.
Selection type: Condensation, Edition, Hybrid.
Evaluation type: Filter or wrapper.
Instance Selection
Instance Selection
A pair of classical algorithms:
Classical algorithm of condensation: Condensed Nearest Neighbor (CNN)
Incremental
It only inserts the misclassified instances into the new subset.
Dependent on the order of presentation.
It only retains borderline examples.
Instance Selection
A pair of classical algorithms:
Classical algorithm for Edition: Edited Nearest Neighbor (ENN)
Batch
It removes those instances which are wrongly classified by a k-nearest neighbor scheme (k = 3, 5 or 9).
It “smooths” the borders among classes, while retaining the rest of the points.
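A hedged sketch of the ENN idea with scikit-learn's NearestNeighbors (k = 3, on a toy data set); library implementations exist and may differ in details:

```python
# Hedged sketch: remove every instance misclassified by its k nearest neighbours.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X, y = load_iris(return_X_y=True)
k = 3

nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbour
_, idx = nn.kneighbors(X)
neigh_labels = y[idx[:, 1:]]                      # drop the point itself
# majority label among the k neighbours of each instance
pred = np.array([np.bincount(row).argmax() for row in neigh_labels])
keep = pred == y                                  # the edited set keeps correctly classified points
X_s, y_s = X[keep], y[keep]
print(len(y), "->", len(y_s), "instances after ENN")
```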
Instance Selection
Graphical illustrations:
Banana data set with 5,300 instances and two classes. Subsets obtained with CNN and AllKNN (iterative application of ENN with k = 3, 5 and 7).
Instance Selection
Graphical illustrations:
RMHC is an adaptive sampling technique based on local search with a fixed final retention rate.
DROP3 is the best-known hybrid technique, widely used for kNN.
SSMA is an evolutionary approach based on memetic algorithms.
Instance Selection
Training Set Selection
Example: Instance Selection and Decision Tree modeling
KDD Cup'99. Number of strata: 100

Method        No. Rules   % Reduction   C4.5 %Ac Trn   C4.5 %Ac Test
C4.5             252          -           99.97%         99.94%
Cnn Strat         83         81.61%       98.48%         96.43%
Drop1 Strat        3         99.97%       38.63%         34.97%
Drop2 Strat       82         76.66%       81.40%         76.58%
Drop3 Strat       49         56.74%       77.02%         75.38%
Ib2 Strat         48         82.01%       95.81%         95.05%
Ib3 Strat         74         78.92%       99.13%         96.77%
Icf Strat         68         23.62%       99.98%         99.53%
CHC Strat          9         99.68%       98.97%         97.53%
Bibliography: J.R. Cano, F. Herrera, M. Lozano, Evolutionary Stratified Training Set Selection for Extracting Classification
Rules with Trade-off Precision-Interpretability. Data and Knowledge Engineering 60 (2007) 90-108,
doi:10.1016/j.datak.2006.01.008.
Instance Selection
WEBSITE:
http://sci2s.ugr.es/pr/index.php
Bibliography:
S. García, J. Derrac, J.R. Cano and F. Herrera,
Prototype Selection for Nearest Neighbor Classification: Taxonomy and Empirical Study.
IEEE Transactions on Pattern Analysis and Machine Intelligence 34:3 (2012) 417-435 doi:
10.1109/TPAMI.2011.142
S. García, J. Luengo, F. Herrera. Data Preprocessing in Data Mining, Springer, 15, 2015
Source Codes (Java):
Discretization
Discrete values are very useful in Data Mining.
They represent more concise information, they are easier to understand and closer to the representation of knowledge.
Discretization focuses on transforming continuous values, which have an ordering among them, into nominal/categorical values without ordering. It can also be seen as a quantization of numerical attributes.
Nominal values lie within a finite domain, so discretization is also considered a data reduction technique.
Discretization
Divide the range of numerical (continuous or not) attributes into intervals.
Store the labels of the intervals.
It is crucial for association rules and some classification algorithms, which only accept discrete data.

Age:            5  6  6  9 … 15   16 16 17 20 … 24   25 41 50 65 … 67
Owner of a car: 0  0  0  0 …  0    1  0  1  1 …  0    1  1  1  1 …  1

AGE ∈ [5,15]    AGE ∈ [16,24]    AGE ∈ [25,67]
Discretization
Stages in the discretization process
Discretization
Discretization has been developed along several lines according to different needs:
Supervised vs. unsupervised: whether or not they consider the objective (class) attribute.
Dynamic vs. static: whether or not discretization is performed simultaneously with the building of the model.
Local vs. Global: Whether they consider a subset of the
instances or all of them.
Top-down vs. Bottom-up: Whether they start with an
empty list of cut points (adding new ones) or with all the
possible cut points (merging them).
Direct vs. Incremental: They make decisions all together
or one by one.
Discretization
Unsupervised algorithms:
• Equal width
• Equal frequency
• Clustering …..
Supervised algorithms:
• Entropy based [Fayyad & Irani 93 and others]
[Fayyad & Irani 93] U.M. Fayyad and K.B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. Proc. 13th Int. Joint Conf. AI (IJCAI-93), 1022-1027. Chambéry, France, Aug./Sep. 1993.
• Chi-square [Kerber 92]
[Kerber 92] R. Kerber. ChiMerge: Discretization of numeric attributes. Proc. 10th Nat. Conf. AAAI, 123-128.
1992.
• … (lots of proposals)
Bibliography: S. García, J. Luengo, José A. Sáez, V. López, F. Herrera, A Survey of Discretization
Techniques: Taxonomy and Empirical Analysis in Supervised Learning.
IEEE Transactions on Knowledge and Data Engineering 25:4 (2013) 734-750, doi: 10.1109/TKDE.2012.35.
Discretization
Example discretization: equal width
Temperature: 64 65 68 69 70 71 72 72 75 75 80 81 83 85

Interval:  [64,67) [67,70) [70,73) [73,76) [76,79) [79,82) [82,85]
Count:        2       2       4       2       0       2       2
Discretization
Example discretization: equal frequency
Temperature: 64 65 68 69 70 71 72 72 75 75 80 81 83 85

Interval:  [64 .. 69]  [70 .. 72]  [73 .. 81]  [83 .. 85]
Count:          4           4           4           2

Equal frequency (height) = 4, except for the last bin
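A hedged pandas sketch reproducing the flavour of the two examples above (equal width with 7 bins, equal frequency with 4 bins); the exact interval edges differ slightly from the slides because pandas applies its own conventions:

```python
# Hedged sketch: equal-width and equal-frequency discretization of the temperature values.
import pandas as pd

temp = pd.Series([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])

equal_width = pd.cut(temp, bins=7)    # range split into 7 equal-width intervals
equal_freq = pd.qcut(temp, q=4)       # roughly the same number of values per interval
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```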
Discretization
Which discretizer will be the best?
As usual, it will depend on the application, user requirements, etc.
Evaluation ways:
Total number of intervals
Number of inconsistencies
Predictive accuracy rate of classifiers
Data Preprocessing in Data Mining
1. Introduction. Data Preprocessing
2. Integration, Cleaning and Transformations
3. Imperfect Data
4. Data Reduction
5. Final Remarks
Final Remarks
Data preprocessing is a necessity when we
work with real applications.
[Diagram: raw data → data preprocessing → pattern extraction → interpretability of results → knowledge]
Final Remarks
Advantage: Data preprocessing allows us to apply Learning/Data Mining algorithms more easily and quickly, obtaining higher-quality models/patterns in terms of accuracy and/or interpretability.
Final Remarks
Advantage: Data preprocessing allows us to apply Learning/Data Mining algorithms more easily and quickly, obtaining higher-quality models/patterns in terms of accuracy and/or interpretability.
A drawback: Data preprocessing is not a structured area with a specific methodology for understanding the suitability of preprocessing algorithms for a new problem.
Every problem may need a different preprocessing process, using different tools.
The design of automatic processes that use the different stages/techniques is one of the challenges of data mining.
Final Remarks
KEEL software for Data Mining (knowledge extraction based
on evolutionary learning) includes a data preprocessing
module (feature selection, missing data imputation, instance
selection, discretization, …)
http://www.keel.es/
Final Remarks
Summary
Data preprocessing is a big issue for data mining
Data preprocessing includes:
Data preparation: cleaning, imperfect data, transformation, …
Data reduction and data transformation
A lot of methods have been developed, but this is still an active area of research.
The cooperation between data mining algorithms and data preparation methods is an interesting and active area.
Bibliography
Dorian Pyle
Data Preparation for Data Mining
Morgan Kaufmann, 1999
“Good data preparation is
key to produce valid and
reliable models”
S. García, J. Luengo, F. Herrera
Data Preprocessing in Data Mining
Springer, 15, 2015
Thanks!!!
Data Preprocessing