Data Preprocessing
Week 2
Topics
Data Types
Data Repositories
Data Preprocessing
Present homework assignment #1
Team Homework Assignment #2
Read pp. 227-240, p. 250, and pp. 259-263 of the textbook.
Do Examples 5.3, 5.4, 5.8, 5.9, and Exercise 5.5.
Write an R program to verify your answer for Exercise 5.5.
Refer to pp. 453-458 of the lab book.
Explore frequent pattern mining tools and play with them for Exercise 5.5.
Prepare for the results of the homework assignment.
Due date: beginning of the lecture on Friday, February 11th.
Team Homework Assignment #3
Prepare a one-page description of your group project topic.
Prepare a presentation using slides.
Due date: beginning of the lecture on Friday, February 11th.
Figure 1.4 Data mining as a step in the process of knowledge discovery.
Why Is Data Preprocessing Important?
Welcome to the real world!
No quality data, no quality mining results!
Preprocessing is one of the most critical steps in a data mining process.
Major Tasks in Data Preprocessing
Figure 2.1 Forms of data preprocessing
Why Is Data Preprocessing Beneficial to Data Mining?
Less data: data mining methods can learn faster.
Higher accuracy: data mining methods can generalize better.
Simple results: they are easier to understand.
Fewer attributes: for the next round of data collection, savings can be made by removing redundant and irrelevant features.
Data Cleaning
Remarks on Data Cleaning
"Data cleaning is one of the biggest problems in data warehousing." (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing." (DCI survey)
Why Is Data Dirty?
Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases. (p. 48)
There are many possible reasons for noisy data. (p. 48)
Types of Dirty Data and Cleaning Methods
Missing values: fill in missing values.
Noisy data (incorrect values): identify outliers and smooth out noisy data.
Methods for Missing Values (1)
Ignore the tuple.
Fill in the missing value manually.
Use a global constant to fill in the missing value.
Methods for Missing Values (2)
Use the attribute mean to fill in the missing value.
Use the attribute mean for all samples belonging to the same class as the given tuple.
Use the most probable value to fill in the missing value.
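A minimal R sketch of the attribute-mean and class-conditional-mean methods; the data frame df and its columns income and risk are hypothetical names:

# Hypothetical data: income has missing values, risk is the class column
df <- data.frame(
  income = c(30, 45, NA, 52, NA, 61),
  risk   = c("low", "low", "low", "high", "high", "high")
)

# (a) Fill in with the overall attribute mean
df$income_mean <- ifelse(is.na(df$income),
                         mean(df$income, na.rm = TRUE),
                         df$income)

# (b) Fill in with the mean of samples in the same class
class_means <- ave(df$income, df$risk,
                   FUN = function(x) mean(x, na.rm = TRUE))
df$income_class <- ifelse(is.na(df$income), class_means, df$income)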
Methods for Noisy Data
Binning
Regression
Clustering
Binning
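A minimal R sketch of equal-frequency (equi-depth) binning with smoothing by bin means; the sorted price list is illustrative:

prices <- c(4, 8, 15, 21, 21, 24, 25, 28, 34)   # already sorted
n_bins <- 3
bins <- split(prices, rep(1:n_bins, each = length(prices) / n_bins))

# Smooth each value by replacing it with its bin mean
smoothed <- unlist(lapply(bins, function(b) rep(mean(b), length(b))))
print(smoothed)   # 9 9 9 22 22 22 29 29 29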
Regression
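A minimal R sketch of smoothing by regression: fit a linear model and replace the noisy attribute with the fitted values (simulated data):

set.seed(7)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 3)   # noisy attribute (simulated)
fit <- lm(y ~ x)                 # fit the line y = w*x + b
y_smoothed <- fitted(fit)        # fitted values replace the noisy y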
Clustering
Figure 2.12 A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster centroid is marked with a +, representing the average point in space for that cluster. Outliers may be detected as values that fall outside of the sets of clusters.
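A minimal R sketch in the spirit of Figure 2.12: cluster simulated 2-D points with k-means and flag points far from their assigned centroid as outliers; the data and the three-sigma threshold are illustrative choices:

set.seed(1)
pts <- rbind(matrix(rnorm(60, mean = 0,  sd = 0.3), ncol = 2),
             matrix(rnorm(60, mean = 3,  sd = 0.3), ncol = 2),
             matrix(rnorm(60, mean = -3, sd = 0.3), ncol = 2),
             c(1.5, 1.5))                  # a point between clusters
km <- kmeans(pts, centers = 3, nstart = 10)

# Distance of each point to its assigned cluster centroid
d <- sqrt(rowSums((pts - km$centers[km$cluster, ])^2))
which(d > mean(d) + 3 * sd(d))             # index of the outlier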
Data Integration
Data Integration
Schema integration and object matching
Entity identification problem
Redundant data (between attributes) often occurs when integrating multiple databases
Redundant attributes may be detected by correlation analysis and the chi-square method
Schema Integration and Object Matching
custom_id and cust_number (schema conflict)
H and S, and 1 and 2, for pay_type in one database (value conflict)
Solutions: metadata (data about data)
Detecting Redundancy (1)
If an attribute can be derived from another attribute or a set of attributes, it may be redundant.
Detecting Redundancy (2)
Some redundancies can be detected by correlation analysis:
Correlation coefficient for numeric data
Chi-square test for categorical data
These can also be used for data reduction.
Chi-square Test
For categorical (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a χ² test.
Given the degrees of freedom, the value of χ² is used to decide correlation based on a significance level.
Chi-square Test for Categorical Data

\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}

(p. 68)

The larger the χ² value, the more likely the variables are related.
Chi-square Test

            fiction   non_fiction   Total
male            250            50     300
female          200          1000    1200
Total           450          1050    1500

Table 2.2 A 2 x 2 contingency table for the data of Example 2.1.

Are gender and preferred_reading correlated?
The χ² statistic tests the hypothesis that gender and preferred_reading are independent. The test is based on a significance level, with (r - 1) x (c - 1) degrees of freedom.
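Since the homework asks for R programs, here is a quick check of Example 2.1 with R's built-in chisq.test(); correct = FALSE disables Yates' continuity correction so the statistic matches the hand calculation:

# Rows: gender (male, female); columns: preferred_reading
tbl <- matrix(c(250,   50,
                200, 1000),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender  = c("male", "female"),
                              reading = c("fiction", "non_fiction")))

# Reports chi-square = 507.93 with df = (2-1)(2-1) = 1
chisq.test(tbl, correct = FALSE)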
Table of Percentage Points of the χ² Distribution
Correlation Coefficient
r_{A,B} = \frac{\sum_{i=1}^{N} (a_i - \bar{A})(b_i - \bar{B})}{N \sigma_A \sigma_B} = \frac{\sum_{i=1}^{N} a_i b_i - N \bar{A} \bar{B}}{N \sigma_A \sigma_B}, \qquad -1 \le r_{A,B} \le +1

(p. 68)
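A minimal R sketch of the correlation coefficient on two illustrative numeric vectors; note that R's sd() is the sample (N - 1) form, so the manual denominator below uses N - 1 and yields the same value as the population-N form above:

a <- c(12, 15, 18, 22, 30)   # illustrative numeric attribute A
b <- c(20, 24, 29, 35, 50)   # illustrative numeric attribute B

cor(a, b)   # built-in Pearson correlation r_{A,B}

# Same value by the formula, with the (N - 1) sample standard deviations
sum((a - mean(a)) * (b - mean(b))) / ((length(a) - 1) * sd(a) * sd(b))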
http://upload.wikimedia.org/wikipedia/commons/0/02/Correlation_examples.png
Data Transformation
Data Transformation/Consolidation
Smoothing
Aggregation
Generalization
Normalization
Attribute construction
Smoothing
Remove noise from the data.
Binning, regression, and clustering.
Data Normalization

Min-max normalization:
v' = \frac{v - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A

z-score normalization:
v' = \frac{v - \bar{A}}{\sigma_A}

Normalization by decimal scaling:
v' = \frac{v}{10^j}, where j is the smallest integer such that \max(|v'|) < 1
Data Normalization
Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. Do min-max normalization, z-score normalization, and decimal scaling for the attribute income.
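A worked R sketch of this exercise; the sample income values, and the mean ($54,000) and standard deviation ($16,000) assumed for z-score normalization, are illustrative:

income <- c(12000, 35000, 73600, 98000)   # illustrative sample values

# Min-max normalization to [0.0, 1.0]
(income - 12000) / (98000 - 12000)        # e.g., 73600 maps to 0.716

# z-score normalization (mean 54000 and sd 16000 assumed for illustration)
(income - 54000) / 16000                  # e.g., 73600 maps to 1.225

# Decimal scaling: j = 5 is the smallest integer with max(|v / 10^j|) < 1
income / 10^5                             # e.g., 73600 maps to 0.736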
Attribute Construction
New attributes are constructed from the given attributes and added in order to help improve accuracy and understanding of structure in high-dimensional data.
Example: add the attribute area based on the attributes height and width.
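A one-line R sketch of this example, using a hypothetical data frame:

df <- data.frame(height = c(2, 3, 4), width = c(5, 6, 7))   # hypothetical
df$area <- df$height * df$width                             # constructed attribute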
Data Reduction
Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
Data Reduction
(Data Cube) Aggregation
Attribute (Subset) Selection
Dimensionality Reduction
Numerosity Reduction
Data Discretization
Concept Hierarchy Generation
The Curse of Dimensionality (1)
Size: the size of a data set yielding the same density of data points in an n-dimensional space increases exponentially with the number of dimensions.
Radius: a larger radius is needed to enclose a fraction of the data points in a high-dimensional space.
The Curse of Dimensionality (2)
Distance: almost every point is closer to an edge than to another sample point in a high-dimensional space.
Outlier: almost every point is an outlier in a high-dimensional space.
Data Cube Aggregation
Summarize (aggregate) data based on dimensions.
The resulting data set is smaller in volume, without loss of information necessary for the analysis task.
Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple levels of abstraction.
Data Aggregation
Figure 2.13 Sales data for a given branch of AllElectronics for the years 2002 to 2004. On the left, the sales are shown per quarter. On the right, the data are aggregated to provide the annual sales.
Data Cube
Provides fast access to precomputed, summarized data, thereby benefiting online analytical processing as well as data mining.
Data Cube - Example
Figure 2.14 A data cube for sales at AllElectronics
Attribute Subset Selection (1)
Attribute selection can help in the phases of the data mining (knowledge discovery) process.
By attribute selection:
we can improve data mining performance (speed of learning, predictive accuracy, or simplicity of rules);
we can visualize the data for the selected model;
we can reduce dimensionality and remove noise.
Attribute Subset Selection (2)
Attribute (feature) selection is a search problem.
Search directions:
(Sequential) forward selection
(Sequential) backward selection (elimination)
Bidirectional selection
Decision tree algorithm (induction)
Attribute Subset Selection (3)
Attribute (feature) selection is a search problem.
Search strategies:
Exhaustive search
Heuristic search
Selection criteria:
Statistical significance
Information gain
etc.
Attribute Subset Selection (4)
Figure 2.15. Greedy (heuristic) methods for attribute subset selection
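A minimal R sketch of greedy forward selection using the built-in step() function; note that step() uses AIC as its selection criterion, a stand-in for the criteria listed above, and the data frame and columns are simulated:

set.seed(42)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- 2 * df$x1 - df$x2 + rnorm(100, sd = 0.1)   # x3 is irrelevant noise

# Start from the empty model and greedily add the best attribute each step
null_model <- lm(y ~ 1, data = df)
step(null_model, scope = ~ x1 + x2 + x3, direction = "forward")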
Data Discretization
Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
Interval labels can then be used to replace actual data values.
Split (top-down) vs. merge (bottom-up).
Discretization can be performed recursively on an attribute.
Why Is Discretization Used?
To reduce data size.
To transform quantitative data into qualitative data.
Interval Merge by χ² Analysis
Merging-based (bottom-up) approach.
Merge: find the best neighboring intervals and merge them to form larger intervals, recursively.
ChiMerge [Kerber, AAAI 1992; see also Liu et al., DMKD 2002]
Initially, each distinct value of a numerical attribute A is considered to be one interval.
χ² tests are performed for every pair of adjacent intervals.
Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions.
This merge process proceeds recursively until a predefined stopping criterion is met.
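A minimal R sketch of the core ChiMerge step, computing χ² for one pair of adjacent intervals from illustrative class counts (small counts trigger an approximation warning, which is fine for a sketch):

# Counts of each class within two adjacent intervals (illustrative)
interval_a <- c(class1 = 4, class2 = 1)
interval_b <- c(class1 = 3, class2 = 2)
tbl <- rbind(interval_a, interval_b)

# A low chi-square value means similar class distributions,
# so this pair is a good candidate for merging
chisq.test(tbl, correct = FALSE)$statistic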
Entropy-Based Discretization
The goal of this algorithm is to find the split with the maximum information gain.
The boundary that minimizes the entropy over all possible boundaries is selected.
The process is recursively applied to the partitions obtained until some stopping criterion is met.
Such a boundary may reduce data size and improve classification accuracy.
What Is Entropy?
Entropy is a measure of the uncertainty associated with a random variable.
As uncertainty and/or randomness increases for a result set, so does the entropy.
Values range from 0 to 1 to represent the entropy of information.
Entropy Example

Entropy Example (cont'd)
Calculating Entropy
For m classes:
Entropy(S) = -\sum_{i=1}^{m} p_i \log_2 p_i

For 2 classes:
Entropy(S) = -p_1 \log_2 p_1 - p_2 \log_2 p_2

Calculated based on the class distribution of the samples in set S:
p_i is the probability of class i in S
m is the number of classes (class values)
Calculating Entropy From a Split
The entropy of subsets S_1 and S_2 is calculated.
The calculations are weighted by their probability of being in set S and summed.
In the formula below:
S is the set
T is the value used to split S into S_1 and S_2

E(S, T) = \frac{|S_1|}{|S|} Entropy(S_1) + \frac{|S_2|}{|S|} Entropy(S_2)
Calculating Information Gain
Information gain = the difference in entropy between the original set (S) and the weighted split (S_1 + S_2):

Gain(S, T) = Entropy(S) - E(S, T)

Gain(S, 56) = 0.991076 - 0.766289 = 0.224788

compare to

Gain(S, 46) = 0.091091
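A minimal R sketch of the entropy and information-gain formulas above; the example values, labels, and split point are illustrative:

# Entropy(S) from the class distribution of a label vector
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Gain(S, T) = Entropy(S) - E(S, T) for a candidate split point T
info_gain <- function(values, labels, T) {
  left <- values <= T
  e_split <- mean(left)  * entropy(labels[left]) +
             mean(!left) * entropy(labels[!left])
  entropy(labels) - e_split
}

# e.g., for an illustrative attribute and a candidate split at T = 56:
# info_gain(c(40, 48, 60, 72), c("no", "no", "yes", "yes"), T = 56)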
Numeric Concept Hierarchy
A concept hierarchy for a given numerical attribute defines a discretization of the attribute.
Recursively reduce the data by collecting and replacing low-level concepts with higher-level concepts.
A Concept Hierarchy for the Attribute Price
Figure 2.22. A concept hierarchy for the attribute price.
Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals:
If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals.
If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals.
If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals.
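A simplified one-level R sketch of the 3-4-5 rule; real use first rounds the range endpoints to the most significant digit, which is omitted here:

partition_345 <- function(low, high) {
  msd    <- 10^floor(log10(high - low))   # unit of the most significant digit
  n_vals <- round((high - low) / msd)     # distinct values at that digit
  n_int  <- if (n_vals %in% c(3, 6, 7, 9)) 3
            else if (n_vals %in% c(2, 4, 8)) 4
            else if (n_vals %in% c(1, 5, 10)) 5
            else NA
  seq(low, high, length.out = n_int + 1)  # equi-width interval boundaries
}

partition_345(-1000000, 2000000)   # -1e+06 0e+00 1e+06 2e+06: 3 intervals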
Figure 2.23. Automatic generation of a concept hierarchy for profit based on the 3-4-5 rule.
Concept Hierarchy Generation for Categorical Data
Specification of a partial ordering of attributes explicitly at the schema level by users or experts
Specification of a portion of a hierarchy by explicit data grouping
Specification of a set of attributes, but not of their partial ordering
Automatic Concept Hierarchy Generation
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values

Based on the number of distinct values per attribute. (p. 95)
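A minimal R sketch of this heuristic, ordering attributes by their distinct-value counts over a simulated location table; the attribute with the fewest distinct values goes to the top of the hierarchy:

set.seed(3)
loc <- data.frame(country  = sample(LETTERS[1:15], 500, replace = TRUE),
                  province = sample(1:365,  500, replace = TRUE),
                  city     = sample(1:3567, 500, replace = TRUE))

n_distinct <- sapply(loc, function(col) length(unique(col)))
names(sort(n_distinct))   # hierarchy top first: country, province, city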
Data preprocessing
- Data cleaning
  - Missing values: use the most probable value to fill in the missing value (and five other methods)
  - Noisy data: binning; regression; clustering
- Data integration
  - Entity ID problem: metadata
  - Redundancy: correlation analysis (correlation coefficient, chi-square test)
- Data transformation
  - Smoothing (data cleaning)
  - Aggregation (data reduction)
  - Generalization (data reduction)
  - Normalization: min-max; z-score; decimal scaling
  - Attribute construction
- Data reduction
  - Data cube aggregation: data cubes store multidimensional aggregated information
  - Attribute subset selection: stepwise forward selection; stepwise backward selection; combination; decision tree induction
  - Dimensionality reduction: discrete wavelet transforms (DWT); principal components analysis (PCA)
  - Numerosity reduction: regression and log-linear models; histograms; clustering; sampling
- Data discretization
  - Binning; histogram analysis; entropy-based discretization; interval merging by chi-square analysis; cluster analysis; intuitive partitioning
- Concept hierarchy
  - Concept hierarchy generation