DATA PRE-PROCESSING
in real-world applications, data can be inconsistent,
incomplete and/or noisy
▪ Errors can happen
▪ As a result, the prediction rate of the resulting models will be lower
Pre-processing produces better models, faster, because good data is a
prerequisite for producing effective models of any type
▪ Analyzing data that has not been carefully screened for
such problems can produce highly misleading results.
▪ Thus, the success of data mining projects heavily depends
on the quality of the prepared data.
▪ Data preparation is about constructing a dataset from one
or more data sources to be used for exploration and
modeling.
▪ Start with an initial dataset to get familiar with the data, to
discover first insights into the data, and to gain a good
understanding of any possible data quality issues.
Data cleaning attempts to:
▪ Fill in missing values
▪ Smooth out noisy data
▪ Correct inconsistencies
▪ Ignore the tuple with missing values;
▪ Fill in the missing values manually;
▪ Use a global constant to fill in missing values (NULL, unknown,
etc.);
▪ Use the attribute mean to fill in missing values of that
attribute;
▪ Use the attribute mean of all samples belonging to the same
class to fill in the missing values;
▪ Infer the most probable value to fill in the missing value.
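A minimal sketch of several of these strategies in pandas; the DataFrame and its "class" and "price" columns are made up for illustration:

```python
import pandas as pd

# Hypothetical data with missing prices.
df = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "price": [8.0, None, 15.0, None],
})

# Ignore the tuples with missing values.
dropped = df.dropna(subset=["price"])

# Fill with a global constant.
const_filled = df["price"].fillna(-1)

# Fill with the attribute mean.
mean_filled = df["price"].fillna(df["price"].mean())

# Fill with the attribute mean of samples in the same class.
class_filled = df.groupby("class")["price"].transform(lambda s: s.fillna(s.mean()))
```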
▪ The purpose of data smoothing is to eliminate noise.
▪ This can be done by:
✓ Binning
✓ Clustering
✓ Regression
▪ Binning smooths the data by consulting the value’s
neighborhood.
▪ It aims to remove the noise from the data set by:
[1] smoothing the data by equal-frequency bins;
[2] smoothing by bin means;
[3] smoothing by bin boundaries.
Unsorted data for price in dollars:
8, 16, 9, 15, 21, 24, 30, 26, 27, 30, 34, 21
STEP 1: Sort the data:
8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Smooth the data by equal-frequency bins:
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34
STEP 1: Sorted data: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Smooth the data by bin means:
For Bin 1: (8 + 9 + 15 + 16)/4 = 12 → Bin 1: 12, 12, 12, 12
For Bin 2: (21 + 21 + 24 + 26)/4 = 23 → Bin 2: 23, 23, 23, 23
For Bin 3: (27 + 30 + 30 + 34)/4 = 30.25 → Bin 3: 30.25, 30.25, 30.25, 30.25
STEP 1: Sorted data: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Smooth the data by bin boundaries:
▪ Pick the MIN and MAX value of each bin
▪ Put the MIN on the left side and the MAX on the right side
▪ Each middle value moves to whichever boundary (MIN or MAX) is at
the smaller distance
Bin 1: 8, 8, 16, 16
Bin 2: 21, 21, 26, 26
Bin 3: 27, 27, 27, 34
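A minimal sketch of the smoothing methods on the price data above, assuming equal-frequency bins of size 4 and breaking boundary ties toward the minimum:

```python
# Sort the data and split it into equal-frequency bins of 4 values.
data = sorted([8, 16, 9, 15, 21, 24, 30, 26, 27, 30, 34, 21])
bins = [data[i:i + 4] for i in range(0, len(data), 4)]

# Smoothing by bin means: replace every value with its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value with the closer
# of the bin's min and max (ties go to the minimum here).
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(bins)       # [[8, 9, 15, 16], [21, 21, 24, 26], [27, 30, 30, 34]]
print(by_means)   # [[12.0, ...], [23.0, ...], [30.25, ...]]
print(by_bounds)  # [[8, 8, 16, 16], [21, 21, 26, 26], [27, 27, 27, 34]]
```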
▪ Data is organized into groups of “similar” values.
▪ Rare values that fall outside these groups are considered outliers
and are discarded.
▪ Data regression consists of fitting the data to a function.
▪ A linear regression, for instance, finds the line that best fits two
variables so that one variable can predict the other.
▪ More variables can be involved in a multiple linear regression.
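A minimal sketch of regression-based smoothing with NumPy; the x and y values are made up for illustration:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # noisy observations

# Fit the least-squares line y = a*x + b ...
a, b = np.polyfit(x, y, deg=1)

# ... and replace the noisy y values with the fitted ones.
smoothed = a * x + b
```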
Data analysis may require combining data from multiple
sources into a coherent data store.
There are many challenges:
▪ Schema integration:
CID » C_number » Cust-id » cust#
▪ Semantic heterogeneity
▪ Data value conflicts (different representations or scales,
etc.)
There are many challenges:
▪ Redundant records
▪ Redundant attributes
(an attribute is redundant if it can be derived from other attributes)
▪ Correlation analysis: P(A∧B)/(P(A)P(B)),
where 1 means independent, >1 positive correlation, and <1 negative
correlation.
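A minimal sketch of this correlation measure for two binary attributes; the attribute values are made up for illustration:

```python
def lift(a, b):
    """P(A and B) / (P(A) * P(B)) over two 0/1 attribute columns."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x and y for x, y in zip(a, b)) / n
    return p_ab / (p_a * p_b)

a = [1, 1, 0, 1, 0, 1]
b = [1, 0, 0, 1, 0, 1]
print(lift(a, b))   # 1.5 here: > 1 suggests positive correlation
```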
▪ Data is sometimes in a form not appropriate for
mining.
▪ Either the algorithm at hand cannot handle it, the
form of the data is not regular, or the data itself is not
specific enough.
▪ Normalization
(to compare carrots with carrots)
▪ Smoothing
▪ Aggregation
(summary operation applied to data)
▪ Generalization
(low-level data is replaced with higher-level data via a concept
hierarchy)
Min-max normalization: linear transformation from v to v’:
▪ v’ = (v − min)/(max − min) × (new_max − new_min) + new_min
▪ Example:
transform $30,000 in [10,000..45,000] into [0..1] → (30 − 10)/35 × (1 − 0) + 0 = 0.571
Z-score normalization:
▪ normalizes v into v’ based on the attribute’s mean and standard deviation
▪ v’ = (v − mean)/standard_deviation
Normalization by decimal scaling:
▪ moves the decimal point of v by j positions, where j is the minimum
number of positions needed to make the maximum absolute value
fall below 1: v’ = v/10^j
▪ Example:
if v ranges between −56 and 9976, j = 4 →
v’ ranges between −0.0056 and 0.9976
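A minimal sketch of the three normalization methods with NumPy, on made-up values; for decimal scaling, j is computed so that the maximum absolute value falls below 1:

```python
import math
import numpy as np

v = np.array([10000.0, 30000.0, 45000.0])

# Min-max normalization into [new_min, new_max].
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization.
zscore = (v - v.mean()) / v.std()

# Decimal scaling: smallest j with max(|v'|) < 1.
j = math.ceil(math.log10(np.abs(v).max() + 1))
decimal = v / 10 ** j

print(minmax)   # [0.     0.5714 1.    ] -- matches the 0.571 example
```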
▪ The data is often too large.
▪ Reducing the data can improve performance.
▪ Data reduction consists of reducing the representation of
the data set while producing the same (or almost the same)
results.
Data reduction includes:
▪ Data cube aggregation
▪ Dimension reduction
▪ Data compression
▪ Discretization
▪ Numerosity reduction
▪ Reduce the data to the concept level needed in the analysis.
▪ Queries regarding aggregated information should be answered
using the data cube when possible.
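A minimal sketch of aggregating to the concept level needed, using a pandas group-by; the sales table and its column names are made up for illustration:

```python
import pandas as pd

# Hypothetical per-quarter sales.
sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2"],
    "amount":  [100, 120, 90, 130, 110, 125],
})

# If the analysis only needs yearly figures, keep one row per year.
yearly = sales.groupby("year")["amount"].sum()
```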
▪ Feature selection (i.e., attribute subset selection)
▪ Use heuristics: select the locally ‘best’ (or most pertinent)
attributes
▪ Decision tree induction
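A minimal sketch of one such heuristic (a swapped-in illustration, not the decision-tree method itself): rank attributes by their absolute correlation with the target and keep the top k. The data and column names are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5],
    "f2": [5, 3, 4, 1, 2],
    "f3": [2, 2, 3, 3, 3],
    "y":  [1, 2, 3, 4, 5],
})

# Score each attribute against the target and keep the k 'best'.
k = 2
scores = df.drop(columns="y").corrwith(df["y"]).abs()
best = scores.nlargest(k).index.tolist()
```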
Data compression reduces the size of data.
▪ saves storage space.
▪ saves communication time.
Data compression is beneficial if data mining algorithms can
manipulate compressed data directly without
uncompressing it.
Parametric:
▪ Regression (a model or function estimating the distribution is
kept instead of the data)
Non-parametric:
▪ Histograms
▪ Clustering
▪ Sampling
A popular data reduction technique:
▪ Divide data into buckets and store a representation of each
bucket (sum, count, etc.)
Bucket boundaries can be chosen by:
▪ Equi-width
▪ Equi-depth
▪ V-Optimal
▪ MaxDiff
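A minimal sketch of equi-width and equi-depth buckets with NumPy, reusing the price data from the binning example:

```python
import numpy as np

data = np.sort(np.array([8, 16, 9, 15, 21, 24, 30, 26, 27, 30, 34, 21]))

# Equi-width: 3 buckets of equal value range; store counts per bucket.
counts, edges = np.histogram(data, bins=3)

# Equi-depth: 3 buckets with (roughly) equal numbers of values;
# store a small summary (min, max, count, sum) per bucket.
depth_buckets = np.array_split(data, 3)
summaries = [(b.min(), b.max(), len(b), b.sum()) for b in depth_buckets]
```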
▪ Partition data into clusters based on “closeness” in space.
▪ Retain representatives of clusters (centroids) and outliers.
▪ Effectiveness depends on the distribution of the data.
▪ Hierarchical clustering is possible (multi-resolution).
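A minimal sketch of clustering-based reduction with scikit-learn's KMeans; the 2-D points are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.9],
                   [5.0, 5.0], [5.1, 4.8],
                   [9.0, 1.0], [8.9, 1.2]])

# Partition into 3 clusters and keep only the centroids
# as representatives of the full data set.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
representatives = km.cluster_centers_
```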
Allows a large data set to be represented by a much smaller
random sample (sub-set) of the data.
▪ Simple random sample without replacement (SRSWOR)
▪ Simple random sample with replacement (SRSWR)
▪ Cluster sample (SRSWOR or SRSWR from clusters)
▪ Stratified sample
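A minimal sketch of SRSWOR, SRSWR, and a stratified sample with pandas; the "group" column used for stratification is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"group": ["A"] * 6 + ["B"] * 4, "value": range(10)})

# SRSWOR: each row can be drawn at most once.
srswor = df.sample(n=5, replace=False, random_state=0)

# SRSWR: rows can be drawn repeatedly.
srswr = df.sample(n=5, replace=True, random_state=0)

# Stratified sample: the same fraction from every group.
stratified = df.groupby("group", group_keys=False).apply(
    lambda g: g.sample(frac=0.5, random_state=0))
```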
▪ Discretization is used to reduce the number of values for
a given continuous attribute, by dividing the range of the
attribute into intervals.
▪ Discretization can reduce the data set, and can also be used
to generate concept hierarchies automatically.
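A minimal sketch of interval discretization with pandas (equal-width via cut, equal-frequency via qcut), reusing the price data; the interval labels are made up:

```python
import pandas as pd

prices = pd.Series([8, 16, 9, 15, 21, 24, 30, 26, 27, 30, 34, 21])

# Divide the attribute's range into 3 equal-width intervals.
equal_width = pd.cut(prices, bins=3, labels=["low", "mid", "high"])

# Divide into 3 equal-frequency intervals instead.
equal_freq = pd.qcut(prices, q=3, labels=["low", "mid", "high"])
```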