Data Preprocessing

Data preprocessing in Machine Learning is essential for transforming raw data into a format suitable for building and training models, enhancing data quality and enabling accurate predictions. It involves several steps including data cleaning, integration, transformation, and reduction to address issues like missing values, noise, and inconsistencies. Quality assessment ensures data completeness, accuracy, consistency, and validity, which are critical for effective data analysis and decision-making.

Uploaded by Karuna Salgotra
Data Preprocessing

• Data preprocessing includes the steps we need to follow to transform or encode data so that it may be easily parsed by the machine.
• For a model to be accurate and precise in its predictions, the algorithm must be able to easily interpret the data's features.
• Data preprocessing in Machine Learning is a crucial step that helps enhance the quality of data to promote the extraction of meaningful insights from it.
• Data preprocessing in Machine Learning refers to the technique of preparing (cleaning and organizing) the raw data to make it suitable for building and training Machine Learning models.
Data Preprocessing Cont..
• In simple words, data preprocessing in Machine Learning is a data
mining technique that transforms raw data into an
understandable and readable format.
Types of Data in Machine Learning

• Nominal: Denotes names or categories (e.g., Name, Gender, Pin code). Ordering is not possible; only counting is meaningful. Measure of central tendency: Mode.
• Ordinal: Values can be ordered (e.g., Low, Medium, High; Rank 1st, 2nd, 3rd), but the distance between values cannot be measured. Measure of central tendency: Median, Mode.
• Interval: Only differences are meaningful, not ratios; a true zero is absent (e.g., Year, Temperature in Celsius: 0°C does not mean there is no temperature). Measure of central tendency: Mean, Median, Mode.
• Ratio: Both differences and ratios are meaningful; a true zero is present (e.g., Height, Weight). Measure of central tendency: Mean, Median, Mode.
Need of Data Pre-processing
• The majority of real-world datasets are highly susceptible to missing, inconsistent, and noisy data due to their heterogeneous origin.

• Applying data mining algorithms to this noisy data would not give quality results, as they would fail to identify patterns effectively.

• Data preprocessing is, therefore, important to improve the overall data quality.

• Duplicate or missing values may give an incorrect view of the overall statistics of the data.

• Outliers and inconsistent data points often tend to disturb the model’s overall learning, leading to false predictions.
Need of Data Pre-processing Cont..
The quality can be checked by the following:

Accuracy: whether the data entered is correct.
Completeness: whether all required data is recorded and available.
Consistency: whether copies of the same data kept in different places match.
Timeliness: whether the data is kept up to date.
Believability: whether the data is trustworthy.
Interpretability: how easily the data can be understood.
Steps in Data Preprocessing in Machine Learning
Data Pre-processing Methods
1) Data Cleaning
Data Cleaning is done as part of data preprocessing to clean the data by filling in missing values, smoothing noisy data, resolving inconsistencies, and removing outliers.

1) Missing values: Here are a few ways to solve this issue:

a) Ignore those tuples

This method should be considered when the dataset is huge and numerous missing values are present within a tuple.

b) Fill in the missing values

There are many methods to achieve this, such as filling in the values manually, predicting the missing values using a regression method, or using numerical methods like the attribute mean.
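Filling missing values with the attribute mean can be sketched in plain Python as below (the function name and sample data are illustrative, not from the slides):

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# ages with two missing entries; the observed mean is (25 + 31 + 40 + 28) / 4 = 31
ages = [25, None, 31, 40, None, 28]
print(impute_mean(ages))  # → [25, 31, 31, 40, 31, 28]
```

Libraries such as pandas offer the same idea as a one-liner (e.g., filling a column with its mean), but the logic is the same: compute a statistic over the observed values and substitute it for the gaps.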
Data Cleaning Cont..
2. Noisy Data

This involves removing random error or variance in a measured variable. It can be done with the help of the following techniques.

a) Binning

This technique works on sorted data values to smoothen any noise present in them. The data is divided into equal-sized bins, and each bin/bucket is dealt with independently. All data in a segment can be replaced by its mean, median, or boundary values.
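Smoothing by bin means can be sketched as follows (a minimal illustration; the function name and price values are made up for the example):

```python
def smooth_by_bin_means(values, n_bins):
    """Sort values, split into equal-sized bins, replace each value by its bin mean."""
    data = sorted(values)
    size = len(data) // n_bins
    smoothed = []
    for i in range(n_bins):
        # the last bin absorbs any leftover values when the split is not exact
        bin_ = data[i * size:(i + 1) * size] if i < n_bins - 1 else data[i * size:]
        bin_mean = sum(bin_) / len(bin_)
        smoothed.extend([bin_mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# → [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Replacing the mean with `min(bin_)`/`max(bin_)` for each value would give smoothing by bin boundaries instead.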
Data Cleaning Cont..
b) Regression

This data mining technique is generally used for prediction. It helps to smoothen noise by fitting all the data points to a regression function. A linear regression equation is used if there is only one independent attribute; otherwise, polynomial equations are used.

c) Clustering

This creates groups/clusters from data having similar values. The values that do not lie in any cluster can be treated as noisy data and can be removed.
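Regression-based smoothing with a single independent attribute can be sketched with an ordinary least-squares line fit, replacing each noisy y-value by the fitted value (names and data are illustrative):

```python
def linear_fit(xs, ys):
    """Least-squares fit of y = a + b*x for one independent attribute."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def smooth_with_regression(xs, ys):
    """Replace each observed y by the value predicted by the fitted line."""
    a, b = linear_fit(xs, ys)
    return [a + b * x for x in xs]

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]  # noisy points scattered around a straight line
print(smooth_with_regression(xs, ys))
```

The smoothed values all lie exactly on the fitted line, so the random scatter around it is removed.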
Data Integration
• Data Integration is one of the data preprocessing steps used to merge the data present in multiple sources into a single larger data store, like a data warehouse.
• Data Integration is needed especially when we are aiming to solve a real-world scenario, like detecting the presence of nodules from CT scan images.
• The only option is to integrate the images from multiple medical nodes to form a larger database.
Data Integration Cont..
We might run into some issues while adopting Data Integration as one of the Data Preprocessing steps:

• Schema integration and object matching: the data can be present in different formats and with different attributes, which might cause difficulty in data integration.
• Removing redundant attributes from all data sources.
• Detection and resolution of data value conflicts.
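A toy sketch of integrating records from two sources on a shared key (the record schema and the rule that the first source wins on conflicting attributes are assumptions for the example):

```python
def integrate(source_a, source_b, key):
    """Merge records from two sources on a shared key. When both sources
    define the same attribute, source_a's value overwrites source_b's
    (a simple conflict-resolution rule)."""
    merged = {}
    for rec in source_b:
        merged[rec[key]] = dict(rec)
    for rec in source_a:
        merged.setdefault(rec[key], {}).update(rec)
    return list(merged.values())

# hypothetical patient records held on two different medical nodes
node1 = [{"id": 1, "name": "A", "age": 60}]
node2 = [{"id": 1, "scan": "ct_001.png"}, {"id": 2, "scan": "ct_002.png"}]
print(integrate(node1, node2, "id"))
```

In practice this is what a database join or `pandas.merge` does at scale; schema matching (deciding that `id` in one source means the same thing as, say, `patient_id` in another) has to happen before such a merge can run.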
Data Transformation
Once data cleaning has been done, we need to consolidate the quality data into alternate forms by changing the value, structure, or format of the data using the below-mentioned Data Transformation strategies.

a) Generalization

The low-level or granular data that we have is converted to high-level information by using concept hierarchies. For example, we can transform primitive data in an address, like the city, to higher-level information, like the country.
Data Transformation Cont..
b) Normalization

This is the most important and most widely used Data Transformation technique. Numerical attributes are scaled up or down to fit within a specified range. In this approach, we constrain a data attribute to a particular range so that different attributes become comparable. Normalization can be done in multiple ways, which are highlighted here:

Min-max normalization
Z-score normalization
Decimal scaling normalization
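The three normalization strategies can be sketched in plain Python as below (function names and the sample marks are illustrative):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10^j, where j is the digit count of the largest magnitude."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

marks = [10, 20, 30, 40, 50]
print(min_max(marks))          # → [0.0, 0.25, 0.5, 0.75, 1.0]
print(decimal_scaling(marks))  # → [0.1, 0.2, 0.3, 0.4, 0.5]
```

Min-max is sensitive to outliers (one extreme value compresses everything else), which is why z-score normalization is often preferred when outliers are present.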
Data Transformation Cont..
c) Attribute Selection

New properties of data are created from existing attributes to help in the data mining process. For example, the date_of_birth attribute can be transformed into another property, like is_senior_citizen, for each tuple, which can directly influence predicting diseases, chances of survival, etc.

d) Aggregation

This is a method of storing and presenting data in a summary format. For example, sales data can be aggregated and transformed to show totals per month and year.
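Both ideas can be sketched briefly; the record layout, the reference date, and the 60-year senior threshold are assumptions made for the example:

```python
from datetime import date

def add_is_senior_citizen(records, today=date(2024, 1, 1)):
    """Derive a new is_senior_citizen attribute from date_of_birth."""
    for r in records:
        age = (today - r["date_of_birth"]).days // 365
        r["is_senior_citizen"] = age >= 60
    return records

def monthly_sales(sales):
    """Aggregate (date, amount) pairs into per-(year, month) totals."""
    totals = {}
    for d, amount in sales:
        key = (d.year, d.month)
        totals[key] = totals.get(key, 0) + amount
    return totals

sales = [(date(2023, 1, 5), 100), (date(2023, 1, 20), 50), (date(2023, 2, 3), 70)]
print(monthly_sales(sales))  # → {(2023, 1): 150, (2023, 2): 70}
```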
Data Reduction
• The size of the dataset in a data warehouse can be too large to be handled by data analysis and data mining algorithms.

• One possible solution is to obtain a reduced representation of the dataset that is much smaller in volume but produces the same quality of analytical results.
Data Reduction Cont..
Various Data Reduction strategies:

a) Data cube aggregation

This is a way of data reduction in which the gathered data is expressed in a summary form.

b) Dimensionality reduction

In dimensionality reduction, irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed. These techniques are used to perform feature extraction. The dimensionality of a dataset refers to the attributes or individual features of the data. This technique aims to reduce the number of redundant features we consider in machine learning algorithms.
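A minimal sketch of detecting and removing redundant attributes: drop columns that are constant (carry no information) or exact duplicates of an earlier column. This is a deliberately simple redundancy check; real dimensionality reduction also uses correlation analysis or techniques like PCA, and all names and values below are made up:

```python
def drop_redundant(columns):
    """columns: dict mapping attribute name -> list of values.
    Keep a column only if it varies and is not a duplicate of a kept one."""
    kept, seen = {}, set()
    for name, vals in columns.items():
        key = tuple(vals)
        if len(set(vals)) <= 1 or key in seen:  # constant or exact duplicate
            continue
        seen.add(key)
        kept[name] = vals
    return kept

data = {
    "height_cm":      [170, 182, 165],
    "height_cm_copy": [170, 182, 165],   # duplicate of height_cm
    "country":        ["IN", "IN", "IN"],  # constant, carries no information
    "weight_kg":      [65, 80, 58],
}
print(list(drop_redundant(data)))  # → ['height_cm', 'weight_kg']
```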
Data Reduction Cont..
c) Data compression

In data compression, encoding mechanisms are used to reduce the dataset size. Methods used for data compression include the Wavelet Transform and Principal Component Analysis.

d) Discretization

Data discretization is used to divide attributes of a continuous nature into data with intervals. This is done because continuous features tend to have a smaller chance of correlation with the target variable, and the results may thus be harder to interpret. After discretizing a variable, groups corresponding to the target can be interpreted. For example, the attribute age can be discretized into bins like below 18, 18-44, 44-60, and above 60.
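The age example above can be sketched directly (the exact boundary handling, with 44 and 60 falling in the lower bin, is an assumption; the slides do not specify it):

```python
def discretize_age(age):
    """Map a continuous age to one of the interval labels used above."""
    if age < 18:
        return "below 18"
    if age <= 44:
        return "18-44"
    if age <= 60:
        return "44-60"
    return "above 60"

ages = [12, 30, 44, 59, 61]
print([discretize_age(a) for a in ages])
# → ['below 18', '18-44', '18-44', '44-60', 'above 60']
```

Libraries typically generalize this: for example, pandas provides `pd.cut` for fixed-width intervals like these and `pd.qcut` for equal-frequency bins.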
Data Reduction Cont..
e) Numerosity reduction

The data can be represented as a model or equation, like a regression model. Storing the model in place of the full dataset removes the burden of storing huge amounts of raw data.

f) Attribute subset selection

It is very important to be specific in the selection of attributes. Otherwise, it might lead to high-dimensional data, which is difficult to train due to underfitting/overfitting problems. Only attributes that add more value towards model training should be considered, and the rest can be discarded.
Data Quality Assessment
Data Quality Assessment includes the statistical approaches one needs to follow to ensure that the data has no issues. Data is to be used for operations, customer management, marketing analysis, and decision making; hence it needs to be of high quality.

The main components of Data Quality Assessment include:

• Completeness, with no missing attribute values
• Accuracy and reliability in terms of information
• Consistency across all features
• Validity of the data
• Absence of any redundancy
Data Quality Assessment Cont..
The Data Quality Assurance process involves three main activities.

Data profiling: exploring the data to identify data quality issues. Once the issues are analyzed, the data is summarized in terms of the duplicates, blank values, etc. that were identified.

Data cleaning: fixing the identified data issues.

Data monitoring: maintaining the data in a clean state and continuously checking that business needs are being satisfied by the data.
