
Data Preprocessing Techniques

Data Preprocessing is the process of transforming raw, unorganized, and often incomplete data into a clean, structured format that is suitable for analysis or modeling.
It involves tasks such as:
1. Handling missing values
2. Encoding categorical variables (Encoding Categorical Data)
3. Scaling numerical features (Feature Scaling)
4. Removing inconsistencies
Together, these steps ensure that the dataset is accurate, consistent, and ready for statistical analysis or machine learning.

Handling Missing Data

Missing data occurs when no value is stored for a variable in an observation. If not addressed
properly, it can reduce statistical power, bias results, and affect model performance.
Techniques for Handling Missing Data
A. Imputation Methods
B. Deletion Methods

A. Imputation Methods
Imputation involves filling in missing values with estimated or plausible values based on other
available information in the dataset. This approach aims to retain the full dataset and prevent
information loss associated with removing incomplete cases.

Advantages of imputation methods
• Preserves sample size: Maintains the original dataset size, reducing potential loss of valuable information.
• Reduces bias: If performed appropriately, especially for data that is Missing at Random (MAR) or Missing Completely at Random (MCAR), imputation can help avoid the bias that could arise from deleting incomplete records.
• Allows for the use of standard analysis techniques: Once missing values are imputed, standard statistical and machine learning methods can be applied to the complete dataset.

Disadvantages of imputation methods
• Potential for introducing bias: If the imputation method is not appropriate or the missing data mechanism is not understood, imputation can introduce bias and distort the relationships between variables.
• Underestimation of standard errors: Some single imputation methods can underestimate standard errors, potentially overstating the precision of the results.
• Increased complexity: Advanced imputation methods like multiple imputation are more complex to implement and may require specialized knowledge.
• Computationally intensive: Some advanced imputation methods can be computationally demanding, particularly for large datasets.

These involve filling in missing values using estimates based on the available data (a short example follows this list).
1. Mean Imputation: Replace missing values with the mean of the observed values of that variable (works for MCAR, roughly symmetric data). Simple, but it can reduce variability.
2. Median Imputation: Replace missing values with the median, which is more robust to outliers than the mean (better for skewed data).
3. Mode Imputation: Replace missing values with the most frequent value; typically applied to categorical variables.
4. Advanced Methods: Regression imputation, KNN imputation, and multiple imputation.
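A minimal sketch of mean, median, and mode imputation in Python, using pandas and scikit-learn's SimpleImputer on a small made-up DataFrame (the Age and City columns and their values are illustrative assumptions, not from the text):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing values in a numeric and a categorical column
df = pd.DataFrame({
    "Age":  [22, 25, None, 30, None, 28],
    "City": ["Chennai", None, "Madurai", "Chennai", "Madurai", None],
})

# Mean imputation for the numeric column (SimpleImputer expects 2D input)
df["Age_mean"] = SimpleImputer(strategy="mean").fit_transform(df[["Age"]]).ravel()

# Median imputation: more robust to outliers than the mean
df["Age_median"] = df["Age"].fillna(df["Age"].median())

# Mode (most frequent value) imputation for the categorical column
df["City_mode"] = df["City"].fillna(df["City"].mode()[0])

print(df)
```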

B. Deletion Methods
Remove records or variables with missing data. Deletion methods involve removing cases
(rows) or variables (columns) that contain missing values. While simpler to implement,
deletion can lead to a reduction in sample size and potentially introduce bias if the
missingness is not Missing Completely at Random (MCAR).
Advantages of deletion methods
• Simplicity and ease of implementation: Deletion methods are relatively straightforward to understand and implement.
• Creates a complete dataset: The resulting dataset after deletion is free of missing values, allowing the use of standard analysis techniques designed for complete datasets.
Disadvantages of deletion methods
• Loss of information: Deleting cases or variables with missing values can lead to a significant reduction in the usable data, which reduces the statistical power of the analysis.
• Potential for introducing bias: If the missingness is related to the outcome or to other variables, deleting incomplete cases can introduce bias into the analysis.
• Reduced generalizability: A smaller sample size after deletion may lead to results that are less generalizable to the broader population.
Types of deletion methods (a short example follows this list)
• Listwise Deletion: Entire records (rows) with any missing values are removed from the dataset.
• Pairwise Deletion: Uses the available data for each analysis, ignoring missing values on a case-by-case basis.
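A minimal sketch of the deletion approaches in pandas, on a small made-up DataFrame (the column names are illustrative assumptions). Listwise deletion drops whole rows, while pairwise deletion simply uses whatever data is available for each calculation:

```python
import pandas as pd

# Hypothetical dataset with scattered missing values
df = pd.DataFrame({
    "Age":    [22, 25, None, 30],
    "Income": [30000, None, 42000, 50000],
    "City":   ["Chennai", "Madurai", None, "Chennai"],
})

# Listwise deletion: drop every row that has at least one missing value
listwise = df.dropna()

# Dropping a variable (column) that has too many missing values
no_city = df.drop(columns=["City"])

# Pairwise deletion happens at analysis time: each statistic uses whatever
# pairs of observations are available (pandas' corr() does this by default)
pairwise_corr = df[["Age", "Income"]].corr()

print(listwise)
print(pairwise_corr)
```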

Choosing between imputation and deletion methods


The decision of whether to use imputation or deletion methods depends on several factors, including:
• Amount of missing data: For a small amount of missing data, deletion might be acceptable, particularly if the data is MCAR.
• Mechanism of missingness: Understanding why the data is missing (MCAR, MAR, or MNAR) is crucial.
• Type of data (numerical or categorical): Some imputation methods are better suited for specific data types.
• Research goals and desired level of accuracy: More complex and robust methods like multiple imputation might be needed for analyses requiring high accuracy and valid statistical inferences.

Encoding categorical variables (Encoding Categorical Data)


Many statistical models and machine learning algorithms require numerical inputs, but real-world datasets often contain categorical variables (e.g., Gender, Color, City).
Encoding categorical data is the process of converting these categories into numerical form while preserving the information they carry. It is a crucial step in data preprocessing, especially for statistical and machine learning models that require numerical input. Here's a detailed review of the main encoding methods:
Types of Categorical Data
• Nominal Data: Categories without any order (e.g., colors like red, blue, green).
• Ordinal Data: Categories with a meaningful order or ranking (e.g., low, medium, high).
Common Encoding Techniques
• One-Hot Encoding
• Label Encoding
• Ordinal Encoding
• Effect Encoding (Deviation Encoding)
• Count (Frequency) Encoding

One-Hot Encoding
• Converts each category into a binary vector, with one column per category.
• For each data point, the column corresponding to its category is set to 1 and all others to 0.
• Ideal for nominal data because it does not imply any order.
• Disadvantage: increases dimensionality, which can lead to computational challenges, especially with high-cardinality features.
• Useful for linear models and algorithms that cannot handle categorical data directly.
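A minimal sketch of one-hot encoding with pandas' get_dummies (the Color column and its values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical nominal feature
df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# One binary (0/1) column per category; no ordering is implied
one_hot = pd.get_dummies(df["Color"], prefix="Color").astype(int)
print(one_hot)
```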
Label Encoding
• Assigns an integer value to each category.
• Suitable for ordinal data where the order matters.
• Simple and efficient.
• Risk: for nominal data, it may wrongly imply ordinality, leading to model bias.
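A minimal sketch with scikit-learn's LabelEncoder (the category values are assumptions). Note that LabelEncoder assigns integers in alphabetical order, so for truly ordinal features the explicit ordering shown under ordinal encoding below is usually preferable:

```python
from sklearn.preprocessing import LabelEncoder

sizes = ["low", "high", "medium", "low", "high"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sizes)

print(list(encoder.classes_))  # ['high', 'low', 'medium'] (alphabetical order)
print(encoded)                 # [1 0 2 1 0]
```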
Ordinal Encoding
• Similar to label encoding, but explicitly used for ordinal categories with a meaningful order.
• The assigned values reflect the order, but models may treat them as evenly spaced (a linear relationship), which is not always appropriate.
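A minimal sketch with scikit-learn's OrdinalEncoder, passing the categories in their meaningful order (the Priority column and its levels are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Priority": ["low", "high", "medium", "low"]})

# Supply the order explicitly so that low=0, medium=1, high=2
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["Priority_encoded"] = encoder.fit_transform(df[["Priority"]]).ravel()

print(df)
```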
Effect Encoding (Deviation Encoding)
• Uses the values 1, 0, and -1 instead of binary 0 and 1.
• Helps handle multicollinearity in linear models and makes coefficients interpretable.
• Used in linear regression and ANOVA models.
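A minimal pandas sketch of effect (deviation) coding built by hand from one-hot dummies, using a made-up Color column and treating "Green" as the reference category (both choices are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red", "Green"]})

# Start from one-hot dummies, then drop the reference category's column
effect = pd.get_dummies(df["Color"]).astype(int)
reference = "Green"
effect = effect.drop(columns=[reference])

# Rows belonging to the reference category get -1 in every remaining column
effect.loc[df["Color"] == reference, :] = -1

print(effect)  # columns Blue and Red with values in {1, 0, -1}
```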
Count (Frequency) Encoding
• Converts each category into its frequency of occurrence in the dataset.
• Reduces dimensionality compared to one-hot encoding.
• Might lose distinctiveness between categories that share the same frequency.
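A minimal sketch of count (frequency) encoding with pandas (the City column and its values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"City": ["Chennai", "Madurai", "Chennai", "Salem", "Chennai", "Madurai"]})

# Map each category to how many times it appears in the dataset
counts = df["City"].value_counts()
df["City_count"] = df["City"].map(counts)

# Divide by len(df) instead if a relative frequency is preferred
print(df)
```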

| Encoding Method | Suitable Data Type | Key Advantage | Key Disadvantage | Common Use Cases |
|---|---|---|---|---|
| One-Hot Encoding | Nominal | No ordinality implied | High dimensionality | Linear models, neural networks |
| Label Encoding | Ordinal | Simple and preserves order | Misleading for nominal data | Ordinal features, tree-based models |
| Ordinal Encoding | Ordinal | Preserves meaningful order | Assumes linearity in order | Ordered categories |
| Effect Encoding | Ordinal/Nominal | Handles multicollinearity well | More complex | Linear regression, ANOVA |
| Count Encoding | Nominal/Ordinal | Reduces dimensionality | Loses category distinctiveness | High-cardinality categorical features |

Scaling Numerical Features (Feature Scaling)


Feature scaling is the process of transforming numerical variables so that they are on a
similar scale. It is important because many statistical methods and machine learning
algorithms are sensitive to the magnitude of features — large-scale variables can dominate
small-scale variables in computations like distance measurement or gradient updates.
1. Standardization (Z-score Scaling)
2. Normalization (Min-Max Scaling)
Standardization (Z-score Scaling)
Definition:
Standardization, often referred to as Z-score scaling, is a feature scaling technique that
transforms numerical data to have a mean of 0 and a standard deviation of 1. This process
helps put features on a common scale without distorting differences in the ranges of values,
which is important for many statistical analyses and machine learning algorithms.
The Z-score of a data point x is calculated by subtracting the population mean μ from the value and then dividing by the population standard deviation σ:

z = (x - μ) / σ
Advantages:
• Makes variables unitless (removes the effect of different measurement units).
• Preserves the shape of the original distribution.
• Suitable for algorithms that assume a normal distribution of features (e.g., Linear Regression, PCA).
Disadvantages:
• Does not bound values to a fixed range.
• Can still be affected by extreme outliers.
Example Python Code:
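A minimal illustrative sketch using scikit-learn's StandardScaler on a small set of made-up Age values; the expected output appears in the comments:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature: five ages on their original scale
ages = np.array([[20.0], [30.0], [40.0], [50.0], [60.0]])

scaler = StandardScaler()            # z = (x - mean) / std
scaled = scaler.fit_transform(ages)

print(scaler.mean_)    # [40.]
print(scaled.ravel())  # approx. [-1.414 -0.707  0.     0.707  1.414]
```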

Normalization (Min-Max Scaling)


Definition:
Normalization (Min-Max Scaling) is a feature scaling method that transforms numerical
values into a fixed range, usually [0, 1], by linearly rescaling the data.
This ensures that the smallest value becomes 0 and the largest becomes 1, with all other
values proportionally adjusted in between.

Advantages:
• All features lie within the same range, preventing large-scale features from dominating.
• Ideal for algorithms that use distance-based measures (e.g., KNN, K-means, Neural Networks).
Disadvantages:
• Sensitive to outliers: a single extreme value can compress all other values into a narrow part of the range.
• Does not change the shape of the distribution.

Example Python Code:
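A minimal illustrative sketch using scikit-learn's MinMaxScaler on the same kind of made-up Age values; the expected output appears in the comments:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature: five ages on their original scale
ages = np.array([[20.0], [30.0], [40.0], [50.0], [60.0]])

scaler = MinMaxScaler()              # x' = (x - min) / (max - min), range [0, 1]
scaled = scaler.fit_transform(ages)

print(scaled.ravel())  # [0.   0.25 0.5  0.75 1.  ]
```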



Removing inconsistencies
Definition:
Removing inconsistencies is the process of detecting and correcting contradictory,
illogical, or irregular data entries so that the dataset is accurate, consistent, and reliable for
analysis. Inconsistencies often arise due to human error, multiple data sources, or system
glitches.
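A minimal pandas sketch of detecting and correcting a few common inconsistencies (mixed case, stray spaces, alternative spellings of the same category, and an impossible value); the column names and values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical records with inconsistent spellings and an illogical value
df = pd.DataFrame({
    "City":   ["chennai", "Chennai ", "CHENNAI", "Madurai"],
    "Gender": ["M", "Male", "male", "F"],
    "Age":    [25.0, 30.0, -5.0, 40.0],   # -5 is an impossible age
})

# Standardize text: strip stray spaces and unify the case
df["City"] = df["City"].str.strip().str.title()

# Map different spellings of the same category to a single label
df["Gender"] = df["Gender"].replace({"M": "Male", "male": "Male", "F": "Female"})

# Treat illogical values as missing so they can be imputed or dropped later
df.loc[df["Age"] < 0, "Age"] = np.nan

print(df)
```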
