
Data Preprocessing Techniques

Data Preprocessing is the process of transforming raw, unorganized, and often incomplete data into a clean, structured format that is suitable for analysis or modeling.
It involves tasks such as:
1. Handling missing values
2. Encoding categorical variables (Encoding Categorical Data)
3. Scaling numerical features (Feature Scaling)
4. Removing inconsistencies
Together, these steps ensure that the dataset is accurate, consistent, and ready for statistical analysis or machine learning.

Handling Missing Data

Missing data occurs when no value is stored for a variable in an observation. If not addressed
properly, it can reduce statistical power, bias results, and affect model performance.
Techniques for Handling Missing Data
A. Imputation Methods
B. Deletion Methods

A. Imputation Methods
Imputation involves filling in missing values with estimated or plausible values based on other
available information in the dataset. This approach aims to retain the full dataset and prevent
information loss associated with removing incomplete cases.

Advantages of imputation methods
• Preserves sample size: Maintains the original dataset size, reducing potential loss of valuable information.
• Reduces bias: If performed appropriately, especially for data that is Missing at Random (MAR) or Missing Completely at Random (MCAR), imputation can help avoid the bias that could arise from deleting incomplete records.
• Allows for the use of standard analysis techniques: Once missing values are imputed, standard statistical and machine learning methods can be applied to the complete dataset.

Disadvantages of imputation methods
• Potential for introducing bias: If the imputation method is not appropriate or the missing data mechanism is not understood, imputation can introduce bias and distort the relationships between variables.
• Underestimation of standard errors: Some single imputation methods can underestimate standard errors, potentially overstating the precision of the results.
• Increased complexity: Advanced imputation methods like multiple imputation are more complex to implement and may require specialized knowledge.
• Computationally intensive: Some advanced imputation methods can be computationally demanding, particularly for large datasets.

These involve filling in missing values using estimates based on the available data (a short example follows this list).
1. Mean Imputation: Replace missing values with the mean of the observed values of that variable (works for MCAR, roughly symmetric data). Simple, but it can reduce variability.
2. Median Imputation: Replace missing values with the median, which is more robust to outliers than the mean (better for skewed data).
3. Mode Imputation: Replace missing values with the most frequent value; typically applied to categorical variables.
4. Advanced Methods: Regression imputation, KNN imputation, and multiple imputation.
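A minimal sketch of mean, median, and mode imputation in Python, using pandas and scikit-learn's SimpleImputer on a small made-up DataFrame (the Age and City columns and their values are illustrative assumptions, not from the text):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing values in a numeric and a categorical column
df = pd.DataFrame({
    "Age":  [22, 25, None, 30, None, 28],
    "City": ["Chennai", None, "Madurai", "Chennai", "Madurai", None],
})

# Mean imputation for the numeric column (SimpleImputer expects 2D input)
df["Age_mean"] = SimpleImputer(strategy="mean").fit_transform(df[["Age"]]).ravel()

# Median imputation: more robust to outliers than the mean
df["Age_median"] = df["Age"].fillna(df["Age"].median())

# Mode (most frequent value) imputation for the categorical column
df["City_mode"] = df["City"].fillna(df["City"].mode()[0])

print(df)
```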

B. Deletion Methods
Remove records or variables with missing data. Deletion methods involve removing cases
(rows) or variables (columns) that contain missing values. While simpler to implement,
deletion can lead to a reduction in sample size and potentially introduce bias if the
missingness is not Missing Completely at Random (MCAR).
Advantages of deletion methods
• Simplicity and ease of implementation: Deletion methods are relatively straightforward to understand and implement.
• Creates a complete dataset: The resulting dataset after deletion is free of missing values, allowing the use of standard analysis techniques designed for complete datasets.
Disadvantages of deletion methods
• Loss of information: Deleting cases or variables with missing values can lead to a significant reduction in the usable data, which reduces the statistical power of the analysis.
• Potential for introducing bias: If the missingness is related to the outcome or to other variables, deleting incomplete cases can introduce bias into the analysis.
• Reduced generalizability: A smaller sample size after deletion may lead to results that are less generalizable to the broader population.
Types of deletion methods (a short example follows this list)
• Listwise Deletion: Entire records (rows) with any missing values are removed from the dataset.
• Pairwise Deletion: Uses the available data for each analysis, ignoring missing values on a case-by-case basis.
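A minimal sketch of the deletion approaches in pandas, on a small made-up DataFrame (the column names are illustrative assumptions). Listwise deletion drops whole rows, while pairwise deletion simply uses whatever data is available for each calculation:

```python
import pandas as pd

# Hypothetical dataset with scattered missing values
df = pd.DataFrame({
    "Age":    [22, 25, None, 30],
    "Income": [30000, None, 42000, 50000],
    "City":   ["Chennai", "Madurai", None, "Chennai"],
})

# Listwise deletion: drop every row that has at least one missing value
listwise = df.dropna()

# Dropping a variable (column) that has too many missing values
no_city = df.drop(columns=["City"])

# Pairwise deletion happens at analysis time: each statistic uses whatever
# pairs of observations are available (pandas' corr() does this by default)
pairwise_corr = df[["Age", "Income"]].corr()

print(listwise)
print(pairwise_corr)
```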

Choosing between imputation and deletion methods


The decision of whether to use imputation or deletion methods depends on several factors, including:
• Amount of missing data: For a small amount of missing data, deletion might be acceptable, particularly if the data is MCAR.
• Mechanism of missingness: Understanding why the data is missing (MCAR, MAR, or MNAR) is crucial.
• Type of data (numerical or categorical): Some imputation methods are better suited for specific data types.
• Research goals and desired level of accuracy: More complex and robust methods like multiple imputation might be needed for analyses requiring high accuracy and valid statistical inferences.

Encoding categorical variables (Encoding Categorical Data)


Many statistical models and machine learning algorithms require numerical inputs, but real-world datasets often contain categorical variables (e.g., Gender, Color, City).
Encoding categorical data is the process of converting these categories into numerical form while preserving the information they carry. It is a crucial step in data preprocessing, especially for statistical and machine learning models that require numerical input. Here's a detailed review of the main encoding methods:
Types of Categorical Data
• Nominal Data: Categories without any order (e.g., colors like red, blue, green).
• Ordinal Data: Categories with a meaningful order or ranking (e.g., low, medium, high).
Common Encoding Techniques
• One-Hot Encoding
• Label Encoding
• Ordinal Encoding
• Effect Encoding (Deviation Encoding)
• Count (Frequency) Encoding

One-Hot Encoding
• Converts each category into a binary vector, with one column per category.
• For each data point, the column corresponding to its category is set to 1 and all others to 0.
• Ideal for nominal data because it does not imply any order.
• Disadvantage: increases dimensionality, which can lead to computational challenges, especially with high-cardinality features.
• Useful for linear models and algorithms that cannot handle categorical data directly.
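A minimal sketch of one-hot encoding with pandas' get_dummies (the Color column and its values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical nominal feature
df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# One binary (0/1) column per category; no ordering is implied
one_hot = pd.get_dummies(df["Color"], prefix="Color").astype(int)
print(one_hot)
```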
Label Encoding
• Assigns an integer value to each category.
• Suitable for ordinal data where the order matters.
• Simple and efficient.
• Risk: for nominal data, it may wrongly imply ordinality, leading to model bias.
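A minimal sketch with scikit-learn's LabelEncoder (the category values are assumptions). Note that LabelEncoder assigns integers in alphabetical order, so for truly ordinal features the explicit ordering shown under ordinal encoding below is usually preferable:

```python
from sklearn.preprocessing import LabelEncoder

sizes = ["low", "high", "medium", "low", "high"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sizes)

print(list(encoder.classes_))  # ['high', 'low', 'medium'] (alphabetical order)
print(encoded)                 # [1 0 2 1 0]
```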
Ordinal Encoding
• Similar to label encoding, but explicitly used for ordinal categories with a meaningful order.
• The assigned values reflect the order, but models may treat them as evenly spaced (a linear relationship), which is not always appropriate.
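A minimal sketch with scikit-learn's OrdinalEncoder, passing the categories in their meaningful order (the Priority column and its levels are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Priority": ["low", "high", "medium", "low"]})

# Supply the order explicitly so that low=0, medium=1, high=2
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["Priority_encoded"] = encoder.fit_transform(df[["Priority"]]).ravel()

print(df)
```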
Effect Encoding (Deviation Encoding)
• Uses the values 1, 0, and -1 instead of binary 0 and 1.
• Helps handle multicollinearity in linear models and makes coefficients interpretable.
• Used in linear regression and ANOVA models.
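A minimal pandas sketch of effect (deviation) coding built by hand from one-hot dummies, using a made-up Color column and treating "Green" as the reference category (both choices are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red", "Green"]})

# Start from one-hot dummies, then drop the reference category's column
effect = pd.get_dummies(df["Color"]).astype(int)
reference = "Green"
effect = effect.drop(columns=[reference])

# Rows belonging to the reference category get -1 in every remaining column
effect.loc[df["Color"] == reference, :] = -1

print(effect)  # columns Blue and Red with values in {1, 0, -1}
```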
Count (Frequency) Encoding
• Converts each category into its frequency of occurrence in the dataset.
• Reduces dimensionality compared to one-hot encoding.
• Might lose distinctiveness between categories that share the same frequency.
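A minimal sketch of count (frequency) encoding with pandas (the City column and its values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"City": ["Chennai", "Madurai", "Chennai", "Salem", "Chennai", "Madurai"]})

# Map each category to how many times it appears in the dataset
counts = df["City"].value_counts()
df["City_count"] = df["City"].map(counts)

# Divide by len(df) instead if a relative frequency is preferred
print(df)
```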

| Encoding Method | Suitable Data Type | Key Advantage | Key Disadvantage | Common Use Cases |
|---|---|---|---|---|
| One-Hot Encoding | Nominal | No ordinality implied | High dimensionality | Linear models, neural networks |
| Label Encoding | Ordinal | Simple and preserves order | Misleading for nominal data | Ordinal features, tree-based models |
| Ordinal Encoding | Ordinal | Preserves meaningful order | Assumes linearity in order | Ordered categories |
| Effect Encoding | Ordinal/Nominal | Handles multicollinearity well | More complex | Linear regression, ANOVA |
| Count Encoding | Nominal/Ordinal | Reduces dimensionality | Loses category distinctiveness | High-cardinality categorical features |

Scaling Numerical Features (Feature Scaling)


Feature scaling is the process of transforming numerical variables so that they are on a
similar scale. It is important because many statistical methods and machine learning
algorithms are sensitive to the magnitude of features — large-scale variables can dominate
small-scale variables in computations like distance measurement or gradient updates.
1. Standardization (Z-score Scaling)
2. Normalization (Min-Max Scaling)
Standardization (Z-score Scaling)
Definition:
Standardization, often referred to as Z-score scaling, is a feature scaling technique that
transforms numerical data to have a mean of 0 and a standard deviation of 1. This process
helps put features on a common scale without distorting differences in the ranges of values,
which is important for many statistical analyses and machine learning algorithms.
The Z-score of a data point x is calculated by subtracting the population mean μ from the value and then dividing by the population standard deviation σ:

z = (x - μ) / σ
Advantages:
• Makes variables unitless (removes the effect of different measurement units).
• Preserves the shape of the original distribution.
• Suitable for algorithms that assume a normal distribution of features (e.g., Linear Regression, PCA).
Disadvantages:
• Does not bound values to a fixed range.
• Can still be affected by extreme outliers.
Example Python Code:
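A minimal illustrative sketch using scikit-learn's StandardScaler on a small set of made-up Age values; the expected output appears in the comments:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature: five ages on their original scale
ages = np.array([[20.0], [30.0], [40.0], [50.0], [60.0]])

scaler = StandardScaler()            # z = (x - mean) / std
scaled = scaler.fit_transform(ages)

print(scaler.mean_)    # [40.]
print(scaled.ravel())  # approx. [-1.414 -0.707  0.     0.707  1.414]
```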

Normalization (Min-Max Scaling)


Definition:
Normalization (Min-Max Scaling) is a feature scaling method that transforms numerical
values into a fixed range, usually [0, 1], by linearly rescaling the data.
This ensures that the smallest value becomes 0 and the largest becomes 1, with all other
values proportionally adjusted in between.

Advantages:
• All features lie within the same range, preventing large-scale features from dominating.
• Ideal for algorithms that use distance-based measures (e.g., KNN, K-means, Neural Networks).
Disadvantages:
• Sensitive to outliers: a single extreme value can compress all other values into a narrow part of the range.
• Does not change the shape of the distribution.

Example Python Code:
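A minimal illustrative sketch using scikit-learn's MinMaxScaler on the same kind of made-up Age values; the expected output appears in the comments:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature: five ages on their original scale
ages = np.array([[20.0], [30.0], [40.0], [50.0], [60.0]])

scaler = MinMaxScaler()              # x' = (x - min) / (max - min), range [0, 1]
scaled = scaler.fit_transform(ages)

print(scaled.ravel())  # [0.   0.25 0.5  0.75 1.  ]
```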



Removing inconsistencies
Definition:
Removing inconsistencies is the process of detecting and correcting contradictory,
illogical, or irregular data entries so that the dataset is accurate, consistent, and reliable for
analysis. Inconsistencies often arise due to human error, multiple data sources, or system
glitches.
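A minimal pandas sketch of detecting and correcting a few common inconsistencies (mixed case, stray spaces, alternative spellings of the same category, and an impossible value); the column names and values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical records with inconsistent spellings and an illogical value
df = pd.DataFrame({
    "City":   ["chennai", "Chennai ", "CHENNAI", "Madurai"],
    "Gender": ["M", "Male", "male", "F"],
    "Age":    [25.0, 30.0, -5.0, 40.0],   # -5 is an impossible age
})

# Standardize text: strip stray spaces and unify the case
df["City"] = df["City"].str.strip().str.title()

# Map different spellings of the same category to a single label
df["Gender"] = df["Gender"].replace({"M": "Male", "male": "Male", "F": "Female"})

# Treat illogical values as missing so they can be imputed or dropped later
df.loc[df["Age"] < 0, "Age"] = np.nan

print(df)
```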
