Unit II 10 Data Preprocessing Techniques
Missing data occurs when no value is stored for a variable in an observation. If not addressed
properly, it can reduce statistical power, bias results, and affect model performance.
Techniques for Handling Missing Data
A. Imputation Methods
B. Deletion Methods
A. Imputation Methods
Imputation involves filling in missing values with estimated or plausible values based on other
available information in the dataset. This approach aims to retain the full dataset and prevent
information loss associated with removing incomplete cases.
1. Mean Imputation: Replace missing values with the mean of the observed values of that variable. Simple and works for MCAR, roughly symmetric data, but it reduces variability.
2. Median Imputation: Replace missing values with the median, which is more robust to outliers than the mean and better suited to skewed data.
3. Mode Imputation: Replace missing values with the most frequent value; typically applied to categorical variables.
4. Advanced Methods: Regression Imputation, KNN Imputation, and Multiple Imputation.
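A minimal sketch of the three basic imputation methods using pandas (the dataset and column names are hypothetical, chosen only for illustration):

import pandas as pd
import numpy as np

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age":    [25, np.nan, 32, 41, np.nan],          # numeric, roughly symmetric
    "income": [50000, 62000, np.nan, 58000, 47000],  # numeric, often skewed
    "city":   ["Pune", "Delhi", np.nan, "Pune", "Pune"],  # categorical
})

# 1. Mean imputation for a symmetric numeric column
df["age"] = df["age"].fillna(df["age"].mean())

# 2. Median imputation for a skewed numeric column
df["income"] = df["income"].fillna(df["income"].median())

# 3. Mode imputation for a categorical column
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)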
B. Deletion Methods
Deletion methods remove cases (rows) or variables (columns) that contain missing values. While simple to implement, deletion reduces the sample size and can introduce bias if the missingness is not Missing Completely at Random (MCAR).
Advantages of deletion methods
Simplicity and Ease of Implementation: Deletion methods are relatively
straightforward to understand and implement.
Creates a complete dataset: The resulting dataset after deletion is free of missing
values, allowing for the use of standard analysis techniques designed for complete
datasets.
Disadvantages of deletion methods
Loss of information: Deleting cases or variables with missing values can lead to a
significant reduction in the usable data, which can reduce the statistical power of the
analysis.
Potential for introducing bias: If the missingness is related to the outcome or other
variables, deleting incomplete cases can introduce bias into the analysis.
Reduced generalizability: A smaller sample size after deletion may lead to results
that are less generalizable to the broader population.
Types of deletion methods
Listwise Deletion: Entire records (rows) with any missing values are removed from
the dataset.
Pairwise Deletion: Uses available data for each analysis, ignoring missing data on a
case-by-case basis.
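A minimal sketch of both deletion types using pandas (hypothetical data; note that pandas' corr() applies pairwise deletion internally):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0],
    "y": [10.0, np.nan, 30.0, 40.0],
    "z": [5.0, 6.0, 7.0, 8.0],
})

# Listwise deletion: drop every row containing any missing value
complete_cases = df.dropna()   # only rows 0 and 3 remain

# Pairwise deletion: each computation uses whatever data is available.
# corr() derives each pairwise correlation from the rows that are
# non-missing for that particular pair of columns.
corr_matrix = df.corr()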
Encoding Categorical Data
One-Hot Encoding
Converts each category into a binary vector with one column per category.
For each data point, the column corresponding to its category is set to 1, others
to 0.
Ideal for nominal data because it does not imply any order.
Disadvantage: increases dimensionality, which can lead to computational
challenges, especially with high-cardinality features.
Useful for linear models and algorithms that cannot handle categorical data
directly.
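A minimal sketch using pandas' get_dummies (the color values are hypothetical):

import pandas as pd

s = pd.Series(["red", "green", "blue", "green"])

# One binary column per category; exactly one 1 per row
one_hot = pd.get_dummies(s, prefix="color", dtype=int)
print(one_hot)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0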
Label Encoding
Assigns an integer value to each category.
Suitable for ordinal data where the order matters.
Simple and efficient.
Risk: For nominal data, it may wrongly imply ordinality, leading to model
bias.
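A minimal sketch using scikit-learn's LabelEncoder, which also illustrates the risk above: it assigns integers in alphabetical order, so the codes need not match any real-world order:

from sklearn.preprocessing import LabelEncoder

sizes = ["small", "large", "medium", "small"]
encoder = LabelEncoder()
codes = encoder.fit_transform(sizes)
print(list(encoder.classes_))  # ['large', 'medium', 'small'] (alphabetical)
print(codes)                   # [2 0 1 2] -- does not reflect actual size order

Note that scikit-learn documents LabelEncoder for encoding target labels (y); for input features, OrdinalEncoder (next section) is the usual choice.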
Ordinal Encoding
Similar to label encoding but explicitly used for ordinal categories with a
meaningful order.
The values reflect the order but may imply equal spacing between adjacent
categories, which is not always appropriate.
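A minimal sketch using scikit-learn's OrdinalEncoder, passing the category order explicitly so the integers reflect it (the categories are hypothetical):

from sklearn.preprocessing import OrdinalEncoder

X = [["low"], ["high"], ["medium"], ["low"]]

# State the meaningful order explicitly; otherwise alphabetical order is used
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
print(encoder.fit_transform(X))
# [[0.]
#  [2.]
#  [1.]
#  [0.]]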
Effect Encoding (Deviation Encoding)
Uses values 1, 0, and -1 instead of binary 0 and 1.
Helps handle multicollinearity in linear models and makes coefficients
interpretable.
Used in linear regression and ANOVA models.
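scikit-learn has no built-in effect encoder, so the sketch below constructs deviation coding manually in pandas; the category "A" is an arbitrarily chosen reference level, and libraries such as statsmodels offer the same coding via Sum contrasts:

import pandas as pd

s = pd.Series(["A", "B", "C", "A", "C"])

# Start from one-hot columns, dropping the reference category "A"
coded = pd.get_dummies(s, drop_first=True).astype(int)  # columns: B, C

# Rows of the reference category get -1 in every column
coded.loc[s == "A", :] = -1
print(coded)
#    B  C
# 0 -1 -1
# 1  1  0
# 2  0  1
# 3 -1 -1
# 4  0  1

With this coding, a linear model's coefficients measure each category's deviation from the overall mean rather than from a reference category.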
Count (Frequency) Encoding
Converts categories into the frequency of occurrence in the dataset.
Reduces dimensionality compared to one-hot encoding.
Might lose distinctiveness between categories with the same frequency.
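A minimal sketch mapping each category to its frequency of occurrence (hypothetical city values):

import pandas as pd

s = pd.Series(["Pune", "Delhi", "Pune", "Mumbai", "Pune", "Delhi"])

# Replace each category with its count of occurrences
encoded = s.map(s.value_counts())
print(encoded.tolist())  # [3, 2, 3, 1, 3, 2]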
Comparison of encoding techniques:

| Encoding        | Data Type       | Advantages                     | Disadvantages                  | Typical Use                           |
|-----------------|-----------------|--------------------------------|--------------------------------|---------------------------------------|
| Label Encoding  | Ordinal         | Simple and preserves order     | Misleading for nominal data    | Ordinal features, tree-based models   |
| Effect Encoding | Ordinal/Nominal | Handles multicollinearity well | More complex                   | Linear regression, ANOVA              |
| Count Encoding  | Nominal/Ordinal | Reduces dimensionality         | Loses category distinctiveness | High-cardinality categorical features |
Standardization (Z-Score Scaling)
Standardization rescales a feature to zero mean and unit variance: z = (x - mean) / standard deviation.
Advantages:
Makes variables unitless (removes effect of different measurement units).
Preserves the shape of the original distribution.
Suitable for algorithms that assume normal distribution of features (e.g., Linear
Regression, PCA).
Disadvantages:
Does not bound values to a fixed range.
Can still be affected by extreme outliers.
Example Python Code:
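The original code and output were not preserved; below is a minimal sketch using scikit-learn's StandardScaler on hypothetical values, with the resulting output shown in comments:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# z = (x - mean) / standard deviation; here mean = 30, std is about 14.14
scaler = StandardScaler()
z = scaler.fit_transform(X)
print(z.ravel())
# [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]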
Normalization (Min-Max Scaling)
Min-max scaling rescales a feature to a fixed range, usually [0, 1]: x' = (x - min) / (max - min).
Advantages:
All features lie within the same range, preventing large-scale features from
dominating.
Ideal for algorithms that use distance-based measures (e.g., KNN, K-means, Neural
Networks).
Disadvantages:
Sensitive to outliers: a single extreme value can compress all the other
values into a narrow part of the range.
Does not change the shape of the distribution.
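Example Python Code (a minimal sketch using scikit-learn's MinMaxScaler on hypothetical values, with the output shown in comments):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# x' = (x - min) / (max - min), mapping every value into [0, 1]
scaler = MinMaxScaler()
scaled = scaler.fit_transform(X)
print(scaled.ravel())
# [0.   0.25 0.5  0.75 1.  ]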
Removing inconsistencies
Definition:
Removing inconsistencies is the process of detecting and correcting contradictory,
illogical, or irregular data entries so that the dataset is accurate, consistent, and reliable for
analysis. Inconsistencies often arise due to human error, multiple data sources, or system
glitches.
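A minimal sketch of typical consistency fixes in pandas (hypothetical records with inconsistent casing, stray whitespace, and logically impossible ages):

import pandas as pd

df = pd.DataFrame({
    "city": ["pune", "PUNE", "Delhi ", "delhi", "Mumbai"],
    "age":  [25, 30, -4, 41, 230],
})

# Standardize text entries: strip stray whitespace, unify casing
df["city"] = df["city"].str.strip().str.title()

# Treat logically impossible ages (outside 0-120) as missing
df["age"] = df["age"].where(df["age"].between(0, 120))

# Drop exact duplicate records introduced by merging sources
df = df.drop_duplicates()

print(df)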