Data Preprocessing Techniques
Data preprocessing in machine learning is the process of transforming raw data into a clean,
structured, and usable format that algorithms can effectively learn from. It's a crucial step because
raw data often contains inconsistencies, errors, and irrelevant information that can negatively impact
model accuracy and performance.
Data preprocessing is an important step in the data science workflow, transforming raw data into a clean,
structured format for analysis. It involves tasks like handling missing values, normalizing data,
and encoding variables. Preprocessing refers to the transformations applied to data before
feeding it to the algorithm. Data cleaning is an important step in the machine learning
(ML) pipeline, as it involves identifying and removing missing, duplicate, or irrelevant data.
The goal of data cleaning is to ensure that the data is accurate, consistent, and free of errors,
since raw data is often noisy, incomplete, and inconsistent, which can negatively impact the
accuracy of the model and the reliability of the insights derived from it. Professional data scientists
usually invest a large portion of their time in this step because the quality of the data
largely determines the quality of the model.
Key aspects of data cleaning in machine learning:
Identifying and handling missing values:
This involves techniques like imputation (filling missing values with mean, median, or mode) or
removing rows/columns with excessive missing data.
Removing duplicates:
Eliminating redundant entries to prevent bias and improve model efficiency.
Addressing inconsistencies:
Standardizing data formats, capitalization, and other inconsistencies to ensure uniformity.
Detecting and correcting errors:
Identifying and rectifying incorrect data entries, outliers, or invalid values.
Data transformation:
Converting data into a suitable format for the chosen machine learning model (e.g., scaling
numerical features, one-hot encoding categorical variables).
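A minimal sketch of a few of these cleaning steps using pandas; the DataFrame and column names here are illustrative assumptions, not from the original text:
```python
import pandas as pd

# Hypothetical raw data with inconsistent capitalization, duplicates, and a missing value.
df = pd.DataFrame({
    "city": ["Paris", "paris", "London", "London", None],
    "temp": [21.0, 21.0, 18.5, 18.5, 19.0],
})

df["city"] = df["city"].str.lower()        # standardize capitalization
df = df.drop_duplicates()                  # remove redundant entries
df["city"] = df["city"].fillna("unknown")  # handle a missing value
print(df)
```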
Why is data cleaning important in machine learning?
Improved model accuracy:
Clean data provides a solid foundation for training machine learning models, leading to more
accurate predictions.
Better model generalization:
Clean data helps models generalize well to unseen data, making them more robust and reliable.
Reduced bias and variance:
By removing inconsistencies and errors, data cleaning minimizes bias and variance in the model,
leading to fairer and more accurate predictions.
Enhanced model interpretability:
Clean data makes it easier to understand the relationships between features and the target
variable, improving model interpretability.
Time and resource savings:
Clean data reduces the need for extensive model tuning and debugging, saving time and
computational resources.
1) Data Cleaning
How to Perform Data Cleaning?
The process begins with a thorough understanding of the data and its structure to identify issues like
missing values, duplicates, and outliers. Performing data cleaning involves a systematic process to
identify and remove errors in a dataset. The following are the essential steps to perform data cleaning.
Removal of Unwanted Observations: Identify and remove irrelevant or redundant (unwanted)
observations from the dataset. This step involves analyzing data entries for duplicate records,
irrelevant information, or data points that do not contribute to analysis and prediction. Removing
them from the dataset helps reduce noise and improve the overall quality of the dataset.
Fixing Structural Errors: Address structural issues in the dataset, such as inconsistencies in
data formats or variable types. Standardizing formats ensures uniformity in the data structure and
hence data consistency.
Managing Outliers: Outliers are points that deviate significantly from the dataset mean.
Identifying and managing outliers can significantly improve model accuracy, as these extreme values
skew the analysis. Depending on the context, decide whether to remove outliers or transform
them to minimize their impact on the analysis.
Handling Missing Data: To handle missing data effectively, we can impute missing values
using statistical methods, remove records with missing values, or employ advanced
imputation techniques. Handling missing data helps prevent bias and maintain the
integrity of the data.
Handling Missing Data
Missing data is a common issue in real-world datasets and it can occur due to various reasons
such as human errors, system failures or data collection issues. Various techniques can be used
to handle missing data, such as imputation, deletion or substitution.
Let's check the missing values column-wise using df.isnull(), which returns a boolean mask
indicating whether each value is null. Chaining .sum() counts the null values in each column, and
dividing by the total number of rows in the dataset (then multiplying by 100) gives the percentage
of missing values per column.
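A minimal sketch of this check, assuming a small pandas DataFrame named df with illustrative columns:
```python
import pandas as pd

# Hypothetical DataFrame with a few missing entries.
df = pd.DataFrame({
    "Age": [25, None, 32, 41, None],
    "Income": [50000, 62000, None, 58000, 61000],
})

# isnull() returns a boolean mask; sum() counts nulls per column;
# dividing by the number of rows and multiplying by 100 gives the percentage missing.
missing_pct = df.isnull().sum() / len(df) * 100
print(missing_pct)
```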
The sklearn.preprocessing package provides several common utility functions and transformer classes to
change raw feature vectors into a representation that is more suitable for the downstream estimators.
Imputation is the act of replacing missing data with statistical estimates of the missing
values. The goal of any imputation technique is to produce a complete dataset that can then
be used for machine learning.
Data imputation is the process of filling in missing data points to ensure that the dataset is
complete and ready for analysis. In this blog, we'll explore some common data imputation
techniques such as mean, median, and mode imputation. Mean/median imputation
consists of replacing all occurrences of missing values (NA) within a variable with the mean (if
the variable has a Gaussian distribution) or the median (if the variable has a skewed
distribution).
1. Mean Imputation
Mean imputation is a simple and widely used technique for handling missing data. It involves replacing
missing values with the mean (average) of the observed values for that particular feature.
Example: If the feature Age has missing values, you calculate the mean age from the available data and
replace missing values with this mean.
Pros:
Simple and quick to implement.
Works well when the data is MCAR and not skewed.
2. Median Imputation
Median imputation is similar to mean imputation but uses the median instead of the mean. This is
useful when the data is skewed, as the median is less affected by outliers than the mean.
Example: For a feature like Income, which may be skewed by a few very high values, median imputation
can be more appropriate.
Pros:
More robust to outliers than mean imputation.
Preserves the data’s central tendency better in skewed distributions.
Assumptions
Mean/median imputation assumes that the data are missing completely at random (MCAR).
If this is the case, we can replace the NA values with the most likely value of the variable,
which is the mean if the variable has a Gaussian distribution or the median otherwise.
The rationale is to replace the missing values with the most probable value of the variable.
Advantages
1. Easy to implement
2. Fast way of obtaining complete data
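A minimal sketch of mean and median imputation using scikit-learn's SimpleImputer; the DataFrame and column names are illustrative:
```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: Age is roughly symmetric (mean works), Income is skewed (median is safer).
df = pd.DataFrame({
    "Age": [25, np.nan, 32, 41, np.nan],
    "Income": [50000, 62000, np.nan, 58000, 1000000],
})

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

# fit_transform returns a 2D array, so ravel() flattens it back into a column.
df["Age"] = mean_imputer.fit_transform(df[["Age"]]).ravel()
df["Income"] = median_imputer.fit_transform(df[["Income"]]).ravel()
print(df)
```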
2) Data Transformation
What is Data Transformation?
Data transformation is the process of converting raw data into a more suitable format or structure
for analysis, to improve its quality and make it compatible with the requirements of a particular task
or system.
Data Transformation in Machine Learning
Data transformation is a key step in a machine learning pipeline: it modifies the raw data and
converts it into a format better suited for analysis and model training. In data transformation,
we usually deal with issues such as noise, missing values, outliers, and non-normality.
Why is Data Transformation Important?
Data transformation is crucial in the data analysis and machine learning pipeline as it plays an
important role in preparing raw data for meaningful insights and accurate model building. Raw
data, often sourced from diverse channels, may be inconsistent, contain missing values, or exhibit
variations that could impact the reliability of analyses.
Data transformation addresses these challenges by cleaning, encoding, and structuring the data in
a manner that makes it compatible with analytical tools and algorithms.
Additionally, data transformation facilitates feature engineering, allowing the creation of new
variables that may improve model performance. By converting data into a more suitable format,
data transformation ensures that models are trained on high-quality, standardized data, leading to
more reliable predictions and valuable insights.
Different Data Transformation Techniques
The choice of data transformation technique depends on the characteristics of the data and the
machine learning algorithm that we intend to use on the data. The main techniques are discussed
in detail below.
1. Normalization and Standardization
Standardization and normalization are two of the most common techniques used in data
transformation. They aim to scale and transform the data so that the features have similar
scales, which makes it easier for the machine learning algorithm to learn and converge.
Standardization:
Standardization, also known as z-score normalization (or zero-mean scaling), transforms a feature
so that its mean becomes 0 and its standard deviation becomes 1. It is usually useful when features
have different scales but follow an approximately normal distribution, and it prepares the data for
algorithms that assume standardized inputs. Note that standardization only centers and rescales the
data; it does not change the shape of its distribution. The new data point after standardization
becomes:

Xi' = (Xi - mean(X)) / std(X)

Here:
Xi is the original feature value.
Xi' is the scaled feature value.
mean(X) is the mean (average) of the feature across all data points.
std(X) is the standard deviation of the feature across all data points.
Explanation
1. Calculate Mean and Standard Deviation: For each feature, you calculate the mean
(average) and standard deviation. These statistics are used to determine the center and
the spread of the data.
2. Subtract the Mean: You subtract the mean of each feature from every data point. This
operation centers the data, making the new mean of the feature 0.
3. Divide by the Standard Deviation: You divide each data point by the standard deviation of
the feature. This scaling operation makes the standard deviation of the feature 1.
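A minimal sketch of standardization using scikit-learn's StandardScaler; the feature matrix here is illustrative:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has mean 0 and std 1

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```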
Normalization:
The main objective of normalization is to rescale the features to a common range of values, typically
(0, 1) or (-1, 1). Normalization is usually used when different features have very different ranges of
values and some features might otherwise dominate the learning process because of their larger
numerical values; rescaling equalizes the ranges and makes sure that the features contribute more
evenly to learning.
Min-Max Scaling:
Min-max scaling is the most common normalization technique. It rescales each feature to the chosen
range, by default (0, 1), using the feature's minimum and maximum values:
X' = (X - min(X)) / (max(X) - min(X)).
Advantages:
Simple and intuitive method.
Preserves the relationships between data points.
Suitable for algorithms that assume data within a bounded range.
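A minimal sketch of min-max normalization using scikit-learn's MinMaxScaler; the feature matrix is illustrative:
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = MinMaxScaler()            # feature_range=(0, 1) by default
X_norm = scaler.fit_transform(X)   # applies X' = (X - min) / (max - min) per column
print(X_norm)                      # every column is rescaled to [0, 1]
```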
Encoding Categorical Variables
Often, some features of a dataset are categorical (labeled with different categories), but most
machine learning algorithms work better on numeric features than on other data types.
Therefore, encoding categorical features becomes an important step of data transformation.
Categorical features can be encoded into numerical features in different ways.
Some of the most common encoding techniques:
Ordinal Encoding:
Ordinal encoding assigns a unique numeric value to each category of a feature according to some
kind of hierarchy or natural order. For example, if a categorical feature has the three categories
"High-School", "Bachelor's", and "Master's", ordinal encoding can label them 0, 1, 2 based on the
educational hierarchy; similarly, a feature called size with the values 'small', 'medium', 'large'
could be labeled 0, 1, 2 respectively.
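A minimal sketch of ordinal encoding with an explicit hierarchy using scikit-learn's OrdinalEncoder; the Size feature and its values are illustrative:
```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Size": ["small", "large", "medium", "small"]})

# Passing the category order makes the encoding respect the hierarchy:
# small -> 0, medium -> 1, large -> 2.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["Size_encoded"] = encoder.fit_transform(df[["Size"]]).ravel()
print(df)
```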
Nominal Encoding: The standard technique for encoding nominal (unordered) categorical data is
one-hot encoding.
One-Hot Encoding:
One-hot encoding is the most common encoding technique used in data transformation for nominal
categorical data. It converts each category of a categorical feature into a separate binary feature
(0 or 1). For example, if a dataset has a feature called 'vehicle' with the categories 'car', 'bike',
and 'bicycle', one-hot encoding will create three separate columns such as 'is_car', 'is_bike', and
'is_bicycle', labeled 1 if the category is present and 0 if it is absent.
How One-Hot Encoding Works: An Example
To grasp the concept better, let's explore a simple example. Imagine we have a dataset with fruits
(a categorical feature) and their corresponding prices. Using one-hot encoding we can transform these
categorical values into numerical form. For example:
Wherever the fruit is "Apple," the Apple column will have a value of 1 while the other fruit
columns (like Mango or Orange) will contain 0.
This pattern ensures that each categorical value gets its own column represented with binary
values (1 or 0) making it usable for machine learning models.
The output after applying one-hot encoding to the data is as follows:
So, <1 0 0> will represent the fruit “Apple”.
< 0 1 0> will represent the fruit “Mango”.
< 0 0 1> will represent the fruit “Orange”.
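A minimal sketch of one-hot encoding the fruit example using pandas get_dummies; the prices are illustrative:
```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apple", "Mango", "Orange", "Apple"],
    "Price": [120, 80, 60, 110],
})

# get_dummies creates one binary column per category (dtype=int gives 0/1 output).
encoded = pd.get_dummies(df, columns=["Fruit"], dtype=int)
print(encoded)
# Each row has a 1 in exactly one of Fruit_Apple / Fruit_Mango / Fruit_Orange.
```
pd.get_dummies is convenient for quick exploration; scikit-learn's OneHotEncoder does the same transformation and is typically preferred inside a modeling pipeline, because the fitted encoder can be reused on new data.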
Advantages and Disadvantages of One Hot Encoding
Advantages of Using One Hot Encoding
1. It allows the use of categorical variables in models that require numerical input.
2. It can improve model performance by providing more information to the model about the
categorical variable.
3. It can help to avoid the problem of ordinality which can occur when a categorical variable has a
natural ordering (e.g. "small", "medium", "large").
Disadvantages of Using One Hot Encoding
1. It can lead to increased dimensionality as a separate column is created for each category in the
variable. This can make the model more complex and slow to train.
2. It can lead to sparse data as most observations will have a value of 0 in most of the one-hot
encoded columns.
3. It can lead to overfitting especially if there are many categories in the variable and the sample
size is relatively small.
Feature Creation
The process of creating new features or modifying existing features to improve the performance
of a machine learning model is called feature engineering. It helps create a more informative and
effective representation of the patterns present in the data by combining and transforming the given
features. Through feature engineering we can improve our model's performance and generalization
ability.
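A minimal sketch of feature creation with pandas; the columns and the derived ratio feature are illustrative assumptions:
```python
import pandas as pd

df = pd.DataFrame({
    "Income": [50000, 62000, 58000],
    "LoanAmount": [10000, 25000, 5000],
})

# Combine existing columns into a new, potentially more informative feature.
df["LoanToIncomeRatio"] = df["LoanAmount"] / df["Income"]
print(df)
```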
Advantages and Disadvantages of Data Transformation
Data transformation has several advantages, but it also has drawbacks that we must pay attention
to in order to achieve the goals of each project. Some of the advantages and disadvantages of data
transformation to keep in mind, so that we can best improve our model's performance, are:
Advantages
Improved Model Performance: The model gets better at generalizing new data when the
issues in data are resolved through data transformation.
Handling Missing Data: Properly handled missing data results in a noticeable increase in model accuracy.
Better Convergence: Data normalization and standardization result in better convergence
of the model during training.
Dimensionality Reduction: Simplifies the model training process.
Better Insights from Feature Engineering
Disadvantages:
Information Loss: If a transformation is applied too aggressively, valuable details might get discarded,
leading to a weaker model.
Data Leakage: Applying transformations inappropriately, e.g., fitting them on the entire dataset
including the test data, can lead to overestimation of model performance (see the sketch after this list).
Increased Complexity
Assumption Violation: Sometimes a transformation might not align with the assumptions of the
chosen machine learning algorithm, which can lead to poor model performance in general.
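A minimal sketch of how to avoid this kind of leakage: fit the transformer on the training split only and reuse its statistics on the test split. The data here is randomly generated for illustration:
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from the training split only
X_test_scaled = scaler.transform(X_test)        # the same statistics reused on the test split
```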
Thus, data transformation is a crucial step in machine learning, and it requires careful
consideration of which techniques to apply to the data. Data transformation can improve model
performance, but cross-validation and evaluating model performance on unseen data are essential
steps to ensure that the chosen data transformation techniques are appropriate and effective for
the given task.
3) Data/Feature Reduction
There are two main techniques: Feature Selection and Feature Extraction.
Feature Extraction
Feature extraction is a crucial step in the machine learning pipeline, enabling the development of
more effective and efficient models by transforming raw data into a more meaningful and
manageable representation.
Why is Feature Extraction Important?
Improved Model Performance:
Feature extraction can lead to better accuracy and generalization by focusing on the most relevant
information, reducing noise, and simplifying the model's task.
Reduced Dimensionality:
By extracting key features, it can reduce the number of variables, making the data easier to process
and analyze, especially in high-dimensional datasets.
Increased Efficiency:
Processing fewer, more informative features can lead to faster training times and better
computational efficiency.
Enhanced Interpretability:
Extracted features can be more understandable and interpretable than raw data, making it easier to
understand the underlying patterns and relationships.
Key Concepts in Feature Extraction:
Dimensionality Reduction: Reducing the number of features while retaining essential information,
using techniques such as Principal Component Analysis (PCA).
Introduction to Dimensionality/Feature Reduction
Dimensionality reduction helps to reduce the number of features while retaining key information.
Techniques like principal component analysis (PCA), singular value decomposition
(SVD) and linear discriminant analysis (LDA) convert data into a lower-dimensional space while
preserving important details.
Principal Component Analysis (PCA) is a powerful technique used for feature extraction, particularly
in situations involving high-dimensional data.
It transforms the original features into a new set of uncorrelated features called principal
components, which capture the maximum variance in the data. By selecting a subset of these
principal components (those with the highest variance), PCA effectively reduces dimensionality while
retaining the most important information. This process is often used as a preprocessing step for
machine learning algorithms, improving their efficiency and performance.
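A minimal sketch of PCA-based dimensionality reduction with scikit-learn; the synthetic data and the choice of two components are illustrative:
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 10)                   # 100 samples, 10 original features

X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)                     # keep the 2 highest-variance components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                        # (100, 2)
print(pca.explained_variance_ratio_)          # variance captured by each component
```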
Advantages of Dimensionality Reduction
As seen earlier, high dimensionality makes models inefficient. Let's now summarize the key
advantages of reducing dimensionality.
Faster Computation: With fewer features, machine learning algorithms can process data more
quickly. This results in faster model training and testing, which is particularly useful when
working with large datasets.
Better Visualization: Reducing dimensions makes it easier to visualize data and reveal hidden
patterns.
Prevent Overfitting: With fewer features, models are less likely to memorize the training data and
overfit. This helps the model generalize better to new, unseen data, improving its ability to make
accurate predictions.
4) Outlier Detection and Treatment
What is an Outlier?
An outlier is essentially a statistical anomaly, a data point that significantly deviates from other
observations in a dataset. Outliers can arise due to measurement errors, natural variation, or rare
events.
Common Techniques Used for Detecting Outliers
Outlier detection is a critical task in data analysis, essential for ensuring the quality and reliability of
conclusions drawn from data. Different techniques are tailored to varying data types and
scenarios, ranging from statistical methods for general datasets to specialized algorithms for
spatial and temporal data. Such techniques include:
Standard Deviation Method
Standard Deviation Method is based on the assumption that data follows a normal distribution.
Outliers are defined as those observations that lie beyond a specified number of standard
deviations away from the mean. Typically, data points outside of three standard deviations from
the mean are considered outliers.
Z-Score Method
The Z-score method calculates the number of standard deviations each data point is from the
mean. A Z-score threshold is set, commonly 3, and any data point with a Z-score exceeding this
threshold is considered an outlier. This method assumes a normal distribution and is sensitive to
extreme values in small datasets.
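A minimal sketch of the Z-score method with a threshold of 3; the data is synthetic, with two extreme values injected for illustration:
```python
import numpy as np

rng = np.random.default_rng(0)
# 200 points drawn around 50, plus two injected extreme values.
data = np.append(rng.normal(loc=50, scale=5, size=200), [120, -30])

z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 3]
print(outliers)  # only the injected extreme values exceed the threshold
```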
Outlier treatment refers to the methods used to handle unusual data points that deviate significantly
from the rest of the dataset. These methods aim to either minimize the impact of outliers on analysis
or remove them entirely, depending on the context and the nature of the outliers. Common
approaches include removing outliers, transforming the data, using robust statistical methods, or
replacing outliers with more representative values.
Common Outlier Treatment Methods:
Removing Outliers:
This involves deleting the outlier data points from the dataset. This is a straightforward approach,
but it can lead to a loss of information if the outliers are meaningful or if there are a large number of
them.
Transforming Data:
Applying mathematical functions like logarithms, square roots, or reciprocals to the data can reduce
the influence of outliers and make the data distribution more normal.