Data Preprocessing Techniques
Data preprocessing in machine learning is the process of transforming raw data into a clean,
structured, and usable format that algorithms can effectively learn from. It's a crucial step because
raw data often contains inconsistencies, errors, and irrelevant information that can negatively impact
model accuracy and performance.
Data preprocessing is an important step in the data science workflow, transforming raw data into a clean,
structured format for analysis. It involves tasks like handling missing values, normalizing data,
and encoding variables. Preprocessing refers to the transformations applied to data before
feeding it to the algorithm. Data cleaning is an important step in the machine learning
(ML) pipeline, as it involves identifying and removing missing, duplicate, or irrelevant data.
The goal of data cleaning is to ensure that the data is accurate, consistent, and free of errors,
since raw data is often noisy, incomplete, and inconsistent, which can negatively impact the
accuracy of the model and the reliability of the insights derived from it. Professional data scientists
usually invest a large portion of their time in this step because the quality of the data
largely determines the quality of the model.
Key aspects of data cleaning in machine learning:
Identifying and handling missing values:
This involves techniques like imputation (filling missing values with mean, median, or mode) or
removing rows/columns with excessive missing data.
Removing duplicates:
Eliminating redundant entries to prevent bias and improve model efficiency.
Addressing inconsistencies:
Standardizing data formats, capitalization, and other inconsistencies to ensure uniformity.
Detecting and correcting errors:
Identifying and rectifying incorrect data entries, outliers, or invalid values.
Data transformation:
Converting data into a suitable format for the chosen machine learning model (e.g., scaling
numerical features, one-hot encoding categorical variables).
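A minimal sketch of a few of these cleaning steps using pandas; the DataFrame and column names here are illustrative assumptions, not from the original text:
```python
import pandas as pd

# Hypothetical raw data with inconsistent capitalization, duplicates, and a missing value.
df = pd.DataFrame({
    "city": ["Paris", "paris", "London", "London", None],
    "temp": [21.0, 21.0, 18.5, 18.5, 19.0],
})

df["city"] = df["city"].str.lower()        # standardize capitalization
df = df.drop_duplicates()                  # remove redundant entries
df["city"] = df["city"].fillna("unknown")  # handle a missing value
print(df)
```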
Why is data cleaning important in machine learning?
Improved model accuracy:
Clean data provides a solid foundation for training machine learning models, leading to more
accurate predictions.
Better model generalization:
Clean data helps models generalize well to unseen data, making them more robust and reliable.
Reduced bias and variance:
By removing inconsistencies and errors, data cleaning minimizes bias and variance in the model,
leading to fairer and more accurate predictions.
Enhanced model interpretability:
Clean data makes it easier to understand the relationships between features and the target
variable, improving model interpretability.
Time and resource savings:
Clean data reduces the need for extensive model tuning and debugging, saving time and
computational resources.
1) Data Cleaning
How to Perform Data Cleaning?
The process begins with a thorough understanding of the data and its structure to identify issues like
missing values, duplicates, and outliers. Performing data cleaning involves a systematic process to
identify and remove errors in a dataset. The following are the essential steps to perform data cleaning.
Removal of Unwanted Observations: Identify and remove irrelevant or redundant (unwanted)
observations from the dataset. This step involves analyzing data entries for duplicate records,
irrelevant information, or data points that do not contribute to analysis and prediction. Removing
them from the dataset helps reduce noise and improve the overall quality of the dataset.
Fixing Structural Errors: Address structural issues in the dataset, such as inconsistencies in
data formats or variable types. Standardizing formats ensures uniformity in the data structure and
hence data consistency.
Managing Outliers: Outliers are points that deviate significantly from the dataset mean.
Identifying and managing outliers can significantly improve model accuracy, as these extreme values
skew the analysis. Depending on the context, decide whether to remove outliers or transform
them to minimize their impact on the analysis.
Handling Missing Data: To handle missing data effectively, we can impute missing values
using statistical methods, remove records with missing values, or employ advanced
imputation techniques. Handling missing data helps prevent bias and maintain the
integrity of the data.
Handling Missing Data
Missing data is a common issue in real-world datasets and it can occur due to various reasons
such as human errors, system failures or data collection issues. Various techniques can be used
to handle missing data, such as imputation, deletion or substitution.
Let's check the missing values column-wise using df.isnull(), which returns a boolean mask
indicating whether each value is null. Chaining .sum() counts the null values in each column, and
dividing by the total number of rows in the dataset (then multiplying by 100) gives the percentage
of missing values per column.
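A minimal sketch of this check, assuming a small pandas DataFrame named df with illustrative columns:
```python
import pandas as pd

# Hypothetical DataFrame with a few missing entries.
df = pd.DataFrame({
    "Age": [25, None, 32, 41, None],
    "Income": [50000, 62000, None, 58000, 61000],
})

# isnull() returns a boolean mask; sum() counts nulls per column;
# dividing by the number of rows and multiplying by 100 gives the percentage missing.
missing_pct = df.isnull().sum() / len(df) * 100
print(missing_pct)
```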
The sklearn.preprocessing package provides several common utility functions and transformer classes to
change raw feature vectors into a representation that is more suitable for the downstream estimators.
Imputation is the act of replacing missing data with statistical estimates of the missing
values. The goal of any imputation technique is to produce a complete dataset that can then
be used for machine learning.
Data imputation is the process of filling in missing data points to ensure that the dataset is
complete and ready for analysis. In this blog, we'll explore some common data imputation
techniques such as mean, median, and mode imputation. Mean/median imputation
consists of replacing all occurrences of missing values (NA) within a variable with the mean (if
the variable has a Gaussian distribution) or the median (if the variable has a skewed
distribution).
1. Mean Imputation
Mean imputation is a simple and widely used technique for handling missing data. It involves replacing
missing values with the mean (average) of the observed values for that particular feature.
Example: If the feature Age has missing values, you calculate the mean age from the available data and
replace missing values with this mean.
Pros:
Simple and quick to implement.
Works well when the data is MCAR and not skewed.
2. Median Imputation
Median imputation is similar to mean imputation but uses the median instead of the mean. This is
useful when the data is skewed, as the median is less affected by outliers than the mean.
Example: For a feature like Income, which may be skewed by a few very high values, median imputation
can be more appropriate.
Pros:
More robust to outliers than mean imputation.
Preserves the data’s central tendency better in skewed distributions.
Assumptions
Mean/median imputation assumes that the data are missing completely at random (MCAR).
If this is the case, we can replace the NA values with the most likely value of the variable,
which is the mean if the variable has a Gaussian distribution or the median otherwise.
The rationale is to replace the missing values with the most probable value of the variable.
Advantages
1. Easy to implement
2. Fast way of obtaining complete data
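A minimal sketch of mean and median imputation using scikit-learn's SimpleImputer; the DataFrame and column names are illustrative:
```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: Age is roughly symmetric (mean works), Income is skewed (median is safer).
df = pd.DataFrame({
    "Age": [25, np.nan, 32, 41, np.nan],
    "Income": [50000, 62000, np.nan, 58000, 1000000],
})

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

# fit_transform returns a 2D array, so ravel() flattens it back into a column.
df["Age"] = mean_imputer.fit_transform(df[["Age"]]).ravel()
df["Income"] = median_imputer.fit_transform(df[["Income"]]).ravel()
print(df)
```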
2) Data Transformation
What is Data Transformation?
Data transformation is the process of converting raw data into a more suitable format or structure
for analysis, to improve its quality and make it compatible with the requirements of a particular task
or system.
Data Transformation in Machine Learning
Data transformation is a key step in a machine learning pipeline: it modifies the raw data and
converts it into a format better suited for analysis and model training. In data transformation,
we usually deal with issues such as noise, missing values, outliers, and non-normality.
Why is Data Transformation Important?
Data transformation is crucial in the data analysis and machine learning pipeline as it plays an
important role in preparing raw data for meaningful insights and accurate model building. Raw
data, often sourced from diverse channels, may be inconsistent, contain missing values, or exhibit
variations that could impact the reliability of analyses.
Data transformation addresses these challenges by cleaning, encoding, and structuring the data in
a manner that makes it compatible with analytical tools and algorithms.
Additionally, data transformation facilitates feature engineering, allowing the creation of new
variables that may improve model performance. By converting data into a more suitable format,
data transformation ensures that models are trained on high-quality, standardized data, leading to
more reliable predictions and valuable insights.
Different Data Transformation Techniques
The choice of data transformation technique depends on the characteristics of the data and the
machine learning algorithm that we intend to use on the data. The main techniques are discussed
in detail below.
1. Normalization and Standardization
Standardization and normalization are two of the most common techniques used in data
transformation. They aim to scale and transform the data so that the features have similar
scales, which makes it easier for the machine learning algorithm to learn and converge.
Standardization:
Standardization, also known as z-score normalization (or zero-mean scaling), transforms a feature
so that its mean becomes 0 and its standard deviation becomes 1. It is usually useful when features
have different scales but follow an approximately normal distribution, and it prepares the data for
algorithms that assume standardized inputs. Note that standardization only centers and rescales the
data; it does not change the shape of its distribution. The new data point after standardization
becomes:

Xi' = (Xi - mean(X)) / std(X)

Here:
Xi is the original feature value.
Xi' is the scaled feature value.
mean(X) is the mean (average) of the feature across all data points.
std(X) is the standard deviation of the feature across all data points.
Explanation
1. Calculate Mean and Standard Deviation: For each feature, you calculate the mean
(average) and standard deviation. These statistics are used to determine the center and
the spread of the data.
2. Subtract the Mean: You subtract the mean of each feature from every data point. This
operation centers the data, making the new mean of the feature 0.
3. Divide by the Standard Deviation: You divide each data point by the standard deviation of
the feature. This scaling operation makes the standard deviation of the feature 1.
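A minimal sketch of standardization using scikit-learn's StandardScaler; the feature matrix here is illustrative:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has mean 0 and std 1

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```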
Normalization:
The main objective of normalization is to rescale the features to a common range of values, typically
(0, 1) or (-1, 1). Normalization is usually used when different features have very different ranges of
values and some features might otherwise dominate the learning process because of their larger
numerical values; rescaling equalizes the ranges and makes sure that the features contribute more
evenly to learning.
Min-Max Scaling:
Min-max scaling is the most common normalization technique. It rescales each feature to the chosen
range, by default (0, 1), using the feature's minimum and maximum values:
X' = (X - min(X)) / (max(X) - min(X)).
Advantages:
Simple and intuitive method.
Preserves the relationships between data points.
Suitable for algorithms that assume data within a bounded range.
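A minimal sketch of min-max normalization using scikit-learn's MinMaxScaler; the feature matrix is illustrative:
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = MinMaxScaler()            # feature_range=(0, 1) by default
X_norm = scaler.fit_transform(X)   # applies X' = (X - min) / (max - min) per column
print(X_norm)                      # every column is rescaled to [0, 1]
```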
Encoding Categorical Variables
Often, some features of a dataset are categorical (labeled with different categories), but most
machine learning algorithms work better on numeric features than on other data types.
Therefore, encoding categorical features becomes an important step of data transformation.
Categorical features can be encoded into numerical features in different ways.
Some of the most common encoding techniques:
Ordinal Encoding:
Ordinal encoding assigns a unique numeric value to each category of a feature according to some
kind of hierarchy or natural order. For example, if a categorical feature has the three categories
"High-School", "Bachelor's", and "Master's", ordinal encoding can label them 0, 1, 2 based on the
educational hierarchy; similarly, a feature called size with the values 'small', 'medium', 'large'
could be labeled 0, 1, 2 respectively.
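A minimal sketch of ordinal encoding with an explicit hierarchy using scikit-learn's OrdinalEncoder; the Size feature and its values are illustrative:
```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Size": ["small", "large", "medium", "small"]})

# Passing the category order makes the encoding respect the hierarchy:
# small -> 0, medium -> 1, large -> 2.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["Size_encoded"] = encoder.fit_transform(df[["Size"]]).ravel()
print(df)
```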
Nominal Encoding: The standard technique for encoding nominal (unordered) categorical data is
one-hot encoding.
One-Hot Encoding:
One-hot encoding is the most common encoding technique used in data transformation for nominal
categorical data. It converts each category of a categorical feature into a separate binary feature
(0 or 1). For example, if a dataset has a feature called 'vehicle' with the categories 'car', 'bike',
and 'bicycle', one-hot encoding will create three separate columns such as 'is_car', 'is_bike', and
'is_bicycle', labeled 1 if the category is present and 0 if it is absent.
How One-Hot Encoding Works: An Example
To grasp the concept better, let's explore a simple example. Imagine we have a dataset with fruits
(a categorical feature) and their corresponding prices. Using one-hot encoding we can transform these
categorical values into numerical form. For example:
Wherever the fruit is "Apple," the Apple column will have a value of 1 while the other fruit
columns (like Mango or Orange) will contain 0.
This pattern ensures that each categorical value gets its own column represented with binary
values (1 or 0) making it usable for machine learning models.
The output after applying one-hot encoding to the data is as follows:
So, <1 0 0> will represent the fruit “Apple”.
< 0 1 0> will represent the fruit “Mango”.
< 0 0 1> will represent the fruit “Orange”.
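A minimal sketch of one-hot encoding the fruit example using pandas get_dummies; the prices are illustrative:
```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apple", "Mango", "Orange", "Apple"],
    "Price": [120, 80, 60, 110],
})

# get_dummies creates one binary column per category (dtype=int gives 0/1 output).
encoded = pd.get_dummies(df, columns=["Fruit"], dtype=int)
print(encoded)
# Each row has a 1 in exactly one of Fruit_Apple / Fruit_Mango / Fruit_Orange.
```
pd.get_dummies is convenient for quick exploration; scikit-learn's OneHotEncoder does the same transformation and is typically preferred inside a modeling pipeline, because the fitted encoder can be reused on new data.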
Advantages and Disadvantages of One Hot Encoding
Advantages of Using One Hot Encoding
1. It allows the use of categorical variables in models that require numerical input.
2. It can improve model performance by providing more information to the model about the
categorical variable.
3. It can help to avoid the problem of ordinality which can occur when a categorical variable has a
natural ordering (e.g. "small", "medium", "large").
Disadvantages of Using One Hot Encoding
1. It can lead to increased dimensionality as a separate column is created for each category in the
variable. This can make the model more complex and slow to train.
2. It can lead to sparse data as most observations will have a value of 0 in most of the one-hot
encoded columns.
3. It can lead to overfitting especially if there are many categories in the variable and the sample
size is relatively small.
Feature Creation
The process of creating new features or modifying existing features to improve the performance
of a machine learning model is called feature engineering. It helps create a more informative and
effective representation of the patterns present in the data by combining and transforming the given
features. Through feature engineering we can improve our model's performance and generalization
ability.
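A minimal sketch of feature creation with pandas; the columns and the derived ratio feature are illustrative assumptions:
```python
import pandas as pd

df = pd.DataFrame({
    "Income": [50000, 62000, 58000],
    "LoanAmount": [10000, 25000, 5000],
})

# Combine existing columns into a new, potentially more informative feature.
df["LoanToIncomeRatio"] = df["LoanAmount"] / df["Income"]
print(df)
```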
Advantages and Disadvantages of Data Transformation
Data transformation has several advantages, but it also has drawbacks that we must pay attention
to in order to achieve the goals of each project. Some of the advantages and disadvantages of data
transformation to keep in mind, so that we can best improve our model's performance, are:
Advantages
Improved Model Performance: The model gets better at generalizing new data when the
issues in data are resolved through data transformation.
Handling Missing Data: Properly handled missing data results in a noticeable increase in model accuracy.
Better Convergence: Data normalization and standardization result in better convergence
of the model during training.
Dimensionality Reduction: Simplifies the model training process.
Better Insights from Feature Engineering
Disadvantages:
Information Loss: If a transformation is applied too aggressively, valuable details might get discarded,
leading to a weaker model.
Data Leakage: Applying transformations inappropriately, e.g., fitting them on the entire dataset
including the test data, can lead to overestimation of model performance (see the sketch after this list).
Increased Complexity
Assumption Violation: Sometimes a transformation might not align with the assumptions of the
chosen machine learning algorithm, which can lead to poor model performance in general.
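A minimal sketch of how to avoid this kind of leakage: fit the transformer on the training split only and reuse its statistics on the test split. The data here is randomly generated for illustration:
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from the training split only
X_test_scaled = scaler.transform(X_test)        # the same statistics reused on the test split
```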
Thus, data transformation is a crucial step in machine learning, and it requires careful
consideration of which techniques to apply to the data. Data transformation can improve model
performance, but cross-validation and evaluating model performance on unseen data are essential
steps to ensure that the chosen data transformation techniques are appropriate and effective for
the given task.
3) Data/Feature Reduction
There are two main techniques: Feature Selection and Feature Extraction.
Feature Extraction
Feature extraction is a crucial step in the machine learning pipeline, enabling the development of
more effective and efficient models by transforming raw data into a more meaningful and
manageable representation.
Why is Feature Extraction Important?
Improved Model Performance:
Feature extraction can lead to better accuracy and generalization by focusing on the most relevant
information, reducing noise, and simplifying the model's task.
Reduced Dimensionality:
By extracting key features, it can reduce the number of variables, making the data easier to process
and analyze, especially in high-dimensional datasets.
Increased Efficiency:
Processing fewer, more informative features can lead to faster training times and better
computational efficiency.
Enhanced Interpretability:
Extracted features can be more understandable and interpretable than raw data, making it easier to
understand the underlying patterns and relationships.
Key Concepts in Feature Extraction:
Dimensionality Reduction: Reducing the number of features while retaining essential information,
using techniques such as Principal Component Analysis (PCA).
Introduction to Dimensionality/Feature Reduction
Dimensionality reduction helps to reduce the number of features while retaining key information.
Techniques like principal component analysis (PCA), singular value decomposition
(SVD) and linear discriminant analysis (LDA) convert data into a lower-dimensional space while
preserving important details.
Principal Component Analysis (PCA) is a powerful technique used for feature extraction, particularly
in situations involving high-dimensional data.
It transforms the original features into a new set of uncorrelated features called principal
components, which capture the maximum variance in the data. By selecting a subset of these
principal components (those with the highest variance), PCA effectively reduces dimensionality while
retaining the most important information. This process is often used as a preprocessing step for
machine learning algorithms, improving their efficiency and performance.
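A minimal sketch of PCA-based dimensionality reduction with scikit-learn; the synthetic data and the choice of two components are illustrative:
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 10)                   # 100 samples, 10 original features

X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)                     # keep the 2 highest-variance components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                        # (100, 2)
print(pca.explained_variance_ratio_)          # variance captured by each component
```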
Advantages of Dimensionality Reduction
As seen earlier, high dimensionality makes models inefficient. Let's now summarize the key
advantages of reducing dimensionality.
Faster Computation: With fewer features, machine learning algorithms can process data more
quickly. This results in faster model training and testing, which is particularly useful when
working with large datasets.
Better Visualization: Reducing dimensions makes it easier to visualize data and reveal hidden
patterns.
Prevent Overfitting: With fewer features, models are less likely to memorize the training data and
overfit. This helps the model generalize better to new, unseen data, improving its ability to make
accurate predictions.
4) Outlier Detection and Treatment
What is an Outlier?
An outlier is essentially a statistical anomaly, a data point that significantly deviates from other
observations in a dataset. Outliers can arise due to measurement errors, natural variation, or rare
events.
Common Techniques Used for Detecting Outliers
Outlier detection is a critical task in data analysis, essential for ensuring the quality and reliability of
conclusions drawn from data. Different techniques are tailored to varying data types and
scenarios, ranging from statistical methods for general datasets to specialized algorithms for
spatial and temporal data. Such techniques include:
Standard Deviation Method
Standard Deviation Method is based on the assumption that data follows a normal distribution.
Outliers are defined as those observations that lie beyond a specified number of standard
deviations away from the mean. Typically, data points outside of three standard deviations from
the mean are considered outliers.
Z-Score Method
The Z-score method calculates the number of standard deviations each data point is from the
mean. A Z-score threshold is set, commonly 3, and any data point with a Z-score exceeding this
threshold is considered an outlier. This method assumes a normal distribution and is sensitive to
extreme values in small datasets.
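A minimal sketch of the Z-score method with a threshold of 3; the data is synthetic, with two extreme values injected for illustration:
```python
import numpy as np

rng = np.random.default_rng(0)
# 200 points drawn around 50, plus two injected extreme values.
data = np.append(rng.normal(loc=50, scale=5, size=200), [120, -30])

z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 3]
print(outliers)  # only the injected extreme values exceed the threshold
```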
Outlier treatment refers to the methods used to handle unusual data points that deviate significantly
from the rest of the dataset. These methods aim to either minimize the impact of outliers on analysis
or remove them entirely, depending on the context and the nature of the outliers. Common
approaches include removing outliers, transforming the data, using robust statistical methods, or
replacing outliers with more representative values.
Common Outlier Treatment Methods:
Removing Outliers:
This involves deleting the outlier data points from the dataset. This is a straightforward approach,
but it can lead to a loss of information if the outliers are meaningful or if there are a large number of
them.
Transforming Data:
Applying mathematical functions like logarithms, square roots, or reciprocals to the data can reduce
the influence of outliers and make the data distribution more normal.