7 Data Preprocessing Steps in Machine Learning

1. Acquire the Dataset

Naturally, data collection is the first step in any machine learning project and the first among the data
preprocessing steps. Gathering data might seem like a straightforward process, but it’s far from that.

Most companies end up with data kept in silos, divided across many departments, teams, and
digital solutions. For example, the marketing team might have access to a CRM system, but that
system may operate in isolation from the web analytics solution. Combining all of these data
streams into consolidated storage can be challenging.

2. Import Libraries

Next, it’s time to import the libraries you’ll need for your machine learning project. A library is a
collection of functions that an algorithm can call and utilize.

You can streamline data preprocessing by using tools and frameworks that make the process
easier to organize and execute. Without the right libraries, solutions that would otherwise be
one-liners might take hours to code and optimize.
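
If you work in Python, a minimal sketch of typical imports might look like this (assuming NumPy, pandas, and scikit-learn are the chosen stack; other tools work just as well):

# Common Python libraries for data preprocessing (assumes NumPy, pandas,
# and scikit-learn are installed).
import numpy as np                                                # numerical arrays and math
import pandas as pd                                               # tabular data loading and manipulation
from sklearn.preprocessing import StandardScaler, OneHotEncoder   # scaling and encoding
from sklearn.model_selection import train_test_split              # dataset splitting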

3. Import Datasets

The next key step is to load the data that will be utilized in the machine learning algorithm. This is the
most critical machine learning preprocessing step.

Many companies start by storing data in warehouses that require data to pass through an ETL
pipeline. The problem with this method is that you never know in advance which data will be useful
for an ML project. As a result, warehouses are typically accessed through business intelligence
interfaces to monitor metrics that are already known to matter.

Data lakes are used for both structured and unstructured data, including photos, videos, voice
recordings, and PDF files. However, even when data is structured, it’s not transformed prior to
storage. You load the data in its present condition and then decide how to use and alter it later.
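
As a minimal illustration, loading a CSV file with pandas might look like this (the file name customers.csv is a hypothetical placeholder):

import pandas as pd

# Load a tabular dataset from a CSV file (the path is a hypothetical example).
df = pd.read_csv("customers.csv")

# Inspect the first rows and the column types before preprocessing.
print(df.head())
print(df.dtypes)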

4. Check for Missing Values

Evaluate the data and look for missing values. Missing values can break actual data trends and
potentially result in additional data loss when entire rows and columns are deleted due to a few
missing cells in the dataset.

If you discover any, you can choose between two methods to deal with this issue (a code sketch follows the list):

 Remove the whole row with a missing value. However, eliminating the full row increases the
likelihood of losing some critical data. This strategy is beneficial if the dataset is massive.

 Estimate the value using the mean, median, or mode.
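
Here is a minimal pandas sketch of both options, using a toy DataFrame (the column names and values are illustrative, not from the article):

import pandas as pd
import numpy as np

# Toy DataFrame with missing values (illustrative data only).
df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["Oslo", "Lima", None]})

# Option 1: drop every row that contains at least one missing value.
df_dropped = df.dropna()

# Option 2: impute the numeric column with its mean (median or mode work the same way).
df["age"] = df["age"].fillna(df["age"].mean())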

5. Encode the Data

Non-numerical data is incomprehensible to machine learning models. To avoid issues later on, the
data should be represented numerically. The answer to this problem is to convert all text values
into numerical form.
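
For example, a minimal pandas sketch of one-hot encoding a text column (the hair_color column is a hypothetical example):

import pandas as pd

# Toy categorical data (illustrative only).
df = pd.DataFrame({"hair_color": ["brown", "black", "unknown", "brown"]})

# One-hot encode the text column: each category becomes a separate 0/1 column.
encoded = pd.get_dummies(df, columns=["hair_color"])
print(encoded)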

6. Scaling

Scaling is unnecessary for non-distance-based algorithms (such as decision trees). Distance-based
models, on the other hand, require all features to be scaled.

These are some of the more common scaling approaches:

 Min-Max Scaler – Reduces the feature values to any chosen range (for example, between zero and four)

 Standard Scaler – Assumes that the variable is normally distributed and then scales it so that the standard deviation is one and the distribution is centered at zero

 Robust Scaler – Performs best when the dataset contains outliers; after removing the median, the data is scaled based on the interquartile range

 Max-Abs Scaler – Similar to the min-max scaler, except that instead of a given range, the feature is scaled by its greatest absolute value
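
As a rough mapping of the list above onto code, these scalers all exist in scikit-learn (assuming it is installed); the toy matrix is illustrative:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10000.0]])  # toy feature matrix

# Min-Max: squeeze each feature into a chosen range.
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Standard: zero mean, unit standard deviation per feature.
X_std = StandardScaler().fit_transform(X)

# Robust: center on the median and scale by the interquartile range (outlier-resistant).
X_robust = RobustScaler().fit_transform(X)

# Max-Abs: divide each feature by its maximum absolute value.
X_maxabs = MaxAbsScaler().fit_transform(X)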

7. Split Dataset Into Training, Evaluation and Validation Sets

This is the final step among the data preprocessing steps. It's time to divide your dataset into
training, evaluation, and validation sets. The training set is the data you'll use to fit your machine
learning model. The validation set is used during development to tune the model and compare
alternatives, while the evaluation set provides a final, unbiased check of the trained model.
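
One common way to produce the three subsets is to call scikit-learn's train_test_split twice; the 60/20/20 proportions below are an illustrative choice rather than something the article prescribes:

from sklearn.model_selection import train_test_split
import numpy as np

X, y = np.arange(100).reshape(50, 2), np.arange(50)  # toy data

# First split off the training set (60%), then divide the remainder 50/50
# into validation and evaluation sets (20% each).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_eval, y_val, y_eval = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)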

Data Preprocessing Examples and Techniques

Data Transformation

One of the most important stages in the preparation phase is data transformation, which changes
data from one format to another. Some algorithms require that the input data be transformed; if you
skip this step, you may see poor model performance or even introduce bias.

For example, the KNN model uses distance measurements to determine which neighbors are closest
to a particular record. If you have a feature with a particularly high scale relative to the other
features in your model, your model will likely employ this feature more than the others, resulting in a
bias.
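
As an illustration of this point, here is a hedged sketch (assuming scikit-learn) that places the scaler and the KNN model in one pipeline, so the transformation is always applied before distances are computed:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Scaling inside the pipeline keeps any single large-scale feature from
# dominating the distance calculation.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# knn.fit(X_train, y_train) would then train on already-scaled features.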

Feature Engineering

Feature engineering is used to produce better features for your dataset, which improves the
model's performance. It relies mostly on domain knowledge: new features are generated manually by
applying transformations to existing ones.

Here are some simple examples to help you understand this:


Imagine that you have a hair color feature in your data with values of brown, black, or unknown. In
this scenario, you may add a new column named “has color” and assign 1 if there is a color and 0 if
the value is unknown.

Another example is deconstructing a date/time feature, which provides significant information but is
difficult for a model to use in its original format. So, if you believe your problem involves temporal
dependencies and you discover a link between the date/time and the output variable, spend some
time trying to turn that date/time column into a more intelligible feature for your model, such as
“period of the day,” “day of the week,” or so on.
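
Both examples can be sketched in a few lines of pandas; the column names and values below are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "hair_color": ["brown", "unknown", "black"],
    "timestamp": pd.to_datetime(["2024-01-05 08:30", "2024-01-06 22:10", "2024-01-07 13:45"]),
})

# "has color": 1 when the color is known, 0 when it is unknown.
df["has_color"] = (df["hair_color"] != "unknown").astype(int)

# Decompose the date/time into features the model can use more easily.
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["hour"] = df["timestamp"].dt.hour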

Imbalanced Data

One of the most prevalent issues you may encounter while working with real-world data
categorization is that the classes are unbalanced (one contains more samples than the other),
resulting in a significant bias for the model.

Imagine that you’d like to forecast whether a transaction is fraudulent. In your training data, 95% of
the records are legitimate transactions, whereas just 5% are fraudulent. Trained on this, your model
will most likely forecast the majority class and treat fraudulent transactions as legitimate.

To solve this weakness in the dataset, you can use three techniques (a code sketch follows the list):

 Oversampling – Oversampling is the technique of augmenting your dataset with generated
data from the minority class. The Synthetic Minority Oversampling Technique (SMOTE) is the
most commonly used method for doing this; it creates synthetic examples by interpolating
between a randomly selected minority-class sample and its nearest neighbors.

 Undersampling – Undersampling is the process of shrinking a dataset by eliminating
genuine data from the majority class. The two primary algorithms used in this method
are TomekLinks, which eliminates observations based on the nearest neighbor, and Edited
Nearest Neighbors (ENN).

 Hybrid – The hybrid strategy combines both oversampling and undersampling in your
dataset. One of the methods used in this technique is SMOTEENN, which uses the SMOTE
algorithm for minority oversampling and the ENN algorithm for majority undersampling.
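
A minimal sketch of all three techniques, assuming the imbalanced-learn package (imported as imblearn) is installed; the generated dataset stands in for real transaction data:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTEENN

# Toy imbalanced dataset (roughly 95% vs. 5%), standing in for real transactions.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)      # oversampling
X_under, y_under = TomekLinks().fit_resample(X, y)              # undersampling
X_both, y_both = SMOTEENN(random_state=42).fit_resample(X, y)   # hybrid
print(Counter(y), Counter(y_over), Counter(y_under), Counter(y_both))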

Sampling Data

The more data you have, the higher the model's accuracy. Still, some machine learning algorithms
may struggle to handle a very large quantity of data, resulting in issues such as memory saturation,
increased computation to update the model parameters, and so on.

To overcome this issue, you can use the following data sampling techniques:

 Sampling without replacement – This method prevents repeating the same data in the sample: once a record is chosen, it is removed from the population.

 Sampling with replacement – This method doesn't remove the object from the population, so it can be picked more than once and may appear in the sample several times.

 Stratified sampling – This is a more sophisticated approach that involves partitioning the data and taking random samples from each partition. In circumstances where the classes are disproportional, this method maintains the proportionate number of classes based on the original data.

 Progressive sampling – This last strategy starts with a tiny dataset and gradually increases it until a suitable sample size is achieved.
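
A brief sketch of the first three techniques with pandas and scikit-learn (the DataFrame and proportions are illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"value": range(100), "label": [0] * 90 + [1] * 10})  # toy data

sample_no_repl = df.sample(n=20, replace=False, random_state=42)  # sampling without replacement
sample_repl = df.sample(n=20, replace=True, random_state=42)      # sampling with replacement

# Stratified sampling: keep the 90/10 class proportions in the sampled subset.
strat_sample, _ = train_test_split(df, train_size=0.2, stratify=df["label"], random_state=42)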

Data Preprocessing Best Practices

1. Data Cleaning

The goal here is to identify the simplest solution to correct quality concerns, such as removing
incorrect data, filling in missing data, or ensuring the raw data is appropriate for feature engineering.

2. Data Reduction

Raw data collections often contain duplicate data resulting from diverse methods of defining events,
as well as material that just doesn’t work for your machine learning architecture or project scope.

Data reduction techniques, such as principal component analysis, are used to convert raw data into a
simplified format that is appropriate for certain use cases.
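
For instance, a minimal scikit-learn sketch of principal component analysis (the component count and toy data are arbitrary illustrations):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(42).rand(100, 10)  # toy data: 100 rows, 10 features

# Project the 10 original features onto the 2 directions with the most variance.
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (100, 2)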

3. Data Transformation

Data scientists consider how different components of the data should be structured to achieve the
best results. This might entail arranging unstructured data, merging salient variables where it makes
sense, or determining which ranges to focus on.

4. Data Enrichment

In this stage, data practitioners use various feature engineering libraries on the data to achieve the
needed changes. The end result should be a data set arranged in such a way that it strikes the best
balance between training time for a new model and compute requirements.

5. Data Validation

Data validation starts with separating data into two sets. The first set is used to train a machine
learning or deep learning algorithm. The second one serves as the test data, which is used to assess
the correctness and robustness of the final model. This second stage helps to identify any issues with
the hypothesis used in data cleaning and feature engineering.
If the team is pleased with the results, they may assign the preprocessing assignment to a data
engineer, who will choose how to scale it for production. If not, the data practitioners can go back
and adjust how they carry out the data cleaning and feature engineering procedures.

Data Preprocessing with lakeFS

Keeping track of many versions of data can be challenging in its own right. Without proper
coordination, balance, and precision, everything can fall apart easily. And this is the last place you
want to be when starting a new machine learning project.

Data versioning is a method that allows you to keep track of many versions of the same data without
incurring significant storage expenses.

Creating machine learning models takes more than simply executing code; it also calls for training
data and the appropriate parameters. Updating machine learning models is an iterative process
where you need to keep track of all previous modifications.

Data versioning enables you to keep a snapshot of the training data and experimental results, making
implementation easier at each iteration.

Versioning data also helps to achieve compliance. Almost every company is subject to data
protection regulations such as GDPR, which require them to keep certain information in order to
verify compliance and the history of data sources. In this scenario, data versioning can benefit both
internal and external audits.

Many data preprocessing tasks become more efficient when data is maintained in the same way as
code. Data versioning tools like lakeFS help you implement data versioning at every stage of the
data’s lifecycle.

lakeFS provides zero-copy isolation together with pre-commit and pre-merge hooks for building an
automated process. All in all, it's a massive helping hand for data quality assessment using the
practices above. Check out lakeFS on GitHub to learn more.
