7 Data Preprocessing Steps in Machine Learning

1. Acquire the Dataset

Naturally, data collection is the first step in any machine learning project and the first among the data
preprocessing steps. Gathering data might seem like a straightforward process, but it’s far from that.

Most companies end up with data kept in silos, divided across many departments, teams, and
digital solutions. For example, the marketing team might have access to a CRM system, but that
system may operate in isolation from the web analytics solution. Combining all of these data
streams into consolidated storage can be challenging.

2. Import Libraries

Next, it’s time to import the libraries you’ll need for your machine learning project. A library is a
collection of functions that an algorithm can call and utilize.

You can streamline data preprocessing by using tools and frameworks that make the process
easier to organize and execute. Without the right libraries, solutions that would otherwise be
one-liners might take hours to code and optimize.
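
If you work in Python, a minimal sketch of typical imports might look like this (assuming NumPy, pandas, and scikit-learn are the chosen stack; other tools work just as well):

# Common Python libraries for data preprocessing (assumes NumPy, pandas,
# and scikit-learn are installed).
import numpy as np                                                # numerical arrays and math
import pandas as pd                                               # tabular data loading and manipulation
from sklearn.preprocessing import StandardScaler, OneHotEncoder   # scaling and encoding
from sklearn.model_selection import train_test_split              # dataset splitting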

3. Import Datasets

The next key step is to load the data that will be utilized in the machine learning algorithm. This is the
most critical machine learning preprocessing step.

Many companies start by storing data in warehouses that require data to pass through an ETL
pipeline. The problem with this method is that you never know in advance which data will be useful
for an ML project. As a result, warehouses are typically accessed through business intelligence
interfaces to monitor metrics that are already known to matter.

Data lakes are used for both structured and unstructured data, including photos, videos, voice
recordings, and PDF files. However, even when data is structured, it’s not transformed prior to
storage. You load the data in its present condition and then decide how to use and alter it later.
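
As a minimal illustration, loading a CSV file with pandas might look like this (the file name customers.csv is a hypothetical placeholder):

import pandas as pd

# Load a tabular dataset from a CSV file (the path is a hypothetical example).
df = pd.read_csv("customers.csv")

# Inspect the first rows and the column types before preprocessing.
print(df.head())
print(df.dtypes)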

4. Check for Missing Values

Evaluate the data and look for missing values. Missing values can break actual data trends and
potentially result in additional data loss when entire rows and columns are deleted due to a few
missing cells in the dataset.

If you discover any, you can choose between two methods to deal with this issue (a code sketch follows the list):

 Remove the whole row with a missing value. However, eliminating the full row increases the
likelihood of losing some critical data. This strategy is beneficial if the dataset is massive.

 Estimate the value using the mean, median, or mode.
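
Here is a minimal pandas sketch of both options, using a toy DataFrame (the column names and values are illustrative, not from the article):

import pandas as pd
import numpy as np

# Toy DataFrame with missing values (illustrative data only).
df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["Oslo", "Lima", None]})

# Option 1: drop every row that contains at least one missing value.
df_dropped = df.dropna()

# Option 2: impute the numeric column with its mean (median or mode work the same way).
df["age"] = df["age"].fillna(df["age"].mean())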

5. Encode the Data

Non-numerical data is incomprehensible to machine learning models. To avoid issues later on, the
data should be represented numerically. The answer to this problem is to convert all text values
into numerical form.
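
For example, a minimal pandas sketch of one-hot encoding a text column (the hair_color column is a hypothetical example):

import pandas as pd

# Toy categorical data (illustrative only).
df = pd.DataFrame({"hair_color": ["brown", "black", "unknown", "brown"]})

# One-hot encode the text column: each category becomes a separate 0/1 column.
encoded = pd.get_dummies(df, columns=["hair_color"])
print(encoded)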

6. Scaling

Scaling is unnecessary for non-distance-based algorithms (such as decision trees). Distance-based
models, on the other hand, require all features to be scaled.

These are some of the more common scaling approaches:

 Min-Max Scaler – Reduces the feature values to any chosen range (for example, between zero and four)

 Standard Scaler – Assumes that the variable is normally distributed and then scales it so that the standard deviation is one and the distribution is centered at zero

 Robust Scaler – Performs best when the dataset contains outliers; after removing the median, the data is scaled based on the interquartile range

 Max-Abs Scaler – Similar to the min-max scaler, except that instead of a given range, the feature is scaled by its greatest absolute value
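
As a rough mapping of the list above onto code, these scalers all exist in scikit-learn (assuming it is installed); the toy matrix is illustrative:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10000.0]])  # toy feature matrix

# Min-Max: squeeze each feature into a chosen range.
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Standard: zero mean, unit standard deviation per feature.
X_std = StandardScaler().fit_transform(X)

# Robust: center on the median and scale by the interquartile range (outlier-resistant).
X_robust = RobustScaler().fit_transform(X)

# Max-Abs: divide each feature by its maximum absolute value.
X_maxabs = MaxAbsScaler().fit_transform(X)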

7. Split Dataset Into Training, Evaluation and Validation Sets

This is the final step among the data preprocessing steps. It's time to divide your dataset into
training, evaluation, and validation sets. The training set is the data you'll use to fit your machine
learning model. The validation set is used during development to tune the model and compare
alternatives, while the evaluation set provides a final, unbiased check of the trained model.
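
One common way to produce the three subsets is to call scikit-learn's train_test_split twice; the 60/20/20 proportions below are an illustrative choice rather than something the article prescribes:

from sklearn.model_selection import train_test_split
import numpy as np

X, y = np.arange(100).reshape(50, 2), np.arange(50)  # toy data

# First split off the training set (60%), then divide the remainder 50/50
# into validation and evaluation sets (20% each).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_eval, y_val, y_eval = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)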

Data Preprocessing Examples and Techniques

Data Transformation

One of the most important stages in the preparation phase is data transformation, which changes
data from one format to another. Some algorithms require that the input data be transformed; if you
skip this step, you may see poor model performance or even introduce bias.

For example, the KNN model uses distance measurements to determine which neighbors are closest
to a particular record. If you have a feature with a particularly high scale relative to the other
features in your model, your model will likely employ this feature more than the others, resulting in a
bias.
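
As an illustration of this point, here is a hedged sketch (assuming scikit-learn) that places the scaler and the KNN model in one pipeline, so the transformation is always applied before distances are computed:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Scaling inside the pipeline keeps any single large-scale feature from
# dominating the distance calculation.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# knn.fit(X_train, y_train) would then train on already-scaled features.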

Feature Engineering

Feature engineering is used to produce better features for your dataset, which improves the
model's performance. It relies mostly on domain knowledge: new features are generated manually by
applying transformations to existing ones.

Here are some simple examples to help you understand this:


Imagine that you have a hair color feature in your data with values of brown, black, or unknown. In
this scenario, you may add a new column named “has color” and assign 1 if there is a color and 0 if
the value is unknown.

Another example is deconstructing a date/time feature, which provides significant information but is
difficult for a model to use in its original format. So, if you believe your problem involves temporal
dependencies and you discover a link between the date/time and the output variable, spend some
time trying to turn that date/time column into a more intelligible feature for your model, such as
“period of the day,” “day of the week,” or so on.
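
Both examples can be sketched in a few lines of pandas; the column names and values below are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "hair_color": ["brown", "unknown", "black"],
    "timestamp": pd.to_datetime(["2024-01-05 08:30", "2024-01-06 22:10", "2024-01-07 13:45"]),
})

# "has color": 1 when the color is known, 0 when it is unknown.
df["has_color"] = (df["hair_color"] != "unknown").astype(int)

# Decompose the date/time into features the model can use more easily.
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["hour"] = df["timestamp"].dt.hour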

Imbalanced Data

One of the most prevalent issues you may encounter while working with real-world data
categorization is that the classes are unbalanced (one contains more samples than the other),
resulting in a significant bias for the model.

Imagine that you’d like to forecast whether a transaction is fraudulent. In your training data, 95% of
the records are legitimate transactions, whereas just 5% are fraudulent. Trained on this, your model
will most likely forecast the majority class and treat fraudulent transactions as legitimate.

To solve this weakness in the dataset, you can use three techniques (a code sketch follows the list):

 Oversampling – Oversampling is the technique of augmenting your dataset with generated
data from the minority class. The Synthetic Minority Oversampling Technique (SMOTE) is the
most commonly used method for doing this; it creates synthetic examples by interpolating
between a randomly selected minority-class sample and its nearest neighbors.

 Undersampling – Undersampling is the process of shrinking a dataset by eliminating
genuine data from the majority class. The two primary algorithms used in this method
are TomekLinks, which eliminates observations based on the nearest neighbor, and Edited
Nearest Neighbors (ENN).

 Hybrid – The hybrid strategy combines both oversampling and undersampling in your
dataset. One of the methods used in this technique is SMOTEENN, which uses the SMOTE
algorithm for minority oversampling and the ENN algorithm for majority undersampling.
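
A minimal sketch of all three techniques, assuming the imbalanced-learn package (imported as imblearn) is installed; the generated dataset stands in for real transaction data:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTEENN

# Toy imbalanced dataset (roughly 95% vs. 5%), standing in for real transactions.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)      # oversampling
X_under, y_under = TomekLinks().fit_resample(X, y)              # undersampling
X_both, y_both = SMOTEENN(random_state=42).fit_resample(X, y)   # hybrid
print(Counter(y), Counter(y_over), Counter(y_under), Counter(y_both))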

Sampling Data

The more data you have, the higher the model's accuracy. Still, some machine learning algorithms
may struggle to handle a very large quantity of data, resulting in issues such as memory saturation,
increased computation to update the model parameters, and so on.

To overcome this issue, you can use the following data sampling techniques:

 Sampling without replacement – This method prevents repeating the same data in the sample: once a record is chosen, it is removed from the population.

 Sampling with replacement – This method doesn't remove the object from the population, so it can be picked more than once and may appear in the sample several times.

 Stratified sampling – This is a more sophisticated approach that involves partitioning the data and taking random samples from each partition. In circumstances where the classes are disproportional, this method maintains the proportionate number of classes based on the original data.

 Progressive sampling – This last strategy starts with a tiny dataset and gradually increases it until a suitable sample size is achieved.
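
A brief sketch of the first three techniques with pandas and scikit-learn (the DataFrame and proportions are illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"value": range(100), "label": [0] * 90 + [1] * 10})  # toy data

sample_no_repl = df.sample(n=20, replace=False, random_state=42)  # sampling without replacement
sample_repl = df.sample(n=20, replace=True, random_state=42)      # sampling with replacement

# Stratified sampling: keep the 90/10 class proportions in the sampled subset.
strat_sample, _ = train_test_split(df, train_size=0.2, stratify=df["label"], random_state=42)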

Data Preprocessing Best Practices

1. Data Cleaning

The goal here is to identify the simplest solution to correct quality concerns, such as removing
incorrect data, filling in missing data, or ensuring the raw data is appropriate for feature engineering.

2. Data Reduction

Raw data collections often contain duplicate data resulting from diverse methods of defining events,
as well as material that just doesn’t work for your machine learning architecture or project scope.

Data reduction techniques, such as principal component analysis, are used to convert raw data into a
simplified format that is appropriate for certain use cases.
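
For instance, a minimal scikit-learn sketch of principal component analysis (the component count and toy data are arbitrary illustrations):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(42).rand(100, 10)  # toy data: 100 rows, 10 features

# Project the 10 original features onto the 2 directions with the most variance.
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (100, 2)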

3. Data Transformation

Data scientists consider how different components of the data should be structured to achieve the
best results. This might entail arranging unstructured data, merging salient variables where it makes
sense, or determining which ranges to focus on.

4. Data Enrichment

In this stage, data practitioners use various feature engineering libraries on the data to achieve the
needed changes. The end result should be a data set arranged in such a way that it strikes the best
balance between training time for a new model and compute requirements.

5. Data Validation

Data validation starts with separating data into two sets. The first set is used to train a machine
learning or deep learning algorithm. The second one serves as the test data, which is used to assess
the correctness and robustness of the final model. This second stage helps to identify any issues with
the hypothesis used in data cleaning and feature engineering.
If the team is pleased with the results, they may assign the preprocessing assignment to a data
engineer, who will choose how to scale it for production. If not, the data practitioners can go back
and adjust how they carry out the data cleaning and feature engineering procedures.

Data Preprocessing with lakeFS

Keeping track of many versions of data can be challenging in its own right. Without proper
coordination, balance, and precision, everything can fall apart easily. And this is the last place you
want to be when starting a new machine learning project.

Data versioning is a method that allows you to keep track of many versions of the same data without
incurring significant storage expenses.

Creating machine learning models takes more than simply executing code; it also calls for training
data and the appropriate parameters. Updating machine learning models is an iterative process
where you need to keep track of all previous modifications.

Data versioning enables you to keep a snapshot of the training data and experimental results, making
implementation easier at each iteration.

Versioning data also helps to achieve compliance. Almost every company is subject to data
protection regulations such as GDPR, which require them to keep certain information in order to
verify compliance and the history of data sources. In this scenario, data versioning can benefit both
internal and external audits.

Many data preprocessing tasks become more efficient when data is maintained in the same way as
code. Data versioning tools like lakeFS help you implement data versioning at every stage of the
data’s lifecycle.

lakeFS provides zero-copy isolation together with pre-commit and pre-merge hooks for building an
automated process. All in all, it's a massive helping hand for data quality assessment using the
practices above. Check out lakeFS on GitHub to learn more.
