Data Preprocessing

Data preprocessing in Machine Learning is essential for transforming raw data into a format suitable for building and training models, enhancing data quality and enabling accurate predictions. It involves several steps including data cleaning, integration, transformation, and reduction to address issues like missing values, noise, and inconsistencies. Quality assessment ensures data completeness, accuracy, consistency, and validity, which are critical for effective data analysis and decision-making.

Uploaded by Karuna Salgotra
Data Preprocessing

• Data preprocessing includes the steps we need to follow to transform or encode data so that it may be easily parsed by the machine.
• For a model to be accurate and precise in its predictions, the algorithm must be able to easily interpret the data's features.
• Data preprocessing in Machine Learning is a crucial step that helps enhance the quality of data to promote the extraction of meaningful insights from it.
• Data preprocessing in Machine Learning refers to the technique of preparing (cleaning and organizing) the raw data to make it suitable for building and training Machine Learning models.
Data Preprocessing Cont..
• In simple words, data preprocessing in Machine Learning is a data
mining technique that transforms raw data into an
understandable and readable format.
Types of Data in Machine Learning

• Nominal: Denotes names or categories (e.g., Name, Gender, Pin code). Ordering is not possible; only counting is meaningful. Measure of central tendency: Mode.
• Ordinal: Values can be ordered (e.g., Low, Medium, High; Rank 1st, 2nd, 3rd), but the distance between values cannot be measured. Measure of central tendency: Median, Mode.
• Interval: Only differences are meaningful, not ratios; a true zero is absent (e.g., Year, Temperature in Celsius: 0°C does not mean there is no temperature). Measure of central tendency: Mean, Median, Mode.
• Ratio: Both differences and ratios are meaningful; a true zero is present (e.g., Height, Weight). Measure of central tendency: Mean, Median, Mode.
Need of Data Pre-processing
• The majority of real-world datasets are highly susceptible to missing, inconsistent, and noisy data due to their heterogeneous origin.

• Applying data mining algorithms to this noisy data would not give quality results, as they would fail to identify patterns effectively.

• Data preprocessing is, therefore, important to improve the overall data quality.

• Duplicate or missing values may give an incorrect view of the overall statistics of the data.

• Outliers and inconsistent data points often tend to disturb the model’s overall learning, leading to false predictions.
Need of Data Pre-processing Cont..
The quality can be checked by the following:

Accuracy: whether the data entered is correct.
Completeness: whether all required data is recorded and available.
Consistency: whether copies of the same data kept in different places match.
Timeliness: whether the data is kept up to date.
Believability: whether the data is trustworthy.
Interpretability: how easily the data can be understood.
Steps in Data Preprocessing in Machine Learning
Data Pre-processing Methods
1) Data Cleaning
Data Cleaning is done as part of data preprocessing to clean the data by filling in missing values, smoothing noisy data, resolving inconsistencies, and removing outliers.

1) Missing values: Here are a few ways to solve this issue:

a) Ignore those tuples

This method should be considered when the dataset is huge and numerous missing values are present within a tuple.

b) Fill in the missing values

There are many methods to achieve this, such as filling in the values manually, predicting the missing values using a regression method, or using numerical methods like the attribute mean.
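Filling missing values with the attribute mean can be sketched in plain Python as below (the function name and sample data are illustrative, not from the slides):

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# ages with two missing entries; the observed mean is (25 + 31 + 40 + 28) / 4 = 31
ages = [25, None, 31, 40, None, 28]
print(impute_mean(ages))  # → [25, 31, 31, 40, 31, 28]
```

Libraries such as pandas offer the same idea as a one-liner (e.g., filling a column with its mean), but the logic is the same: compute a statistic over the observed values and substitute it for the gaps.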
Data Cleaning Cont..
2. Noisy Data

This involves removing random error or variance in a measured variable. It can be done with the help of the following techniques.

a) Binning

This technique works on sorted data values to smoothen any noise present in them. The data is divided into equal-sized bins, and each bin/bucket is dealt with independently. All data in a segment can be replaced by its mean, median, or boundary values.
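Smoothing by bin means can be sketched as follows (a minimal illustration; the function name and price values are made up for the example):

```python
def smooth_by_bin_means(values, n_bins):
    """Sort values, split into equal-sized bins, replace each value by its bin mean."""
    data = sorted(values)
    size = len(data) // n_bins
    smoothed = []
    for i in range(n_bins):
        # the last bin absorbs any leftover values when the split is not exact
        bin_ = data[i * size:(i + 1) * size] if i < n_bins - 1 else data[i * size:]
        bin_mean = sum(bin_) / len(bin_)
        smoothed.extend([bin_mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# → [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Replacing the mean with `min(bin_)`/`max(bin_)` for each value would give smoothing by bin boundaries instead.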
Data Cleaning Cont..
b) Regression

This data mining technique is generally used for prediction. It helps to smoothen noise by fitting all the data points to a regression function. A linear regression equation is used if there is only one independent attribute; otherwise, polynomial equations are used.

c) Clustering

This creates groups/clusters from data having similar values. The values that do not lie in any cluster can be treated as noisy data and can be removed.
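Regression-based smoothing with a single independent attribute can be sketched with an ordinary least-squares line fit, replacing each noisy y-value by the fitted value (names and data are illustrative):

```python
def linear_fit(xs, ys):
    """Least-squares fit of y = a + b*x for one independent attribute."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def smooth_with_regression(xs, ys):
    """Replace each observed y by the value predicted by the fitted line."""
    a, b = linear_fit(xs, ys)
    return [a + b * x for x in xs]

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]  # noisy points scattered around a straight line
print(smooth_with_regression(xs, ys))
```

The smoothed values all lie exactly on the fitted line, so the random scatter around it is removed.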
Data Integration
• Data Integration is one of the data preprocessing steps used to merge the data present in multiple sources into a single larger data store, like a data warehouse.
• Data Integration is needed especially when we are aiming to solve a real-world scenario, like detecting the presence of nodules from CT scan images.
• The only option is to integrate the images from multiple medical nodes to form a larger database.
Data Integration Cont..
We might run into some issues while adopting Data Integration as one of the Data Preprocessing steps:

• Schema integration and object matching: the data can be present in different formats and with different attributes, which might cause difficulty in data integration.
• Removing redundant attributes from all data sources.
• Detection and resolution of data value conflicts.
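A toy sketch of integrating records from two sources on a shared key (the record schema and the rule that the first source wins on conflicting attributes are assumptions for the example):

```python
def integrate(source_a, source_b, key):
    """Merge records from two sources on a shared key. When both sources
    define the same attribute, source_a's value overwrites source_b's
    (a simple conflict-resolution rule)."""
    merged = {}
    for rec in source_b:
        merged[rec[key]] = dict(rec)
    for rec in source_a:
        merged.setdefault(rec[key], {}).update(rec)
    return list(merged.values())

# hypothetical patient records held on two different medical nodes
node1 = [{"id": 1, "name": "A", "age": 60}]
node2 = [{"id": 1, "scan": "ct_001.png"}, {"id": 2, "scan": "ct_002.png"}]
print(integrate(node1, node2, "id"))
```

In practice this is what a database join or `pandas.merge` does at scale; schema matching (deciding that `id` in one source means the same thing as, say, `patient_id` in another) has to happen before such a merge can run.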
Data Transformation
Once data cleaning has been done, we need to consolidate the quality data into alternate forms by changing the value, structure, or format of the data using the below-mentioned Data Transformation strategies.

a) Generalization

The low-level or granular data that we have is converted to high-level information by using concept hierarchies. For example, we can transform primitive data in an address, like the city, to higher-level information, like the country.
Data Transformation Cont..
b) Normalization

This is the most important and most widely used Data Transformation technique. Numerical attributes are scaled up or down to fit within a specified range. In this approach, we constrain a data attribute to a particular range so that different attributes become comparable. Normalization can be done in multiple ways, which are highlighted here:

Min-max normalization
Z-score normalization
Decimal scaling normalization
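The three normalization strategies can be sketched in plain Python as below (function names and the sample marks are illustrative):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10^j, where j is the digit count of the largest magnitude."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

marks = [10, 20, 30, 40, 50]
print(min_max(marks))          # → [0.0, 0.25, 0.5, 0.75, 1.0]
print(decimal_scaling(marks))  # → [0.1, 0.2, 0.3, 0.4, 0.5]
```

Min-max is sensitive to outliers (one extreme value compresses everything else), which is why z-score normalization is often preferred when outliers are present.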
Data Transformation Cont..
c) Attribute Selection

New properties of data are created from existing attributes to help in the data mining process. For example, the date_of_birth attribute can be transformed into another property, like is_senior_citizen, for each tuple, which can directly influence predicting diseases, chances of survival, etc.

d) Aggregation

This is a method of storing and presenting data in a summary format. For example, sales data can be aggregated and transformed to show totals per month and year.
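Both ideas can be sketched briefly; the record layout, the reference date, and the 60-year senior threshold are assumptions made for the example:

```python
from datetime import date

def add_is_senior_citizen(records, today=date(2024, 1, 1)):
    """Derive a new is_senior_citizen attribute from date_of_birth."""
    for r in records:
        age = (today - r["date_of_birth"]).days // 365
        r["is_senior_citizen"] = age >= 60
    return records

def monthly_sales(sales):
    """Aggregate (date, amount) pairs into per-(year, month) totals."""
    totals = {}
    for d, amount in sales:
        key = (d.year, d.month)
        totals[key] = totals.get(key, 0) + amount
    return totals

sales = [(date(2023, 1, 5), 100), (date(2023, 1, 20), 50), (date(2023, 2, 3), 70)]
print(monthly_sales(sales))  # → {(2023, 1): 150, (2023, 2): 70}
```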
Data Reduction
• The size of the dataset in a data warehouse can be too large to be handled by data analysis and data mining algorithms.

• One possible solution is to obtain a reduced representation of the dataset that is much smaller in volume but produces the same quality of analytical results.
Data Reduction Cont..
Various Data Reduction strategies:

a) Data cube aggregation

This is a way of data reduction in which the gathered data is expressed in a summary form.

b) Dimensionality reduction

In dimensionality reduction, irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed. These techniques are used to perform feature extraction. The dimensionality of a dataset refers to the attributes or individual features of the data. This technique aims to reduce the number of redundant features we consider in machine learning algorithms.
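A minimal sketch of detecting and removing redundant attributes: drop columns that are constant (carry no information) or exact duplicates of an earlier column. This is a deliberately simple redundancy check; real dimensionality reduction also uses correlation analysis or techniques like PCA, and all names and values below are made up:

```python
def drop_redundant(columns):
    """columns: dict mapping attribute name -> list of values.
    Keep a column only if it varies and is not a duplicate of a kept one."""
    kept, seen = {}, set()
    for name, vals in columns.items():
        key = tuple(vals)
        if len(set(vals)) <= 1 or key in seen:  # constant or exact duplicate
            continue
        seen.add(key)
        kept[name] = vals
    return kept

data = {
    "height_cm":      [170, 182, 165],
    "height_cm_copy": [170, 182, 165],   # duplicate of height_cm
    "country":        ["IN", "IN", "IN"],  # constant, carries no information
    "weight_kg":      [65, 80, 58],
}
print(list(drop_redundant(data)))  # → ['height_cm', 'weight_kg']
```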
Data Reduction Cont..
c) Data compression

In data compression, encoding mechanisms are used to reduce the dataset size. Methods used for data compression include the Wavelet Transform and Principal Component Analysis.

d) Discretization

Data discretization is used to divide attributes of a continuous nature into data with intervals. This is done because continuous features tend to have a smaller chance of correlation with the target variable, and the results may thus be harder to interpret. After discretizing a variable, groups corresponding to the target can be interpreted. For example, the attribute age can be discretized into bins like below 18, 18-44, 44-60, and above 60.
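The age example above can be sketched directly (the exact boundary handling, with 44 and 60 falling in the lower bin, is an assumption; the slides do not specify it):

```python
def discretize_age(age):
    """Map a continuous age to one of the interval labels used above."""
    if age < 18:
        return "below 18"
    if age <= 44:
        return "18-44"
    if age <= 60:
        return "44-60"
    return "above 60"

ages = [12, 30, 44, 59, 61]
print([discretize_age(a) for a in ages])
# → ['below 18', '18-44', '18-44', '44-60', 'above 60']
```

Libraries typically generalize this: for example, pandas provides `pd.cut` for fixed-width intervals like these and `pd.qcut` for equal-frequency bins.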
Data Reduction Cont..
e) Numerosity reduction

The data can be represented as a model or equation, like a regression model. Storing the model in place of the full dataset removes the burden of storing huge amounts of raw data.

f) Attribute subset selection

It is very important to be specific in the selection of attributes. Otherwise, it might lead to high-dimensional data, which is difficult to train due to underfitting/overfitting problems. Only attributes that add more value towards model training should be considered, and the rest can be discarded.
Data Quality Assessment
Data Quality Assessment includes the statistical approaches one needs to follow to ensure that the data has no issues. Data is to be used for operations, customer management, marketing analysis, and decision making; hence it needs to be of high quality.

The main components of Data Quality Assessment include:

• Completeness, with no missing attribute values
• Accuracy and reliability in terms of information
• Consistency across all features
• Validity of the data
• Absence of any redundancy
Data Quality Assessment Cont..
The Data Quality Assurance process involves three main activities.

Data profiling: exploring the data to identify data quality issues. Once the issues are analyzed, the data is summarized in terms of the duplicates, blank values, etc. that were identified.

Data cleaning: fixing the identified data issues.

Data monitoring: maintaining the data in a clean state and continuously checking that business needs are being satisfied by the data.
