
Mapua University

Data and Its Processing
CS158 - 1 Artificial Intelligence
School of Information Technology
Raymond B. Sedilla, MSIT
Understanding Data Processing
Data processing is converting data from a given form into a more usable and
desired format, i.e., making it more meaningful and informative. This entire
process can be automated using machine learning algorithms, mathematical
modeling, and statistical knowledge. The output of this complete process can
take any desired form, such as graphs, videos, charts, tables, or images,
depending on the task we are performing and the requirements of the machine.
Understanding Data Processing
The most crucial step when starting with ML is to have data of good
quality and accuracy. A huge amount of capital, time, and resources
is consumed in collecting data.

The collected data is often in a raw form that cannot be fed
directly to the machine.

The prepared data may still not be machine-readable, so conversion
algorithms are needed to transform it into a readable form.

Next, algorithms and ML techniques are required to carry out the
instructions over a large volume of data with accuracy and optimal
computation.

Finally, results are produced by the machine in a meaningful form
that can easily be interpreted by the user.
Data Preprocessing in Python
Preprocessing refers to the transformations applied to our data before feeding it to the
algorithm; it is a technique used to convert raw data into a clean data set.
The Need for Data Preprocessing
To achieve better results from the applied model in machine learning
projects, the data must be in a proper format. Certain machine learning
models need information in a specific form;
The Random Forest algorithm, for example, does not support null values,
so null values must be handled in the original raw data set before
the algorithm can be executed.

Another consideration is that the data set should be formatted so that
more than one machine learning or deep learning algorithm can be executed
on the same data set, and the best among them chosen.
Data preprocessing techniques for
machine learning.
1. Rescale Data
a. When our data comprises attributes with varying scales, many
machine learning algorithms can benefit from rescaling the
attributes so that they all have the same scale.
b. This is useful for the optimization algorithms at the core of
machine learning, such as gradient descent.
c. It is also helpful for algorithms that weight inputs, like regression and
neural networks, and algorithms that use distance measures, like K-
Nearest Neighbors.
d. We can rescale data in scikit-learn using the MinMaxScaler class,
as sketched below.
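A minimal sketch of rescaling with MinMaxScaler; the sample values are made
up for illustration:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Two attributes on very different scales (hypothetical values).
    X = np.array([[50.0, 0.001],
                  [100.0, 0.050],
                  [75.0, 0.002]])

    scaler = MinMaxScaler(feature_range=(0, 1))
    X_scaled = scaler.fit_transform(X)  # each column is rescaled to [0, 1]
    print(X_scaled)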
Data preprocessing techniques for
machine learning.
2. Binarize Data (Make Binary)
a. We can transform our data using a binary threshold: all values above
the threshold are marked 1, and all values equal to or below it are
marked 0.
b. This is called binarizing (or thresholding) your data. It can be
useful when you have probabilities that you want to turn into crisp
values, and it is also helpful in feature engineering when you want to
add new features that indicate something meaningful.
c. We can create new binary attributes in Python using scikit-learn's
Binarizer class, as sketched below.
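A minimal sketch of thresholding with Binarizer; the threshold of 0.5 is an
assumed example value:

    import numpy as np
    from sklearn.preprocessing import Binarizer

    X = np.array([[0.2, 0.7],
                  [0.9, 0.4]])

    binarizer = Binarizer(threshold=0.5)
    X_binary = binarizer.fit_transform(X)  # values > 0.5 become 1, the rest 0
    print(X_binary)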
Data preprocessing techniques for
machine learning.
3. Standardize Data
a. Standardization is a useful technique for transforming attributes that
have a Gaussian distribution with differing means and standard
deviations into a standard Gaussian distribution with a mean of 0 and
a standard deviation of 1.
b. We can standardize data in scikit-learn using the StandardScaler
class, as sketched below.
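A minimal sketch of standardization with StandardScaler; the sample values
are made up:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[170.0, 65.0],
                  [160.0, 72.0],
                  [180.0, 80.0]])

    scaler = StandardScaler()
    X_std = scaler.fit_transform(X)  # each column now has mean 0 and std 1
    print(X_std)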
Overview of Data Cleaning
Data cleaning is one of the most important parts of machine learning and
plays a significant role in building a model. It surely isn't the fanciest
part of machine learning, and there aren't any hidden tricks or secrets to
uncover; however, the success or failure of a project relies on proper data
cleaning. Professional data scientists usually invest a very large portion
of their time in this step, following the belief that "better data beats
fancier algorithms".
Steps involved in Data Cleaning:
1. Removal of unwanted observations - This includes deleting duplicate/
redundant or irrelevant values from your dataset. Duplicate observations
most frequently arise during data collection, while irrelevant observations
are those that don't actually fit the specific problem you're trying to
solve. A pandas sketch follows this list.
a. Redundant observations reduce efficiency to a great extent: because
the data repeats, it may add weight to the correct side or to the
incorrect side, producing unreliable results.
b. Irrelevant observations are any type of data that is of no use to us
and can be removed directly.
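A minimal sketch of removing duplicates and irrelevant rows with pandas; the
column names and values are hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "pet": ["dog", "dog", "cat", "fish"],
        "color": ["red", "red", "green", "blue"],
    })

    df = df.drop_duplicates()        # remove exact duplicate rows
    df = df[df["pet"] != "fish"]     # remove rows irrelevant to the problem
    print(df)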
Steps involved in Data Cleaning:
2. Fixing structural errors - Errors that arise during measurement,
transfer of data, or other similar situations are called structural errors.
They include typos in feature names, the same attribute appearing under
different names, mislabeled classes (i.e., separate classes that should
really be the same), and inconsistent capitalization.
a. For example, the model will treat "America" and "america" as
different classes or values even though they represent the same value,
or treat red, yellow, and red-yellow as different classes or attributes
even though one class can be included in the other two. Such structural
errors make our model inefficient and give poor-quality results. A
sketch of fixing inconsistent capitalization follows.
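A minimal sketch of normalizing capitalization and whitespace with pandas;
the column and values are hypothetical:

    import pandas as pd

    df = pd.DataFrame({"country": ["America", "america", " AMERICA "]})

    # Strip stray whitespace and normalize case so the three spellings
    # collapse into a single class.
    df["country"] = df["country"].str.strip().str.lower()
    print(df["country"].unique())  # ['america']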
Steps involved in Data Cleaning:
3. Managing unwanted outliers - Outliers can cause problems with certain
types of models. For example, linear regression models are less robust to
outliers than decision tree models. Generally, we should not remove
outliers unless we have a legitimate reason to do so: sometimes removing
them improves performance, sometimes not. A good reason might be suspicious
measurements that are unlikely to be part of the real data; a sketch of
flagging such values follows.
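A minimal sketch of flagging suspicious values with the interquartile-range
(IQR) rule; the rule and sample values are assumptions, since the slides do
not prescribe a specific method:

    import numpy as np

    values = np.array([10, 12, 11, 13, 95, 12, 14])
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1

    # Keep values within 1.5 * IQR of the quartiles; 95 is flagged as
    # a suspicious measurement.
    mask = (values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)
    print(values[mask])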
Steps involved in Data Cleaning:
4. Handling missing data - Missing data is a deceptively tricky issue in
machine learning. We cannot simply ignore or remove missing observations;
they must be handled carefully, as they can be an indication of something
important. The two most common ways to deal with missing data, both
sketched after this list, are:
a. Dropping observations with missing values.
However, the fact that a value was missing may be informative in
itself. Plus, in the real world, you often need to make predictions
on new data even if some of the features are missing!
b. Imputing the missing values from past observations.
Again, "missingness" is almost always informative in itself, and
you should tell your algorithm if a value was missing.
Even if you build a model to impute your values, you're not adding
any real information; you're just reinforcing the patterns already
provided by other features.
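A minimal sketch of both approaches with pandas, including a flag that tells
the algorithm a value was missing; the column names and values are
hypothetical:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"height": [1.70, np.nan, 1.82],
                       "weight": [65.0, 72.0, np.nan]})

    # (a) Drop observations with missing values.
    dropped = df.dropna()

    # (b) Flag missingness first, then impute with the column mean.
    df["height_was_missing"] = df["height"].isna()
    df["height"] = df["height"].fillna(df["height"].mean())
    print(df)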
Some Data Cleaning Tools
OpenRefine - Similar to spreadsheet applications and able to handle
spreadsheet file formats such as CSV, but it behaves more like a
database.
Trifacta Wrangler - An open, interactive cloud platform for data
engineers and analysts to collaboratively profile, prepare, and
pipeline data for analytics and machine learning.
TIBCO Clarity - A data preparation tool that offers on-demand services
from the web in the form of Software-as-a-Service.
Cloudingo - A data cleansing app with an Undo button, marketed as the
only such app offering "peace of mind."
IBM InfoSphere QualityStage - Part of IBM's Information Server, a
leading data integration platform that helps you more easily
understand, cleanse, monitor, and transform data.
Feature Scaling
A technique to standardize the independent features present in the data
within a fixed range. It is performed during data preprocessing to handle
highly varying magnitudes, values, or units. If feature scaling is not
done, a machine learning algorithm tends to treat greater values as higher
and smaller values as lower, regardless of the units of those values.
Feature Scaling (Example)
If an algorithm does not use feature scaling, it can consider the value
3000 (meters) to be greater than 5 (kilometers), which is actually not
true, and in that case the algorithm will give wrong predictions. So we use
feature scaling to bring all values to the same magnitude and thus tackle
this issue, as sketched below.
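A minimal sketch of the slide's example: 3000 meters only looks larger than
5 kilometers until both are expressed in a common unit and rescaled:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    raw = np.array([3000.0, 5.0])        # mixed units: meters vs. kilometers
    meters = np.array([3000.0, 5000.0])  # the same distances in a common unit

    scaled = MinMaxScaler().fit_transform(meters.reshape(-1, 1)).ravel()
    print(scaled)  # [0. 1.] -- 5 km correctly ranks above 3000 m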
What is Categorical Data?
Categorical data are variables that contain label values rather than
numeric values.
The number of possible values is often limited to a fixed set.
Categorical variables are often called nominal.
Some examples include:
A “pet” variable with the values “dog” and “cat”.
A “color” variable with the values “red”, “green”, and “blue”.
A “place” variable with the values “first”, “second”, and “third”.
Each value represents a different category.
Some categories may have a natural relationship to each other, such as
a natural ordering.
The “place” variable above does have a natural ordering of values. This
type of categorical variable is called an ordinal variable.
What is the Problem with
Categorical Data?
Some algorithms can work with categorical data directly.
For example, a decision tree can be learned directly from categorical
data with no data transform required (this depends on the specific
implementation).
Many machine learning algorithms cannot operate on label data
directly. They require all input variables and output variables to be
numeric.
How to Convert Categorical Data to
Numerical Data?
There are two common ways to convert categorical data to numerical data:
1. Label encoding (integer encoding)
2. One-hot encoding
Both are described in the following sections.
Label Encoding
Label encoding refers to converting the labels into a numeric form so as to
make them machine-readable. Machine learning algorithms can then better
decide how those labels should be operated on. It is an important pre-
processing step for structured datasets in supervised learning.
Suppose we have a column Height in some dataset, with values such as Tall,
Medium, and Short; label encoding maps each distinct value to an integer,
as sketched below.
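A minimal sketch with scikit-learn's LabelEncoder; the Tall/Medium/Short
values are assumed for illustration:

    from sklearn.preprocessing import LabelEncoder

    heights = ["Tall", "Medium", "Short", "Tall", "Short"]

    encoder = LabelEncoder()
    encoded = encoder.fit_transform(heights)

    # Classes are numbered alphabetically: Medium -> 0, Short -> 1, Tall -> 2.
    print(list(encoder.classes_), encoded)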
Limitation of Label Encoding
Label encoding converts the data into machine-readable form, but it assigns
a unique number (starting from 0) to each class of data. This may lead to
priority issues during training: a label with a high value may be
considered to have higher priority than a label with a lower value.
One-Hot Encoding
For categorical variables where no such ordinal relationship exists, the
integer encoding is not enough.
In fact, using this encoding and allowing the model to assume a natural
ordering between categories may result in poor performance or
unexpected results (predictions halfway between categories).
In this case, a one-hot encoding can be applied to the integer
representation. This is where the integer encoded variable is removed
and a new binary variable is added for each unique integer value.
One-Hot Encoding
In the “color” variable example, there are 3 categories, and therefore 3
binary variables are needed. A “1” value is placed in the binary variable
for the color in question and “0” values for the other colors, as sketched
below.

The binary variables are often called “dummy variables” in other fields,
such as statistics.
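A minimal sketch of one-hot encoding the “color” example with scikit-learn's
OneHotEncoder (the sparse_output flag requires scikit-learn 1.2 or newer;
older versions use sparse=False):

    from sklearn.preprocessing import OneHotEncoder

    colors = [["red"], ["green"], ["blue"], ["green"]]

    encoder = OneHotEncoder(sparse_output=False)
    onehot = encoder.fit_transform(colors)

    print(encoder.categories_[0])  # ['blue' 'green' 'red']
    print(onehot)                  # one binary column per color, one 1 per row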
