
Mapua University

Data and Its Processing
CS158 - 1 Artificial Intelligence
School of Information Technology
Raymond B. Sedilla, MSIT
Understanding Data Processing
Data processing is converting data from a given form into a more usable and
desired format, i.e., making it more meaningful and informative. This entire
process can be automated using machine learning algorithms, mathematical
modeling, and statistical knowledge. The output of this complete process can
take any desired form, such as graphs, videos, charts, tables, or images,
depending on the task we are performing and the requirements of the machine.
Understanding Data Processing
The most crucial step when starting with ML is to have data of good
quality and accuracy. A huge amount of capital, time, and resources
is consumed in collecting data.

The collected data is often in a raw form that cannot be fed
directly to the machine.

The prepared data may still not be machine-readable, so conversion
algorithms are needed to transform it into a readable form.

Next, algorithms and ML techniques are required to carry out the
instructions over a large volume of data with accuracy and optimal
computation.

Finally, results are produced by the machine in a meaningful form
that can easily be interpreted by the user.
Data Preprocessing in Python
Preprocessing refers to the transformations applied to our data before feeding it to the
algorithm; it is a technique used to convert raw data into a clean data set.
The Need for Data Preprocessing
To achieve better results from the applied model in machine learning
projects, the data must be in a proper format. Certain machine learning
models need information in a specific form;
The Random Forest algorithm, for example, does not support null values,
so null values must be handled in the original raw data set before
the algorithm can be executed.

Another consideration is that the data set should be formatted so that
more than one machine learning or deep learning algorithm can be executed
on the same data set, and the best among them chosen.
Data preprocessing techniques for
machine learning.
1. Rescale Data
a. When our data comprises attributes with varying scales, many
machine learning algorithms can benefit from rescaling the
attributes so that they all have the same scale.
b. This is useful for the optimization algorithms at the core of
machine learning, such as gradient descent.
c. It is also helpful for algorithms that weight inputs, like regression and
neural networks, and algorithms that use distance measures, like K-
Nearest Neighbors.
d. We can rescale data in scikit-learn using the MinMaxScaler class,
as sketched below.
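A minimal sketch of rescaling with MinMaxScaler; the sample values are made
up for illustration:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Two attributes on very different scales (hypothetical values).
    X = np.array([[50.0, 0.001],
                  [100.0, 0.050],
                  [75.0, 0.002]])

    scaler = MinMaxScaler(feature_range=(0, 1))
    X_scaled = scaler.fit_transform(X)  # each column is rescaled to [0, 1]
    print(X_scaled)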
Data preprocessing techniques for
machine learning.
2. Binarize Data (Make Binary)
a. We can transform our data using a binary threshold: all values above
the threshold are marked 1, and all values equal to or below it are
marked 0.
b. This is called binarizing (or thresholding) your data. It can be
useful when you have probabilities that you want to turn into crisp
values, and it is also helpful in feature engineering when you want to
add new features that indicate something meaningful.
c. We can create new binary attributes in Python using scikit-learn's
Binarizer class, as sketched below.
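A minimal sketch of thresholding with Binarizer; the threshold of 0.5 is an
assumed example value:

    import numpy as np
    from sklearn.preprocessing import Binarizer

    X = np.array([[0.2, 0.7],
                  [0.9, 0.4]])

    binarizer = Binarizer(threshold=0.5)
    X_binary = binarizer.fit_transform(X)  # values > 0.5 become 1, the rest 0
    print(X_binary)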
Data preprocessing techniques for
machine learning.
3. Standardize Data
a. Standardization is a useful technique for transforming attributes that
have a Gaussian distribution with differing means and standard
deviations into a standard Gaussian distribution with a mean of 0 and
a standard deviation of 1.
b. We can standardize data in scikit-learn using the StandardScaler
class, as sketched below.
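A minimal sketch of standardization with StandardScaler; the sample values
are made up:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[170.0, 65.0],
                  [160.0, 72.0],
                  [180.0, 80.0]])

    scaler = StandardScaler()
    X_std = scaler.fit_transform(X)  # each column now has mean 0 and std 1
    print(X_std)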
Overview of Data Cleaning
Data cleaning is one of the most important parts of machine learning and
plays a significant role in building a model. It surely isn't the fanciest
part of machine learning, and there aren't any hidden tricks or secrets to
uncover; however, the success or failure of a project relies on proper data
cleaning. Professional data scientists usually invest a very large portion
of their time in this step, following the belief that "better data beats
fancier algorithms".
Steps involved in Data Cleaning:
1. Removal of unwanted observations - This includes deleting duplicate/
redundant or irrelevant values from your dataset. Duplicate observations
most frequently arise during data collection, while irrelevant observations
are those that don't actually fit the specific problem you're trying to
solve. A pandas sketch follows this list.
a. Redundant observations reduce efficiency to a great extent: because
the data repeats, it may add weight to the correct side or to the
incorrect side, producing unreliable results.
b. Irrelevant observations are any type of data that is of no use to us
and can be removed directly.
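A minimal sketch of removing duplicates and irrelevant rows with pandas; the
column names and values are hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "pet": ["dog", "dog", "cat", "fish"],
        "color": ["red", "red", "green", "blue"],
    })

    df = df.drop_duplicates()        # remove exact duplicate rows
    df = df[df["pet"] != "fish"]     # remove rows irrelevant to the problem
    print(df)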
Steps involved in Data Cleaning:
2. Fixing structural errors - Errors that arise during measurement,
transfer of data, or other similar situations are called structural errors.
They include typos in feature names, the same attribute appearing under
different names, mislabeled classes (i.e., separate classes that should
really be the same), and inconsistent capitalization.
a. For example, the model will treat "America" and "america" as
different classes or values even though they represent the same value,
or treat red, yellow, and red-yellow as different classes or attributes
even though one class can be included in the other two. Such structural
errors make our model inefficient and give poor-quality results. A
sketch of fixing inconsistent capitalization follows.
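A minimal sketch of normalizing capitalization and whitespace with pandas;
the column and values are hypothetical:

    import pandas as pd

    df = pd.DataFrame({"country": ["America", "america", " AMERICA "]})

    # Strip stray whitespace and normalize case so the three spellings
    # collapse into a single class.
    df["country"] = df["country"].str.strip().str.lower()
    print(df["country"].unique())  # ['america']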
Steps involved in Data Cleaning:
3. Managing unwanted outliers - Outliers can cause problems with certain
types of models. For example, linear regression models are less robust to
outliers than decision tree models. Generally, we should not remove
outliers unless we have a legitimate reason to do so: sometimes removing
them improves performance, sometimes not. A good reason might be suspicious
measurements that are unlikely to be part of the real data; a sketch of
flagging such values follows.
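A minimal sketch of flagging suspicious values with the interquartile-range
(IQR) rule; the rule and sample values are assumptions, since the slides do
not prescribe a specific method:

    import numpy as np

    values = np.array([10, 12, 11, 13, 95, 12, 14])
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1

    # Keep values within 1.5 * IQR of the quartiles; 95 is flagged as
    # a suspicious measurement.
    mask = (values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)
    print(values[mask])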
Steps involved in Data Cleaning:
4. Handling missing data - Missing data is a deceptively tricky issue in
machine learning. We cannot simply ignore or remove missing observations;
they must be handled carefully, as they can be an indication of something
important. The two most common ways to deal with missing data, both
sketched after this list, are:
a. Dropping observations with missing values.
However, the fact that a value was missing may be informative in
itself. Plus, in the real world, you often need to make predictions
on new data even if some of the features are missing!
b. Imputing the missing values from past observations.
Again, "missingness" is almost always informative in itself, and
you should tell your algorithm if a value was missing.
Even if you build a model to impute your values, you're not adding
any real information; you're just reinforcing the patterns already
provided by other features.
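A minimal sketch of both approaches with pandas, including a flag that tells
the algorithm a value was missing; the column names and values are
hypothetical:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"height": [1.70, np.nan, 1.82],
                       "weight": [65.0, 72.0, np.nan]})

    # (a) Drop observations with missing values.
    dropped = df.dropna()

    # (b) Flag missingness first, then impute with the column mean.
    df["height_was_missing"] = df["height"].isna()
    df["height"] = df["height"].fillna(df["height"].mean())
    print(df)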
Some Data Cleaning Tools
OpenRefine - Similar to spreadsheet applications and able to handle
spreadsheet file formats such as CSV, but it behaves more like a
database.
Trifacta Wrangler - An open, interactive cloud platform for data
engineers and analysts to collaboratively profile, prepare, and
pipeline data for analytics and machine learning.
TIBCO Clarity - A data preparation tool that offers on-demand services
from the web in the form of Software-as-a-Service.
Cloudingo - A data cleansing app with an Undo button, marketed as the
only such app offering "peace of mind."
IBM InfoSphere QualityStage - Part of IBM's Information Server, a
leading data integration platform that helps you more easily
understand, cleanse, monitor, and transform data.
Feature Scaling
A technique to standardize the independent features present in the data
within a fixed range. It is performed during data preprocessing to handle
highly varying magnitudes, values, or units. If feature scaling is not
done, a machine learning algorithm tends to treat greater values as higher
and smaller values as lower, regardless of the units of those values.
Feature Scaling (Example)
If an algorithm does not use feature scaling, it can consider the value
3000 (meters) to be greater than 5 (kilometers), which is actually not
true, and in that case the algorithm will give wrong predictions. So we use
feature scaling to bring all values to the same magnitude and thus tackle
this issue, as sketched below.
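A minimal sketch of the slide's example: 3000 meters only looks larger than
5 kilometers until both are expressed in a common unit and rescaled:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    raw = np.array([3000.0, 5.0])        # mixed units: meters vs. kilometers
    meters = np.array([3000.0, 5000.0])  # the same distances in a common unit

    scaled = MinMaxScaler().fit_transform(meters.reshape(-1, 1)).ravel()
    print(scaled)  # [0. 1.] -- 5 km correctly ranks above 3000 m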
What is Categorical Data?
Categorical data are variables that contain label values rather than
numeric values.
The number of possible values is often limited to a fixed set.
Categorical variables are often called nominal.
Some examples include:
A “pet” variable with the values “dog” and “cat”.
A “color” variable with the values “red”, “green”, and “blue”.
A “place” variable with the values “first”, “second”, and “third”.
Each value represents a different category.
Some categories may have a natural relationship to each other, such as
a natural ordering.
The “place” variable above does have a natural ordering of values. This
type of categorical variable is called an ordinal variable.
What is the Problem with
Categorical Data?
Some algorithms can work with categorical data directly.
For example, a decision tree can be learned directly from categorical
data with no data transform required (this depends on the specific
implementation).
Many machine learning algorithms cannot operate on label data
directly. They require all input variables and output variables to be
numeric.
How to Convert Categorical Data to
Numerical Data?
There are two common ways to convert categorical data to numerical data:
1. Label encoding (integer encoding)
2. One-hot encoding
Both are described in the following sections.
Label Encoding
Label encoding refers to converting the labels into a numeric form so as to
make them machine-readable. Machine learning algorithms can then better
decide how those labels should be operated on. It is an important pre-
processing step for structured datasets in supervised learning.
Suppose we have a column Height in some dataset, with values such as Tall,
Medium, and Short; label encoding maps each distinct value to an integer,
as sketched below.
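A minimal sketch with scikit-learn's LabelEncoder; the Tall/Medium/Short
values are assumed for illustration:

    from sklearn.preprocessing import LabelEncoder

    heights = ["Tall", "Medium", "Short", "Tall", "Short"]

    encoder = LabelEncoder()
    encoded = encoder.fit_transform(heights)

    # Classes are numbered alphabetically: Medium -> 0, Short -> 1, Tall -> 2.
    print(list(encoder.classes_), encoded)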
Limitation of Label Encoding
Label encoding converts the data into machine-readable form, but it assigns
a unique number (starting from 0) to each class of data. This may lead to
priority issues during training: a label with a high value may be
considered to have higher priority than a label with a lower value.
One-Hot Encoding
For categorical variables where no such ordinal relationship exists, the
integer encoding is not enough.
In fact, using this encoding and allowing the model to assume a natural
ordering between categories may result in poor performance or
unexpected results (predictions halfway between categories).
In this case, a one-hot encoding can be applied to the integer
representation. This is where the integer encoded variable is removed
and a new binary variable is added for each unique integer value.
One-Hot Encoding
In the “color” variable example, there are 3 categories, and therefore 3
binary variables are needed. A “1” value is placed in the binary variable
for the color in question and “0” values for the other colors, as sketched
below.

The binary variables are often called “dummy variables” in other fields,
such as statistics.
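A minimal sketch of one-hot encoding the “color” example with scikit-learn's
OneHotEncoder (the sparse_output flag requires scikit-learn 1.2 or newer;
older versions use sparse=False):

    from sklearn.preprocessing import OneHotEncoder

    colors = [["red"], ["green"], ["blue"], ["green"]]

    encoder = OneHotEncoder(sparse_output=False)
    onehot = encoder.fit_transform(colors)

    print(encoder.categories_[0])  # ['blue' 'green' 'red']
    print(onehot)                  # one binary column per color, one 1 per row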
