ML (Prac1)
Data preprocessing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and most crucial step in creating a machine learning
model.
When creating a machine learning project, we do not always come across clean, formatted
data. And before doing any operation with data, it is mandatory to clean it and put it in a
formatted way. For this, we use the data preprocessing task.
Real-world data generally contains noise and missing values, and may be in an unusable
format that cannot be fed directly to machine learning models. Data preprocessing is the
required task for cleaning the data and making it suitable for a machine learning model,
which also increases the accuracy and efficiency of the model.
1) Get the Dataset
To create a machine learning model, the first thing we require is a dataset, as a machine
learning model works entirely on data. The data collected for a particular problem and put in
a proper format is known as the dataset.
Datasets come in different formats for different purposes; for example, the dataset for a
business problem will be different from the dataset required for a liver-patient problem, so
each dataset is different from every other. To use the dataset in our code, we usually put it
into a CSV file. However, sometimes we may also need to use an HTML or xlsx file.
What is a CSV File?
CSV stands for "Comma-Separated Values"; it is a file format which allows us to save
tabular data, such as spreadsheets. It is useful for huge datasets, and these datasets can be
used directly in programs.
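As an illustration, a small CSV with the same columns as the dataset used below might look like this (the values here are made up; the real Dataset.csv may differ):
Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,,No
The empty Salary field in the last row is how a missing value typically appears in a CSV file.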
Here we will use a demo dataset for data preprocessing; for practice, it can be downloaded
from https://www.superdatascience.com/pages/machine-learning. For real-world problems,
we can download datasets online from various sources such
as https://www.kaggle.com/uciml/datasets, https://archive.ics.uci.edu/ml/index.php, etc.
We can also create our own dataset by gathering data using various APIs with Python and
putting that data into a .csv file.
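A minimal sketch of that workflow, assuming a hypothetical JSON API endpoint that returns a list of records:
import requests
import pandas as pd

# fetch records from the (hypothetical) API
response = requests.get('https://api.example.com/records')
df = pd.DataFrame(response.json())

# save the gathered data as a CSV file for later preprocessing
df.to_csv('Dataset.csv', index=False)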
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs. There are three
specific libraries that we will use for data preprocessing, which are:
Numpy: The Numpy Python library is used for including any type of mathematical operation
in the code. It is the fundamental package for scientific computation in Python. It also
supports large, multidimensional arrays and matrices. In Python, we can import it as:
import numpy as nm
Here we have used nm, which is a short name for Numpy, and it will be used throughout the
program.
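As a quick sanity check of the alias (with toy values):
ages = nm.array([38, 43, 30, 48])   # a small toy array
print(nm.mean(ages))                # 39.75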
Matplotlib: The second library is matplotlib, a Python 2D plotting library; along with it, we
need to import its sub-library pyplot, which is used to plot any type of chart in Python. It
will be imported as below:
import matplotlib.pyplot as mpt
Here we have used mpt as a short name for pyplot.
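As a quick illustration, a simple chart with toy data can then be drawn as:
mpt.plot([1, 2, 3, 4], [1, 4, 9, 16])   # toy data points
mpt.xlabel('x')
mpt.ylabel('y')
mpt.show()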
Pandas: The last library is the Pandas library, which is one of the most famous Python
libraries and is used for importing and managing datasets. It is an open-source data
manipulation and analysis library. It will be imported as below:
import pandas as pd
Here, we have used pd as a short name for this library.
3) Importing the Datasets
Now we need to import the datasets which we have collected for our machine learning
project. But before importing a dataset, we need to set the current directory as the working
directory. In Spyder IDE, this amounts to saving the Python file in the folder that contains
the required dataset and selecting that folder in the File explorer pane. Once this is done, the
current folder is set as the working directory.
read_csv() function:
Now, to import the dataset, we will use the read_csv() function of the pandas library, which
is used to read a csv file and perform various operations on it. Using this function, we can
read a csv file locally as well as through a URL.
data_set= pd.read_csv('Dataset.csv')
Here, data_set is the name of the variable that stores our dataset, and inside the function we
have passed the name of our dataset file. Once we execute the above line of code, it will
successfully import the dataset into our code. We can also check the imported dataset by
opening the Variable Explorer section and double-clicking on data_set.
In the Variable Explorer view, indexing starts from 0, which is the default indexing in
Python. We can also change the display format of our dataset by clicking on the format
option.
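Since read_csv() accepts a URL as well as a local path, the same dataset could also be loaded remotely; a sketch with a hypothetical URL:
data_set= pd.read_csv('https://example.com/data/Dataset.csv')   # hypothetical URL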
To extract the independent variables, we will use the iloc[ ] method of the Pandas library. It
is used to extract the required rows and columns from the dataset.
x= data_set.iloc[:,:-1].values
In the above code, the first colon (:) is used to take all the rows, and the second colon (:) is
for all the columns. Here we have used :-1 because we don't want to take the last column, as
it contains the dependent variable. By doing this, we will get the matrix of features.
y= data_set.iloc[:,3].values
Here we have taken all the rows with the last column only (column index 3). It will give the
array of the dependent variable.
Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
4) Handling Missing Data
The next step of data preprocessing is to handle missing data in the dataset. If our dataset
contains missing values, it may create a huge problem for our machine learning model.
Hence it is necessary to handle the missing values present in the dataset.
There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal with null values.
Here, we just delete the specific row or column which consists of null values. But this way is
not very efficient, and removing data may lead to a loss of information which will not give
an accurate output.
By calculating the mean: In this way, we calculate the mean of the column or row which
contains the missing value and put it in place of the missing value. This strategy is useful for
features which have numeric data, such as age, salary, year, etc. Here, we will use this
approach.
To handle missing values, we will use the Scikit-learn library in our code, which contains
various tools for building machine learning models. Here we will use the SimpleImputer
class of the sklearn.impute module (in older versions of scikit-learn, this functionality was
provided by the Imputer class of sklearn.preprocessing). Below is the code for it:
#handling missing data (Replacing missing data with the mean value)
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values=nm.nan, strategy='mean')
#Fitting imputer object to the independent variables x (Age and Salary columns)
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])
Output: the missing Age and Salary values in x are replaced by their column means (about
41.11 for Age and 65222.22 for Salary, as can be seen in the array printed in the next step).
5) Encoding Categorical Data
Categorical data is data which has some categories; in our dataset, there are two categorical
variables, Country and Purchased.
Since a machine learning model works entirely on mathematics and numbers, a categorical
variable in the dataset may create trouble while building the model. So it is necessary to
encode these categorical variables into numbers.
Firstly, we will convert the Country variable into numeric form. To do this, we will use
the LabelEncoder() class from the preprocessing library.
#Categorical data
#for Country variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
Output:
Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)
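The dependent variable Purchased is categorical as well ('Yes'/'No'), so it can be encoded in the same way; a minimal sketch:
#for Purchased variable
label_encoder_y= LabelEncoder()
y= label_encoder_y.fit_transform(y)
Note that label encoding gives the Country column an arbitrary numeric order (0, 1, 2); when that implied order is undesirable, a one-hot (dummy-variable) encoding such as scikit-learn's OneHotEncoder is often used instead.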
6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and a test
set. This is one of the crucial steps of data preprocessing, as it lets us measure how well our
machine learning model performs on data it has not seen.
Suppose we train our machine learning model on one dataset and then test it on a completely
different dataset. Then our model will have difficulty, because the correlations it learned
from the training data may not hold in the new data.
If we train our model very well and its training accuracy is very high, but its performance
drops when we provide a new dataset, the model has overfit the training data. So we always
try to make a machine learning model which performs well with the training set and also
with the test dataset. Here, we can define these datasets as:
Training Set: A subset of the dataset used to train the machine learning model; we already
know the outputs for it.
Test set: A subset of the dataset used to test the machine learning model; the model predicts
the outputs for the test set.
For splitting the dataset, we will use the below lines of code:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
Explanation:
o In the above code, the first line is used for splitting arrays of the dataset into random
train and test subsets.
o In the second line, we have used four variables for our output:
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variable for the training data
o y_test: dependent variable for the testing data
o In the train_test_split() function, we have passed four parameters, of which the first two
are the arrays of data, and test_size specifies the size of the test set. The test_size may be
0.5, 0.3, or 0.2, which gives the dividing ratio between the training and testing sets.
o The last parameter, random_state, sets a seed for the random generator so that you
always get the same split; the most used value for it is 42.
Output:
By executing the above code, we will get 4 different variables, which can be seen under the
Variable Explorer section: the x and y data are divided into 4 variables with the
corresponding values.
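A quick way to verify the split is to print the shapes of the four variables (for the 10-row dataset used here):
print(x_train.shape, x_test.shape)   # (8, 3) (2, 3) with test_size=0.2
print(y_train.shape, y_test.shape)   # (8,) (2,)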
7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique
to standardize the independent variables of the dataset so that they fall within a specific
range. In feature scaling, we put our variables on the same scale so that no single variable
dominates the others.
As we can see, the Age and Salary column values are not on the same scale. Many machine
learning models are based on Euclidean distance, and if we do not scale the variables, this
will cause issues in our machine learning model.
If we compute the distance between any two data points using Age and Salary, the salary
values will dominate the age values because of their much larger magnitude, and it will
produce a distorted result. So, to remove this issue, we need to perform feature scaling.
There are two common ways to do it:
Standardization: x' = (x - mean(x)) / standard deviation(x)
Normalization: x' = (x - min(x)) / (max(x) - min(x))
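In scikit-learn, these two approaches are implemented by the StandardScaler and MinMaxScaler classes; a minimal sketch with toy values (not from the dataset):
from sklearn.preprocessing import StandardScaler, MinMaxScaler
toy= nm.array([[30.0], [38.0], [43.0], [50.0]])   # toy ages for illustration
print(StandardScaler().fit_transform(toy))        # standardization: (x - mean) / std
print(MinMaxScaler().fit_transform(toy))          # normalization: (x - min) / (max - min)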
For feature scaling, we will import the StandardScaler class of the sklearn.preprocessing
library as:
from sklearn.preprocessing import StandardScaler
Now, we will create the object of the StandardScaler class for the independent variables
(features), fit it on the training set, and then transform both the training and test sets:
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)   #fit on the training data, then scale it
x_test= st_x.transform(x_test)         #scale the test data with the training statistics
Note that we call only transform() on x_test, so the test set is scaled using the mean and
standard deviation learned from the training set.
Output:
By executing the above lines of code, we will get the scaled values for x_train and x_test.
All the variables are now on a comparable scale, centred around 0 and falling roughly
between -1 and 1, so no single variable dominates the others.