Lecture 4 Data Pre-Processing

The document provides an overview of a lecture on data pre-processing for a machine learning course. 1) It discusses using Pandas to import, clean, and visualize data. Common techniques like handling missing values, encoding categorical features, and feature scaling are covered. 2) Examples demonstrate loading data from CSV, dropping rows with null values, replacing empty cells, and handling incorrect data. 3) The goal is for students to understand these pre-processing techniques and apply them for cleaning machine learning data.

Uploaded by

choudharynipun69


APEX INSTITUTE OF TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

MACHINE LEARNING (21CSH-286)

Faculty: Prof. (Dr.) Vineet Mehan (E13038)

Lecture – 4 DISCOVER . LEARN . EMPOWER

Data Pre-Processing
Machine Learning: Course Objectives
COURSE OBJECTIVES
The course aims to:
1. Understand and apply various data handling and visualization techniques.
2. Understand some basic learning algorithms and techniques and their applications, as well as general questions related to analysing and handling large data sets.
3. Develop skills in supervised and unsupervised learning techniques and implement them to solve real-life problems.
4. Develop basic knowledge of machine learning techniques for building an intelligent machine that makes decisions on behalf of humans.
5. Develop skills for selecting suitable model parameters and applying them to design optimized machine learning applications.

COURSE OUTCOMES

On completion of this course, students shall be able to:

CO2 Understand data pre-processing techniques and apply these for data cleaning.

Unit-1 Syllabus
Unit-1: Introduction to Machine Learning
• Introduction to Machine Learning: Definition of Machine Learning, Working principles of Machine Learning; Classification of Machine Learning algorithms: Supervised Learning, Unsupervised Learning, Reinforcement Learning, Semi-Supervised Learning; Applications of Machine Learning.
• Data Pre-Processing and Feature Extraction: Data Sourcing and Cleaning, Handling Missing data, Encoding Categorical data, Feature Scaling, Handling Time Series data; Feature Selection techniques, Data Transformation, Normalization, Dimensionality reduction.
• Data Visualization: Data Frame Basics, Different types of analysis, Different types of plots, Plotting fundamentals using Matplotlib, Plotting Data Distributions using Seaborn.

SUGGESTIVE READINGS
• TEXT BOOKS:
• There is no single textbook covering the material presented in this course. Here is a list of books
recommended for further reading in connection with the material presented:
• T1: Tom M. Mitchell, "Machine Learning", McGraw Hill International Edition.
• T2: Ethem Alpaydin, "Introduction to Machine Learning", Eastern Economy Edition, Prentice Hall of India, 2005.
• T3: Andreas C. Müller, Sarah Guido, "Introduction to Machine Learning with Python", O'Reilly (2016).

• REFERENCE BOOKS:
• R1: Sebastian Raschka, Vahid Mirjalili, "Python Machine Learning" (2014).
• R2: Richard O. Duda, Peter E. Hart, David G. Stork, "Pattern Classification", Wiley, 2nd Edition.
• R3: Christopher Bishop, "Pattern Recognition and Machine Learning", illustrated edition, Springer, 2006.

Data Sourcing
• pandas is used for data sourcing.

• pandas is a Python library for analyzing data.

• Why the name?
• "pandas" is derived from "panel data" and is also a play on the phrase "Python data analysis".
• Panel data is a subset of longitudinal data in which observations are made on the same subjects each time.
By: Prof. (Dr.) Vineet Mehan 6
Data Sourcing
• Why use pandas?

• pandas lets us analyze large data sets and draw conclusions based on statistical theory.

• pandas can clean messy data sets and make them readable and relevant.

• pandas is widely used in data science.

Data Sourcing
• Data science is a branch of computer science in which we study how to store, use, and analyze data to derive information from it.

• How to install pandas?

• 1. Open a command prompt.
• 2. Type:
•    python -m pip install pandas


Make a DataFrame that records the types of vehicles that passed a toll plaza.

    import pandas

    # Each key becomes a column; each list holds that column's values.
    mydataset = {'cars': ["Maruti", "Hyundai", "Tata"], 'passings': [20, 12, 15]}
    myvar = pandas.DataFrame(mydataset)
    print(myvar)


Import pandas as pd and use pd
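
A minimal sketch of the alias convention, rebuilding the toll plaza DataFrame through `pd`. The variable names are illustrative only.

```python
import pandas as pd  # "pd" is the conventional short alias for pandas

# Toll plaza data, now built through the alias instead of the full name.
mydataset = {"cars": ["Maruti", "Hyundai", "Tata"], "passings": [20, 12, 15]}
myvar = pd.DataFrame(mydataset)
print(myvar)
```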


Read data from a CSV File
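
A sketch of reading CSV data with `pd.read_csv()`. The CSV content here is hypothetical and held in memory so the example runs on its own; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# Hypothetical CSV content; with a file on disk you would call
# pd.read_csv("data.csv") (file name assumed) instead of using a buffer.
csv_text = """Duration,Pulse,Calories
60,110,409.1
60,117,479.0
45,109,282.4
"""

df = pd.read_csv(io.StringIO(csv_text))

# to_string() renders every row, even for long DataFrames.
print(df.to_string())
```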


Reading a CSV file and printing it without converting to a string
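
A sketch of what happens when a DataFrame is printed directly. It assumes the default display settings and uses a hypothetical 100-row frame: pandas then shows only the first and last rows, while `to_string()` shows all of them.

```python
import pandas as pd

# Hypothetical frame with 100 rows, longer than the default display limit.
df = pd.DataFrame({"Duration": range(100), "Pulse": range(100, 200)})

print(df)                            # abbreviated view of a long frame
print(pd.options.display.max_rows)   # row limit used when printing

short_view = repr(df)        # the text that print(df) shows
full_view = df.to_string()   # header plus every row
```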


Checking the pandas version
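
The installed version is available as a string attribute of the package, for example:

```python
import pandas as pd

# __version__ holds the installed pandas release as a string.
print(pd.__version__)
```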


Pandas Data Frames
• A pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

• Create a simple pandas DataFrame
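
A small sketch of building a DataFrame from a dictionary. The workout values are made up for illustration.

```python
import pandas as pd

# Hypothetical workout data: each key is a column name.
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

print(df)          # rows are labelled 0, 1, 2 by default
print(df.loc[0])   # loc selects a row by its label
```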


Load a CSV file into a DataFrame
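
Once the data is loaded, it helps to check what pandas inferred. The sketch below uses hypothetical in-memory CSV text in place of a file and inspects the resulting DataFrame.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
# The second data row has no Calories value.
csv_text = "Duration,Date,Pulse,Calories\n60,2020/12/01,110,409.1\n45,2020/12/02,109,\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (rows, columns)
print(df.dtypes)   # inferred type of each column
print(df.head())   # first rows (up to five)
```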


Data Cleaning
• Data cleaning means fixing bad data in your data set.

• Bad data could be:

• Empty cells

• Data in wrong format

• Wrong data

• Duplicates


The data set contains some empty cells ("Date" in row
22, and "Calories" in row 18 and 28).


The data set contains wrong format ("Date" in row 26).
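
One possible way to repair such a column, sketched under assumptions: the dates are hypothetical, the expected format is taken to be year/month/day with slashes, and the odd entry is taken to be the same date written without separators. `pd.to_datetime()` converts the strings, and `errors="coerce"` turns anything it cannot parse into NaT.

```python
import pandas as pd

# Hypothetical "Date" column: the middle entry lacks the slashes.
df = pd.DataFrame({"Date": ["2020/12/25", "20201226", "2020/12/27"]})

# First pass: parse the expected format; entries that do not parse become NaT.
dates = pd.to_datetime(df["Date"], format="%Y/%m/%d", errors="coerce")

# Second pass: retry only the failures with the alternative format.
failed = dates.isna() & df["Date"].notna()
dates[failed] = pd.to_datetime(df.loc[failed, "Date"], format="%Y%m%d", errors="coerce")

df["Date"] = dates
print(df)
```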


The data set contains wrong data ("Duration" in row 7).


The data set contains duplicates (row 11 and 12).


1. Remove Rows
• One way to deal with empty cells is to remove rows that contain
empty cells.

• This is usually OK, since data sets can be very big, and removing a few
rows will not have a big impact on the result.

• See Row 17 and 27 (removed)


The pandas dropna() method drops rows or columns that contain null values.

By default, the dropna() method returns a new DataFrame and does not change the original.
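
A short sketch of the default behaviour, using hypothetical data with one empty cell:

```python
import pandas as pd

# Hypothetical data: the second row has no Calories value.
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, None, 195.0]})

new_df = df.dropna()   # rows containing any null value are left out

print(new_df)    # two rows remain
print(len(df))   # the original still has all three rows
```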


If you want to change the original DataFrame, use the inplace=True argument.
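
A sketch of the in-place variant on the same kind of hypothetical data. Note that the call then returns None rather than a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, None, 195.0]})

# inplace=True modifies df itself and returns None.
result = df.dropna(inplace=True)

print(df)       # the row with the missing value is gone
print(result)   # None
```

Reassigning the result, as in `df = df.dropna()`, has the same effect and is a common alternative.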


3. Replace Empty Values

The fillna() method allows us to replace empty cells with a value.

For example, fillna(130) replaces NULL values with the number 130 (see row 17, replaced with 130).
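
A sketch of filling every empty cell with 130, on hypothetical data with gaps in two columns:

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, None, 30], "Calories": [409.1, None, 195.0]})

# Every empty cell in every column receives the same value.
filled = df.fillna(130)
print(filled)
```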


4. Replace value in a particular column

Values are replaced at positions 17, 27, 91, 118, and 141 in the Calories column only.
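
To limit the replacement to one column, fill that column and assign it back. A sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, None, 30], "Calories": [409.1, None, 195.0]})

# Fill only the Calories column; Duration keeps its empty cell.
df["Calories"] = df["Calories"].fillna(130)
print(df)
```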


5. Replace Using Mean, Median, or Mode
• A common way to replace empty cells is to calculate the mean, median, or mode value of the column.

• Mean: the average value

• Median: the center value when the values are sorted

• Mode: the most commonly occurring value
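
The three statistics can be computed and used as fill values as sketched below. The Calories values are hypothetical and chosen so the results are easy to verify by hand.

```python
import pandas as pd

# Hypothetical Calories column with one empty cell.
calories = pd.Series([300.0, 300.0, 250.0, None, 400.0], name="Calories")

mean_value = calories.mean()       # empty cells are skipped: 1250 / 4
median_value = calories.median()   # middle of 250, 300, 300, 400
mode_value = calories.mode()[0]    # mode() returns a Series; take its first entry

filled_with_mean = calories.fillna(mean_value)
filled_with_median = calories.fillna(median_value)
filled_with_mode = calories.fillna(mode_value)

print(mean_value, median_value, mode_value)
```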


Empty Values are replaced with mean
at position 17, 27, 91, 118, and 141 in
the Calories column only.

Mean here is 375.790244


Empty Values are replaced with median
at position 17, 27, 91, 118, and 141 in
the Calories column only.

Median here is 318.6


Empty Values are replaced with mode
at position 17, 27, 91, 118, and 141 in
the Calories column only.

Mode here is 300.0


Wrong Data
• "Wrong data" does not have to be "empty cells" or "wrong format", it
can just be wrong, like if someone registered "199" instead of "1.99".

• Sometimes you can spot wrong data by looking at the data set,
because you have an expectation of what it should be.

• If you take a look at our data set, you can see that in row 7, the
duration is 450, but for all the other rows the duration is between 30
and 60.

One way to fix wrong values is to replace them with something else.

In our example, it is most likely a typo, and the value should be "45" instead of "450", so we could simply set the value in row 7 to "45".
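
A sketch of correcting a single cell with `loc`. The Duration values are hypothetical, with the suspected typo placed at index 7.

```python
import pandas as pd

# Hypothetical Duration column: index 7 holds the suspected typo.
df = pd.DataFrame({"Duration": [60, 60, 60, 45, 45, 60, 60, 450, 30, 60]})

# loc[row_label, column_name] addresses a single cell.
df.loc[7, "Duration"] = 45
print(df.loc[7, "Duration"])
```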


For Larger Data
• For small data sets you might be able to replace the wrong data one
by one, but not for big data sets.

• To replace wrong data for larger data sets you can create some rules,
e.g. set some boundaries for legal values, and replace any values that
are outside of the boundaries.
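
A rule of this kind might be sketched as follows. The upper limit of 120 is an assumption made for illustration; the right boundary depends on the data.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 45, 450, 30, 200]})

# Assumed rule: no session lasts longer than 120 minutes.
upper_limit = 120

# Replace every value above the limit with the limit itself.
df.loc[df["Duration"] > upper_limit, "Duration"] = upper_limit
print(df)
```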

Removing Rows
• Another way of handling wrong data is to remove the rows that contain wrong data.

• This way you do not have to find out what to replace them with, and there is a good chance you do not need them for your analysis.

• The row at position 7 is removed.
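
Removal can use the same kind of boundary rule. In this sketch the data and the limit of 120 are assumptions; rows outside the limit are filtered out rather than corrected.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 45, 450, 30]})
upper_limit = 120  # assumed boundary for legal values

# Keep only the rows within the limit; the original df is unchanged.
cleaned = df[df["Duration"] <= upper_limit]
print(cleaned)
```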

Duplicate Data
• Duplicate rows are rows that have been registered more than one
time.

• By taking a look at our test data set, we can assume that row 11 and
12 are duplicates.

• To discover duplicates, we can use the duplicated() method.

• The duplicated() method returns a Boolean value for each row.

It returns True for every row that duplicates an earlier row, and False otherwise.
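
A sketch with hypothetical data in which the last row repeats the second one:

```python
import pandas as pd

# Hypothetical data: the last row repeats the second one.
df = pd.DataFrame({"Duration": [60, 45, 45], "Pulse": [110, 109, 109]})

# One Boolean per row; the first copy of a repeated row is not flagged.
flags = df.duplicated()
print(flags)
```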


Removing Duplicates
• To remove duplicates, use the drop_duplicates() method.

The duplicate row (row no 12) is now removed
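
A sketch of removing the repeated row from the same kind of hypothetical data. Like dropna(), drop_duplicates() returns a new DataFrame by default.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 45, 45], "Pulse": [110, 109, 109]})

# Keeps the first copy of each row and leaves df itself unchanged.
deduped = df.drop_duplicates()
print(deduped)
```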


Summary
• Methods of Sourcing Data

• Methods of Cleaning Data

Task
• Apply the various methods used for sourcing data, using suitable arrays or data sets. (BT-Level 3)

• Design a model that cleans empty cells, data in the wrong format, wrong data, and duplicates. (BT-Level 6)


REFERENCES
• https://www.javatpoint.com/machine-learning

• https://www.tutorialspoint.com/machine_learning/index.htm

• https://www.w3schools.com/python/

THANK YOU

For queries
Email: vineet.e13038@cumail.in