Data Wrangling

The document discusses performing data wrangling operations in Python such as data exploration, dealing with missing values, reshaping data, filtering data, and merging dataframes. It provides code examples to demonstrate these techniques.

Uploaded by

mehir


• 1. Study installation and configuration of R programming.

• 2. Data Wrangling, I: Perform the following operations using Python on any open-source dataset (e.g., data.csv).
a. Import all the required Python libraries.
b. Locate open-source data from the web (e.g., https://www.kaggle.com). Provide a clear description of the
data and its source (i.e., the URL of the website).
c. Load the dataset into a pandas DataFrame.
d. Data Preprocessing: Check for missing values in the data using the pandas isnull() function, and use the
describe() function to get some initial statistics. Provide variable descriptions, types of variables, etc. Check
the dimensions of the DataFrame.
e. Data Formatting and Data Normalization: Summarize the types of variables by checking the data types (i.e.,
character, numeric, integer, factor, and logical) of the variables in the dataset. If variables are not in the correct
data type, apply proper type conversions.
f. Turn categorical variables into quantitative variables in Python.
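Steps c to f above can be sketched roughly as follows. This is only an illustration under stated assumptions: the CSV text, the column names (name, age, gender, score), and the gender mapping are hypothetical stand-ins for whatever open-source dataset is actually chosen.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a downloaded file such as data.csv
csv_text = """name,age,gender,score
Asha,21,F,88.5
Ben,22,M,
Chen,,M,91.0
Dia,23,F,79.0
"""

# c. Load the dataset into a pandas DataFrame
students = pd.read_csv(io.StringIO(csv_text))

# d. Check dimensions, missing values, and initial statistics
n_rows, n_cols = students.shape
missing_per_column = students.isnull().sum()
summary = students.describe()

# e. Inspect data types and apply a type conversion where needed
dtypes_before = students.dtypes
students["age"] = students["age"].astype("Int64")  # nullable integer type

# f. Turn a categorical variable into a quantitative one (assumed mapping)
students["gender_code"] = students["gender"].map({"M": 0, "F": 1})

print(n_rows, n_cols)
print(missing_per_column)
print(summary)
print(students.dtypes)
```

With a real file, the io.StringIO wrapper would be replaced by the file path passed to pd.read_csv.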
Practical based on Data Loading, Storage and File Formats
Data Wrangling
• Data wrangling is the process of gathering, collecting, and
transforming raw data into another format so that it can be understood,
accessed, and analyzed in less time and can better support decision-making.
Data wrangling is also known as data munging.
Importance Of Data Wrangling

• A book-selling website wants to show the top-selling books of different domains
according to user preference. For example, if a new user searches for
motivational books, the website wants to show the motivational books that
sell the most or have a high rating.
• However, the website holds a large amount of raw data from different users. This
is where data munging, or data wrangling, comes in. Data wrangling is not
done by the system itself; it is done by data scientists.
The data scientist wrangles the data so that the motivational books are sorted
by how many copies were sold, by rating, or by which other books users
bought together with them. On that basis, the new user
makes a choice. This illustrates the importance of data wrangling.
Data Wrangling in Python

• Data wrangling is a crucial topic for data science and data analysis.
The pandas library is used for data wrangling in Python. pandas is an
open-source Python library developed specifically for data analysis
and data science. It is used for operations such as sorting, filtering,
and grouping data.
Data wrangling in Python deals with the following functionalities:

1. Data exploration: The data is studied, analyzed, and understood by
visualizing representations of it.
2. Dealing with missing values: Most large datasets contain missing values (NaN).
They need to be taken care of by replacing them with the mean, the mode
(the most frequent value of the column), or simply by dropping the rows that have
a NaN value.
3. Reshaping data: The data is manipulated according to the requirements;
new data can be added or pre-existing data can be modified.
4. Filtering data: Sometimes datasets contain unwanted rows or columns that
need to be removed or filtered out.
5. Other: After the raw dataset has been processed with the above functionalities, we get an efficient
dataset that meets our requirements and can be used for purposes such as data
analysis, machine learning, data visualization, and model training.
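As a minimal sketch of the reshaping step (item 3), the example below adds a new column and then converts a wide table into a long one. The subject columns and their values are hypothetical, and melt() is only one of several pandas reshaping tools.

```python
import pandas as pd

# Hypothetical wide table: one row per student, one column per subject
wide = pd.DataFrame({
    "Name": ["Jai", "Princi"],
    "Maths": [90, 76],
    "Science": [85, 80],
})

# Add new data: a column derived from existing ones
wide["Total"] = wide["Maths"] + wide["Science"]

# Reshape from wide to long: one row per (student, subject) pair
long = wide.melt(id_vars="Name", value_vars=["Maths", "Science"],
                 var_name="Subject", value_name="Marks")

print(wide)
print(long)
```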
Data exploration in Python

# Import pandas package
import pandas as pd

# Assign data (missing marks are recorded as the string 'NaN')
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}

# Convert into DataFrame
df = pd.DataFrame(data)

# Display data
df
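Beyond displaying the table, a few inspection calls help with exploration. The sketch below rebuilds the same DataFrame so that it runs on its own; the variable names true_missing and placeholder_missing are illustrative. It also shows that isnull() does not flag the string 'NaN', since that is ordinary text rather than a real missing value.

```python
import pandas as pd

data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}
df = pd.DataFrame(data)

print(df.shape)    # (rows, columns)
print(df.dtypes)   # Marks is 'object' because it mixes numbers and strings
print(df.head(3))  # first three rows

# isnull() only detects real missing values, not the text 'NaN'
true_missing = int(df['Marks'].isnull().sum())
# Count the placeholder strings explicitly
placeholder_missing = int((df['Marks'] == 'NaN').sum())
print(true_missing, placeholder_missing)
```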
Dealing with missing values in Python

• As we can see from the previous output, the Marks column contains NaN
entries, which stand for missing values in the DataFrame. Data wrangling
takes care of them here by replacing them with the column mean.

# Compute the average of the valid marks
c = avg = 0
for ele in df['Marks']:
    # Skip the 'NaN' placeholders; the valid marks here are whole numbers
    if str(ele).isnumeric():
        c += 1
        avg += ele
avg /= c

# Replace missing values
df = df.replace(to_replace="NaN", value=avg)

# Display data
df
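The same mean imputation can also be written with built-in pandas functions instead of a manual loop. This is an alternative sketch, not the method used in the slides: it assumes the placeholder is the string 'NaN' and uses pd.to_numeric() to turn it into a real missing value before calling fillna().

```python
import pandas as pd

# Marks column as in the example, with 'NaN' strings as placeholders
marks = pd.Series([90, 76, 'NaN', 74, 65, 'NaN', 71], name='Marks')

# Convert to numbers; anything that is not a valid number becomes a real NaN
numeric_marks = pd.to_numeric(marks, errors='coerce')

# mean() skips NaN values by default
mean_mark = numeric_marks.mean()

# Replace the missing entries with the column mean
filled = numeric_marks.fillna(mean_mark)

print(mean_mark)
print(filled)
```

A side benefit of this route is that the column ends up with a numeric dtype, which later comparisons and statistics rely on.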
Data Replacing in Data Wrangling

• In the Gender column, we can replace the category labels with
numbers (here M becomes 0 and F becomes 1).

# Categorize gender
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1}).astype(float)

# Display data
df
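Two other common ways to make a categorical column quantitative are category codes and one-hot encoding. The sketch below is illustrative only. Note that the category codes are assigned in sorted label order, so they differ from the explicit M = 0, F = 1 mapping used above.

```python
import pandas as pd

gender = pd.Series(['M', 'F', 'M', 'M', 'M', 'F', 'F'], name='Gender')

# Option 1: integer codes assigned in sorted category order (F -> 0, M -> 1)
codes = gender.astype('category').cat.codes

# Option 2: one-hot encoding, one indicator column per category
one_hot = pd.get_dummies(gender, prefix='Gender', dtype=int)

print(codes.tolist())
print(one_hot)
```

An explicit map() keeps full control over which number each label gets, while get_dummies() avoids implying an order between categories.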
Filtering data in Data Wrangling

• Suppose there is a requirement for the name, gender, and marks
of the top-scoring students. Here we need to remove the
unwanted data using the pandas slicing method.

# Filter top-scoring students
df = df[df['Marks'] >= 75].copy()

# Remove the Age column from the filtered DataFrame
df.drop('Age', axis=1, inplace=True)

# Display data
df
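The row filter and the column selection can also be combined in a single .loc call. This is a sketch under the assumption that the missing marks have already been replaced by the column mean (75.2); the DataFrame is rebuilt here so the block runs on its own, and the sorting step is an optional extra.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
    'Age': [17, 17, 18, 17, 18, 17, 17],
    'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
    # Missing marks assumed to be already filled with the mean
    'Marks': [90, 76, 75.2, 74, 65, 75.2, 71],
})

# Keep rows with Marks >= 75 and only the columns that are needed
top = df.loc[df['Marks'] >= 75, ['Name', 'Gender', 'Marks']]

# Optional: order the result from highest to lowest marks
top = top.sort_values('Marks', ascending=False).reset_index(drop=True)

print(top)
```

Selecting the wanted columns directly avoids the separate drop() call and leaves the original DataFrame unchanged.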
Data Wrangling Using Merge Operation

• The merge operation is used to combine two raw datasets into the desired format.
• Syntax: pd.merge(data_frame1, data_frame2, on="field")
• Here, field is the name of the column that is common to both DataFrames.
• For example: suppose that a teacher has two sets of data. The first
contains the details of students, and the second contains the
pending fees status, taken from the accounts office.
The teacher can use the merge operation to combine the two
and give the data meaning. This makes the analysis easier and
saves the teacher the time and effort of merging manually.
• Creating the first DataFrame to perform the merge operation:

# import module
import pandas as pd

# creating DataFrame for Student Details
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 106,
           107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot',
             'Pooja', 'Rahul', 'Nikita',
             'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})

# printing details
print(details)
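To illustrate how the merge itself might look, the sketch below joins a shortened student table with a second table on the shared ID column. The fees DataFrame, its PENDING column, and all of its values are hypothetical and are not taken from the slides.

```python
import pandas as pd

# Shortened version of the student details table
details = pd.DataFrame({
    'ID': [101, 102, 103],
    'NAME': ['Jagroop', 'Praveen', 'Harjot'],
    'BRANCH': ['CSE', 'CSE', 'CSE'],
})

# Hypothetical fee records; column name and amounts are illustrative only
fees = pd.DataFrame({
    'ID': [101, 102, 104],
    'PENDING': [5000, 250, 0],
})

# Default inner merge: keeps only IDs present in both tables
merged = pd.merge(details, fees, on='ID')

# Left merge: keeps every student, with NaN where no fee record exists
left = pd.merge(details, fees, on='ID', how='left')

print(merged)
print(left)
```

The how parameter controls which rows survive the merge, so it is worth choosing deliberately when the two tables do not cover exactly the same IDs.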
