Data Wrangling

The document discusses performing data wrangling operations in Python such as data exploration, dealing with missing values, reshaping data, filtering data, and merging dataframes. It provides code examples to demonstrate these techniques.

Uploaded by

mehir

• 1. Study installation and configuration of R programming.

• 2. Data Wrangling I: Perform the following operations using Python on any open-source dataset (e.g., data.csv).
a. Import all the required Python libraries.
b. Locate open-source data from the web (e.g., https://www.kaggle.com). Provide a clear description of the data and its source (i.e., the URL of the website).
c. Load the dataset into a pandas DataFrame.
d. Data preprocessing: check for missing values in the data using the pandas isnull() function, and use describe() to get some initial statistics. Provide variable descriptions, the types of variables, etc. Check the dimensions of the DataFrame.
e. Data formatting and data normalization: summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the dataset. If variables are not in the correct data type, apply proper type conversions.
f. Turn categorical variables into quantitative variables in Python.
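Steps c to f can be sketched as follows. This is only a minimal illustration: the CSV text, its column names, and the chosen conversions are made up here so the example runs without downloading anything, and a real dataset would need its own choices.

```python
import io
import pandas as pd

# Hypothetical stand-in for a downloaded file such as data.csv
csv_text = """name,age,gender,marks,passed
Jai,17,M,90,yes
Princi,17,F,76,yes
Gaurav,18,M,,no
Anuj,17,M,74,yes
"""

# c. Load the dataset into a pandas DataFrame
df = pd.read_csv(io.StringIO(csv_text))

# d. Missing values per column, summary statistics, and dimensions
missing_per_column = df.isnull().sum()
summary = df.describe()
n_rows, n_cols = df.shape

# e. Inspect the data types and convert where needed
print(df.dtypes)
df["passed"] = df["passed"].map({"yes": True, "no": False})  # text -> boolean
df["gender"] = df["gender"].astype("category")               # text -> category

# f. Turn the categorical variable into numeric indicator columns
encoded = pd.get_dummies(df, columns=["gender"])
print(encoded.columns.tolist())
```

With a real file, `pd.read_csv("data.csv")` would take the place of the in-memory text.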
Practical based on Data Loading, Storage and File Formats
Data Wrangling
• Data wrangling is the process of gathering raw data and transforming it into another format so that it can be understood, accessed, and analyzed more quickly, and used for decision-making. Data wrangling is also known as data munging.
Importance Of Data Wrangling

• A book-selling website wants to show the top-selling books in different domains according to user preference. For example, if a new user searches for motivational books, the site wants to show the motivational books that sell the most or have a high rating.
• The website, however, holds a large amount of raw data from different users. This is where data munging, or data wrangling, is used. Data wrangling is not done by the system itself; it is done by data scientists. The data scientist wrangles the data so that the motivational books are sorted by sales or rating, or grouped with the other books that users buy together with them. The new user then makes a choice on that basis. This illustrates the importance of data wrangling.
Data Wrangling in Python

• Data wrangling is a crucial topic in data science and data analysis. In Python, the pandas library is used for data wrangling. pandas is an open-source library developed specifically for data analysis and data science. It is used for operations such as sorting, filtering, and grouping data.
Data wrangling in Python covers the following tasks:

1. Data exploration: the data is studied, analyzed, and understood by visualizing representations of it.
2. Dealing with missing values: most large datasets contain missing values (NaN). They need to be handled by replacing them with the mean, the mode (the most frequent value of the column), or simply by dropping the rows that contain a NaN value.
3. Reshaping data: the data is manipulated according to the requirements; new data can be added or existing data can be modified.
4. Filtering data: datasets sometimes contain unwanted rows or columns, which need to be removed or filtered out.
5. Other: after the raw dataset has been processed with the steps above, we get a dataset that fits our requirements and can be used for purposes such as data analysis, machine learning, data visualization, and model training.
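The three options named in item 2 (mean, mode, dropping rows) can be sketched as below. The small DataFrame is invented for illustration and uses real missing values (np.nan and None).

```python
import numpy as np
import pandas as pd

# Illustrative data with genuine missing values
df = pd.DataFrame({
    "Marks": [90, 76, np.nan, 74, 65, np.nan, 71],
    "Gender": ["M", "F", None, "M", "M", "F", "M"],
})

# Option 1: replace missing numbers with the column mean
mean_filled = df["Marks"].fillna(df["Marks"].mean())

# Option 2: replace missing categories with the most frequent value (mode)
mode_filled = df["Gender"].fillna(df["Gender"].mode()[0])

# Option 3: drop every row that contains at least one missing value
dropped = df.dropna()

print(mean_filled.tolist())
print(mode_filled.tolist())
print(len(dropped))
```

Which option is appropriate depends on the data: dropping rows loses information, while filling with the mean or mode keeps the rows but changes the spread of the column.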
Data exploration in Python

# Import pandas package
import pandas as pd

# Assign data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}

# Convert into DataFrame
df = pd.DataFrame(data)

# Display data
df
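Beyond displaying the DataFrame, a few quick text summaries help with exploration. The sketch below rebuilds the same data and inspects its size, column types, and category counts; the choice of summaries is an assumption about what is useful here, and it uses printed summaries rather than plots.

```python
import pandas as pd

data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}
df = pd.DataFrame(data)

# Number of rows and columns
print(df.shape)

# Column types: Marks mixes integers with the string 'NaN',
# so pandas stores it with the generic object dtype
print(df.dtypes)
marks_dtype = df['Marks'].dtype

# First few rows
print(df.head(3))

# How many students fall into each gender category
gender_counts = df['Gender'].value_counts()
print(gender_counts)
```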
Dealing with missing values in Python

• As the previous output shows, the Marks column contains 'NaN' entries. These stand for missing values (stored here as the string 'NaN'), and data wrangling takes care of them by replacing them with the column mean.

# Compute the average of the numeric marks
c = avg = 0
for ele in df['Marks']:
    if str(ele).isnumeric():
        c += 1
        avg += ele
avg /= c

# Replace missing values
df = df.replace(to_replace="NaN", value=avg)

# Make sure the column is stored as numbers
df['Marks'] = df['Marks'].astype(float)

# Display data
df
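The same replacement can be written without an explicit loop. The sketch below is an alternative to the loop-based version, not the original author's method: `pd.to_numeric` turns the 'NaN' strings into real missing values, and `fillna` replaces them with the column mean.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
    'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71],
})

# Convert to numbers; anything that cannot be parsed becomes a real NaN
df['Marks'] = pd.to_numeric(df['Marks'], errors='coerce')

# mean() skips NaN by default, so this is the mean of the five known marks
df['Marks'] = df['Marks'].fillna(df['Marks'].mean())

print(df)
```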
Data Replacing in Data Wrangling

• In the Gender column, we can replace the category labels with numeric codes.

# Categorize gender
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1}).astype(float)

# Display data
df
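When a column has many categories, writing the mapping dictionary by hand becomes tedious. One possible alternative, sketched here, is to let pandas assign the codes through `pd.Categorical`; the explicit `categories` order is chosen so the codes match the mapping above (M as 0, F as 1).

```python
import pandas as pd

gender = pd.Series(['M', 'F', 'M', 'M', 'M', 'F', 'F'], name='Gender')

# Codes follow the position of each label in `categories`
codes = pd.Categorical(gender, categories=['M', 'F']).codes

print(list(codes))
```

If `categories` is omitted, pandas sorts the labels, so 'F' would receive code 0 and 'M' code 1.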
Filtering data in Data Wrangling

• Suppose we need the name, gender, and marks of the top-scoring students. Here we need to remove the unwanted data using the pandas slicing method (boolean indexing), and then drop the column we do not need.

# Filter top scoring students
df = df[df['Marks'] >= 75].copy()

# Remove age column from filtered DataFrame
df.drop('Age', axis=1, inplace=True)

# Display data
df
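The row filter and the column selection can also be combined in a single `.loc` call. This is a sketch of an equivalent approach; the data is rebuilt here with the two missing marks already filled with the column mean of 75.2 so the example runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
    'Age': [17, 17, 18, 17, 18, 17, 17],
    'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
    'Marks': [90, 76, 75.2, 74, 65, 75.2, 71],
})

# Rows: marks of at least 75. Columns: only the ones we want to keep.
top = df.loc[df['Marks'] >= 75, ['Name', 'Gender', 'Marks']]

print(top)
```

Unlike `drop(..., inplace=True)`, this leaves the original `df` unchanged and returns the filtered result as a new DataFrame.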
Data Wrangling Using Merge Operation

• The merge operation combines two raw datasets into the desired format.
• Syntax: pd.merge(data_frame1, data_frame2, on="field")
• Here, field is the name of a column that is common to both DataFrames.
• For example: suppose a teacher has two sets of data. The first contains the details of the students, and the second contains the pending fee status, taken from the accounts office. The teacher can use the merge operation to combine the two and give the data meaning. The merged data is easier to analyze, and the teacher saves the time and effort of merging it manually.
• Creating the first DataFrame for the merge operation:

# import module
import pandas as pd

# creating DataFrame for Student Details
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 106,
           107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot',
             'Pooja', 'Rahul', 'Nikita',
             'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})

# printing details
print(details)
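To show the merge itself, the sketch below pairs a shortened version of the student details with a second DataFrame of pending fees. The fees table, its PENDING column, and its values are hypothetical and invented only for illustration.

```python
import pandas as pd

# Shortened student details (first three rows of the table above)
details = pd.DataFrame({
    'ID': [101, 102, 103],
    'NAME': ['Jagroop', 'Praveen', 'Harjot'],
    'BRANCH': ['CSE', 'CSE', 'CSE'],
})

# Hypothetical pending-fees table from the accounts office
fees = pd.DataFrame({
    'ID': [101, 102, 104],
    'PENDING': [5000, 0, 250],
})

# Default merge is an inner join: only IDs present in both tables remain
merged = pd.merge(details, fees, on='ID')

# A left join keeps every student; missing fee records show up as NaN
merged_left = pd.merge(details, fees, on='ID', how='left')

print(merged)
print(merged_left)
```

The `how` argument decides which rows survive, so it is worth choosing deliberately when the two tables do not cover exactly the same IDs.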
