Data Wrangling
• 2. Data Wrangling: Perform the following operations using Python on any open-source dataset (e.g., data.csv)
a. Import all the required Python Libraries.
b. Locate open-source data on the web (e.g., https://www.kaggle.com). Provide a clear description of the
data and its source (i.e., the URL of the website).
c. Load the dataset into a pandas DataFrame.
d. Data Preprocessing: Check for missing values in the data using the pandas isnull() function, and use the
describe() function to get some initial statistics. Provide variable descriptions, types of variables, etc. Check
the dimensions of the DataFrame.
e. Data Formatting and Data Normalization: Summarize the types of variables by checking the data types (i.e.,
character, numeric, integer, factor, and logical; in pandas these correspond to object, float64, int64, category,
and bool) of the variables in the dataset. If variables are not in the correct data type, apply proper type conversions.
f. Turn categorical variables into quantitative variables in Python.
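Steps d to f can be sketched roughly as follows. This is only an illustration: the small DataFrame below is a hypothetical stand-in for a real downloaded dataset, and its column names and values are assumptions, not part of the assignment.

```python
import pandas as pd

# Hypothetical stand-in for a dataset loaded with pd.read_csv("data.csv")
df = pd.DataFrame({
    "age": ["17", "18", "17", "18"],      # numbers stored as text
    "gender": ["M", "F", "M", "F"],       # categorical variable
    "marks": [90.0, None, 74.0, 65.0],    # numeric column with a missing value
})

# d. Missing values, initial statistics, and dimensions
missing_per_column = df.isnull().sum()
summary = df.describe()
n_rows, n_cols = df.shape

# e. Inspect the data types and convert where they are wrong
print(df.dtypes)
df["age"] = df["age"].astype(int)
df["gender"] = df["gender"].astype("category")

# f. Turn the categorical variable into numbers
df["gender_code"] = df["gender"].cat.codes                 # label encoding
one_hot = pd.get_dummies(df["gender"], prefix="gender")    # one-hot encoding
print(df)
print(one_hot)
```

Label encoding keeps a single column, while one-hot encoding creates one column per category; which one suits a given dataset depends on the later analysis.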
Practical based on Data Loading, Storage and File Formats
Data Wrangling
• Data wrangling is the process of gathering and transforming raw
data into another format so that it can be understood, accessed, and
analyzed in less time and can support decision-making. Data wrangling
is also known as data munging.
Importance of Data Wrangling
• Data wrangling is a crucial topic in data science and data analysis.
The pandas library is commonly used for data wrangling in Python. pandas
is an open-source Python library developed specifically for data analysis
and data science. It is used for operations such as sorting, filtering,
and grouping data.
Data wrangling in Python covers the following functionalities:
1. Data exploration: The data is studied, analyzed, and understood, often by
visualizing representations of it.
2. Dealing with missing values: Most large datasets contain missing values (NaN). These
need to be handled, either by replacing them with the mean or the mode (the most
frequent value) of the column, or simply by dropping the rows that contain
a NaN value.
3. Reshaping data: The data is manipulated according to the requirements;
new data can be added or pre-existing data can be modified.
4. Filtering data: Sometimes datasets contain unwanted rows or columns that
need to be removed or filtered out.
5. Other: After the raw dataset has been processed with the above functionalities, we get a
dataset that meets our requirements and can be used for purposes such as data
analysis, machine learning, data visualization, and model training.
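Reshaping and filtering (items 3 and 4 above) might look like the following minimal sketch. The student values and the pass mark of 70 are assumptions chosen only for illustration.

```python
import pandas as pd

# Small illustrative dataset (values are assumptions)
df = pd.DataFrame({
    "Name": ["Jai", "Princi", "Gaurav", "Anuj"],
    "Age": [17, 17, 18, 17],
    "Marks": [90, 76, 74, 65],
})

# Reshaping: add a new column derived from existing data
df["Passed"] = df["Marks"] >= 70   # hypothetical pass mark

# Filtering rows: keep only the students aged 17
seventeen = df[df["Age"] == 17]

# Filtering columns: drop a column that is not needed
without_age = df.drop(columns=["Age"])

print(seventeen)
print(without_age)
```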
Data exploration in Python
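# Import pandas and NumPy
import pandas as pd
import numpy as np
# Assign data (np.nan marks a missing value)
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, np.nan, 74, 65, np.nan, 71]}
# Convert the dictionary into a DataFrame
df = pd.DataFrame(data)
# Display data
df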
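A few common pandas calls for a first look at such a dataset are sketched below. The sketch rebuilds the student data so that it runs on its own, and it assumes the missing marks are stored as np.nan rather than as the text 'NaN'.

```python
import numpy as np
import pandas as pd

# Student data with missing marks stored as np.nan (an assumption of this sketch)
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, np.nan, 74, 65, np.nan, 71]}
df = pd.DataFrame(data)

print(df.head())        # first rows of the table
df.info()               # column names, non-null counts, and data types
print(df.describe())    # summary statistics for the numeric columns

# Frequency of each category and number of missing marks
gender_counts = df['Gender'].value_counts()
missing_marks = df['Marks'].isnull().sum()
print(gender_counts)
print("Missing marks:", missing_marks)
```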
Dealing with missing values in Python
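Two usual ways of handling NaN values, replacing them with the column mean or dropping the affected rows, can be sketched as follows. This is a minimal illustration that assumes the student data from the exploration example, with missing marks stored as np.nan.

```python
import numpy as np
import pandas as pd

# Student data with two missing marks (np.nan)
df = pd.DataFrame({
    'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
    'Age': [17, 17, 18, 17, 18, 17, 17],
    'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
    'Marks': [90, 76, np.nan, 74, 65, np.nan, 71]})

# Option 1: replace NaN with the mean of the column (mean() skips NaN)
mean_marks = df['Marks'].mean()
filled = df.fillna({'Marks': mean_marks})

# Option 2: drop every row whose 'Marks' value is NaN
dropped = df.dropna(subset=['Marks'])

print(filled)
print(dropped)
```

Filling keeps all seven rows, whereas dropping leaves only the five rows with known marks; the better choice depends on how much data can be spared.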
Data Wrangling Using Merge Operation
• The merge operation is used to combine two raw datasets into the desired format.
• Syntax: pd.merge(data_frame1, data_frame2, on="field")
• Here, field is the name of the column that is common to both DataFrames.
• For example: Suppose that a teacher has two sets of data. The first
contains the details of the students, and the second contains the
pending fees status obtained from the accounts office.
The teacher can use the merge operation to combine the two
datasets into one meaningful table. This makes the data easier to analyze
and saves the teacher the time and effort of merging it manually.
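• Creating the first DataFrame for the merge operation:
# import module
import pandas as pd
# creating a DataFrame of student details (illustrative values)
details = pd.DataFrame({
    'ID': [101, 102, 103],
    'NAME': ['Jai', 'Princi', 'Gaurav']})
# printing details
print(details)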
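A complete, self-contained sketch of the teacher example might look like this. The column names ID, NAME, and PENDING, the name fees_status, and all values are hypothetical and serve only to illustrate pd.merge.

```python
import pandas as pd

# First DataFrame: student details (illustrative values)
details = pd.DataFrame({
    'ID': [101, 102, 103],
    'NAME': ['Jai', 'Princi', 'Gaurav'],
})

# Second DataFrame: pending fees status from the accounts office (illustrative values)
fees_status = pd.DataFrame({
    'ID': [101, 102, 103],
    'PENDING': ['5000', '250', 'NIL'],
})

# Merge the two DataFrames on the column they share
merged = pd.merge(details, fees_status, on='ID')
print(merged)
```

By default pd.merge performs an inner join, so only IDs present in both DataFrames appear in the result.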