Unit 4 Fod
UNIT IV
PYTHON LIBRARIES FOR DATA WRANGLING
LIST OF IMPORTANT QUESTIONS
UNIT IV
PYTHON LIBRARIES FOR DATA WRANGLING
PART – B
1. Explain in detail about Data Wrangling in Python.
2. Explain the two main ways to carry out boolean masking.
3. Explain in detail about Aggregation in Pandas.
4. Pandas DataFrame - transform() function
5. Explain in detail about the pivot table using python.
PART – A
6. Write short note on Data Manipulation using Pandas.
• Dropping columns in the data.
• Dropping rows in the data.
• Renaming a column in the dataset.
• Select columns with specific data types.
• Slicing the dataset.
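These operations can be sketched on a small made-up DataFrame (the column names below are illustrative only, not from any real dataset):

```python
import pandas as pd

# a small made-up DataFrame for illustration
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cara'],
    'age': [25, 32, 29],
    'city': ['Pune', 'Delhi', 'Chennai']})

dropped_col = df.drop(columns=['city'])       # dropping a column
dropped_row = df.drop(index=[0])              # dropping a row
renamed = df.rename(columns={'age': 'Age'})   # renaming a column
numeric = df.select_dtypes(include='number')  # columns with a numeric dtype
sliced = df.iloc[0:2]                         # slicing the first two rows
```

Each call returns a new DataFrame; the original df is left unchanged unless it is reassigned.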
7. How can Pandas get missing data?
In order to check missing values in a Pandas DataFrame, we use the functions isnull()
and notnull(). Both functions help in checking whether a value is NaN or not. These functions
can also be used on a Pandas Series in order to find null values in a series.
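A short sketch of both functions on a toy Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

mask_null = s.isnull()          # True where the value is NaN
mask_not_null = s.notnull()     # True where the value is present
missing_count = s.isnull().sum()  # number of missing values
print(missing_count)  # 2
```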
8. How do you treat missing data in Python?
There are several different methods to handle them:
1. Drop rows or columns that have a missing value.
2. Drop rows or columns that only have missing values.
3. Drop rows or columns based on a threshold value.
4. Drop based on a particular subset of columns.
5. Fill with a constant value.
6. Fill with an aggregated value.
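All six cases map onto dropna() and fillna(); a short sketch on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0],
                   'C': [7.0, 8.0, 9.0]})

drop_any = df.dropna()                # 1. drop rows with any missing value
drop_all = df.dropna(how='all')       # 2. drop rows that are entirely missing
drop_thr = df.dropna(thresh=2)        # 3. keep rows with at least 2 non-NaN values
drop_sub = df.dropna(subset=['A'])    # 4. drop rows where column 'A' is missing
fill_zero = df.fillna(0)              # 5. fill with a constant value
fill_mean = df.fillna(df.mean())      # 6. fill with an aggregated value (column mean)
```

Passing axis=1 to dropna() applies the same rules to columns instead of rows.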
9. How to use Hierarchical Indexes with Pandas?
# importing pandas library as alias pd
import pandas as pd
# calling the pandas read_csv() function
# and storing the result in DataFrame df
df = pd.read_csv('homelessness.csv')
# setting a hierarchical index (MultiIndex) on two columns;
# 'region' and 'state' are assumed to be columns in the file
df = df.set_index(['region', 'state'])
print(df.head())
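A self-contained sketch that builds a hierarchical index without needing a CSV file (the column names and values below are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West'],
    'state': ['NY', 'PA', 'CA', 'WA'],
    'pop': [19, 12, 39, 7]})

# set_index with a list of columns creates a MultiIndex
hdf = df.set_index(['region', 'state'])

print(hdf.loc['East'])          # all rows in the 'East' region
print(hdf.loc[('West', 'CA')])  # a single (region, state) row
```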
10. List some of the Aggregation functions in Pandas.
Pandas provide us with a variety of aggregate functions. These functions help to perform
various activities on the datasets. The functions are:
• .count(): This gives a count of the data in a column.
• .sum(): This gives the sum of the data in a column.
• .min() and .max(): These help to find the minimum value and the maximum value in a
column, respectively.
• .mean() and .median(): These help to find the mean and the median of the values in a
column, respectively.
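A quick illustration of these functions on a toy Series:

```python
import pandas as pd

s = pd.Series([4, 8, 15, 16, 23])

print(s.count())         # 5
print(s.sum())           # 66
print(s.min(), s.max())  # 4 23
print(s.mean())          # 13.2
print(s.median())        # 15.0
```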
11. What is grouping in pandas?
Pandas groupby is used for grouping the data according to categories and
applying a function to each category. It also helps to aggregate data efficiently. The Pandas
DataFrame.groupby() function is used to split the data into groups based on some criteria.
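A minimal groupby() sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B'],
                   'points': [10, 20, 5, 15]})

# split the rows into groups by 'team', then aggregate each group
totals = df.groupby('team')['points'].sum()
print(totals['A'], totals['B'])  # 30 20
```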
PART – B
1. Explain in detail about Data Wrangling in Python.
Data wrangling involves processing data in various ways, such as merging, grouping and
concatenating, for the purpose of analysing it or getting it ready to be used with another
set of data. Python has built-in features to apply these wrangling methods to various data sets
to achieve the analytical goal. In this chapter we will look at a few examples describing these
methods.
Merging Data
The Pandas library in python provides a single function, merge, as the entry point for all standard
database join operations between DataFrame objects −
Let us now create two different DataFrames and perform the merging operations on it.
# import the pandas library
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(left)
print(right)
Name id subject_id
0 Alex 1 sub1
1 Amy 2 sub2
2 Allen 3 sub4
3 Alice 4 sub6
4 Ayoung 5 sub5
Name id subject_id
0 Billy 1 sub2
1 Brian 2 sub4
2 Bran 3 sub3
3 Bryce 4 sub6
4 Betty 5 sub5
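The merge call itself can then be sketched as follows (the two frames are repeated so the snippet runs on its own):

```python
import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# inner join on the common 'id' column; overlapping column
# names get the default _x / _y suffixes
result = pd.merge(left, right, on='id')
print(result)

# joining on 'subject_id' instead keeps only the subjects
# that appear in both frames
by_subject = pd.merge(left, right, on='subject_id')
print(by_subject)
```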
Grouping Data
Grouping data sets is a frequent need in data analysis, where we need the result in terms of
the various groups present in the data set. Pandas has in-built methods which can roll the data into
various groups.
In the below example we group the data by year and then get the result for a specific year.
# import the pandas library
import pandas as pd
# a small sample DataFrame with a 'Year' column
df = pd.DataFrame({'Team': ['Riders', 'Kings', 'Riders', 'Kings'],
 'Year': [2014, 2014, 2015, 2015],
 'Points': [876, 741, 789, 812]})
grouped = df.groupby('Year')
print(grouped.get_group(2014))
Concatenating Data
Pandas provides various facilities for easily combining Series and DataFrame
objects. In the example below the concat function performs concatenation operations
along an axis. Let us create different objects and do concatenation.
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one,two]))
2. Explain the two main ways to carry out boolean masking.
The NumPy library in Python is a popular library for working with arrays. Boolean masking, also
called boolean indexing, is a feature in NumPy that allows for the filtering of values
in NumPy arrays.
The first method returns an array with the required results. In this method, we pass a condition in
the indexing brackets, [ ], of an array. The condition can be any comparison, like arr > 5, for the
array arr.
Syntax
arr[arr > 5]
Return value
This method returns a NumPy array, ndarray, with values that satisfy the given condition. The line
in the example given above will return all the values in arr that are greater than 5.
Example
# importing NumPy
import numpy as np
# Creating a NumPy array
arr = np.arange(15)
# Printing our array to observe
print(arr)
# Using boolean masking to filter elements greater than or equal to 8
print(arr[arr >= 8])
# Using boolean masking to filter elements equal to 12
print(arr[arr == 12])
The second method returns a boolean array that has the same size as the array it represents.
A boolean array only contains the boolean values of either True or False. This boolean array is
also called a mask array, or simply a mask. We'll discuss boolean arrays in more detail in the
"Return value" section.
Syntax
The code snippet given below shows us how to use this method:
mask = arr > 5
Return value
• Return an array with the same size and dimensions as arr. This array will only contain the
values True and False. All the True values represent elements in the same position
in arr that satisfy our condition, and all the False values represent elements in the same
position in arr that do not satisfy our condition.
• Store this boolean array in a mask array.
The mask array can be passed in the index brackets of arr to return the values that satisfy our
condition. We will see how this works in our coding example.
Example
# importing NumPy
import numpy as np
# Creating a NumPy array
arr = np.array([[ 0, 9, 0],
[ 0, 7, 8],
[ 6, 0, 1]])
# Printing our array to observe
print(arr)
# Creating a mask array
mask = arr > 5
# Printing the mask array
print(mask)
# Printing the filtered array using both methods
print(arr[mask])
print(arr[arr > 5])
3. Explain in detail about Aggregation in Pandas.
Pandas provide us with a variety of aggregate functions. These functions help to perform
various activities on the datasets. The functions are:
• .count(): This gives a count of the data in a column.
• .sum(): This gives the sum of the data in a column.
• .min() and .max(): These help to find the minimum value and the maximum value in a
column, respectively.
• .mean() and .median(): These help to find the mean and the median of the values in a
column, respectively.
Parameters:
func: the function (or functions) used for aggregating the data. A function must either work
when passed a DataFrame or when passed to DataFrame.apply(). A dict can also be passed,
if the keys are the column names.
Returns:
scalar: when Series.agg is called with a single function.
Series: when DataFrame.agg is called with a single function.
DataFrame: when DataFrame.agg is called with several functions.
Example:
import pandas as pd
import numpy as np
info = pd.DataFrame([[1,5,7],[10,12,15],[18,21,24],[np.nan,np.nan,np.nan]],
 columns=['X','Y','Z'])
info.agg(['sum','min'])
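As noted above, a dict can also be passed to agg(), with the keys naming the columns; a short sketch:

```python
import pandas as pd

info = pd.DataFrame([[1, 5, 7], [10, 12, 15], [18, 21, 24]],
                    columns=['X', 'Y', 'Z'])

# the dict keys are column names, the values are the aggregations;
# with one function per column, a Series is returned
result = info.agg({'X': 'sum', 'Y': 'min'})
print(result['X'], result['Y'])  # 29 5
```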
4. Pandas DataFrame - transform() function
The transform() function is used to call a function (func) on self, producing a DataFrame with
transformed values that has the same axis length as self.
Syntax:
DataFrame.transform(self, func, axis=0, *args, **kwargs)
Parameters:
func: Function to use for transforming the data. If a function, it must either work when
passed a DataFrame or when passed to DataFrame.apply.
Accepted combinations are:
• function
• string function name
• list of functions and/or function names, e.g. [np.exp, 'sqrt']
• dict of axis labels -> functions, function names or lists of such.
(Type: function, str, list or dict. Required.)
Returns: DataFrame
A DataFrame that must have the same length as self.
Raises: ValueError - If the returned DataFrame has a different length than self.
Example:
In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.DataFrame({'X': range(4), 'Y': range(2, 6)})
df
Out[2]:
X Y
0 0 2
1 1 3
2 2 4
3 3 5
In [3]:
df.transform(lambda x: x + 1)
Out[3]:
X Y
0 1 3
1 2 4
2 3 5
3 4 6
Even though the resulting DataFrame must have the same length as the input DataFrame,
it is possible to provide several input functions:
In [4]:
s = pd.Series(range(4))
s
Out[4]:
0 0
1 1
2 2
3 3
dtype: int64
In [5]:
s.transform([np.sqrt, np.exp])
Out[5]:
sqrt exp
0 0.000000 1.000000
1 1.000000 2.718282
2 1.414214 7.389056
3 1.732051 20.085537
5. Explain in detail about the pivot table using python.
Most people likely have experience with pivot tables in Excel. Pandas provides a similar function
called (appropriately enough) pivot_table. While it is exceedingly useful, I frequently find myself
struggling to remember how to use the syntax to format the output for my needs. This article will
focus on explaining the pandas pivot_table function and how to use it for your data analysis.
As an added bonus, I've created a simple cheat sheet that summarizes the pivot_table. You can
find it at the end of this post and I hope it serves as a useful reference. Let me know if it
is helpful.
The Data
One of the challenges with using the pandas pivot_table is making sure we understand our data
and what questions we are trying to answer with the pivot table. It is a seemingly simple function
but can produce very powerful analysis very quickly.
In this scenario, I’m going to be tracking a sales pipeline (also called funnel). The basic problem
is that some sales cycles are very long (think “enterprise software”, capital equipment, etc.) and
management wants to understand it in more detail throughout the year.
Many companies will have CRM tools or other software that sales uses to track the process. While
they may have useful tools for analyzing the data, inevitably someone will export the data to Excel
and use a PivotTable to summarize the data.
import pandas as pd
import numpy as np
Version Warning
The pivot_table API has changed over time so please make sure you have a recent version of
pandas ( > 0.15) installed for this example to work. This example also uses the category data
type, which requires a recent version as well.
Read in our sales funnel data into our DataFrame
df = pd.read_excel("../in/sales-funnel.xlsx")
df.head()
For convenience's sake, let's define the status column as a category and set the order we want
to view.
This isn’t strictly required but helps us keep the order we want as we work through analyzing
the data.
df["Status"] = df["Status"].astype("category")
df["Status"].cat.set_categories(["won","pending","presented","declined"],inplace=True)
As we build up the pivot table, I think it’s easiest to take it one step at a time. Add items and check
each step to verify we are getting the results we expect. Don’t be afraid to play with the order and
the variables to see what presentation makes the most sense for our needs.
The simplest pivot table must have a dataframe and an index . In this case, let’s use the Name
as our index.
pd.pivot_table(df,index=["Name"])
We can have multiple indexes as well. In fact, most of the pivot_table args can take multiple
values via a list.
pd.pivot_table(df,index=["Name","Rep","Manager"])
This is interesting but not particularly useful. What we probably want to do is look at this by
Manager and Rep. It’s easy enough to do by changing the index .
pd.pivot_table(df,index=["Manager","Rep"])
We can see that the pivot table is smart enough to start aggregating the data and summarizing it
by grouping the reps with their managers. Now we start to get a glimpse of what a pivot table can
do for us.
For this purpose, the Account and Quantity columns aren't really useful. Let's remove them by
explicitly defining the columns we care about using the values field.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"])
The price column automatically averages the data but we can do a count or a sum. Adding them
is simple using aggfunc and np.sum .
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)
aggfunc can take a list of functions. Let’s try a mean using the numpy mean function and len to
get a count.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=[np.mean,len])
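Since the sales-funnel spreadsheet isn't included here, the same calls can be sketched on a small made-up frame (the names and prices below are invented for illustration; string aggfunc names behave like np.mean and len):

```python
import pandas as pd

# made-up stand-in for the sales funnel data
df = pd.DataFrame({
    'Manager': ['Debra', 'Debra', 'Fred', 'Fred'],
    'Rep': ['Craig', 'Craig', 'Wendy', 'Dan'],
    'Price': [30000, 10000, 65000, 40000]})

# default aggregation is the mean of each numeric column
means = pd.pivot_table(df, index=['Manager', 'Rep'], values=['Price'])
print(means)

# multiple aggregations: a sum and a row count per group
table = pd.pivot_table(df, index=['Manager', 'Rep'], values=['Price'],
                       aggfunc=['sum', 'count'])
print(table)
```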