FDS UNIT 4 JPR
Python
Python is a general purpose, dynamic, high-level, and interpreted programming language
Python is an easy-to-learn yet powerful and versatile scripting language, which makes it
attractive for application development.
Python's syntax and dynamic typing with its interpreted nature make it an ideal language
for scripting and rapid application development.
Python supports multiple programming paradigms, including object-oriented, imperative,
functional, and procedural programming styles.
Python is not tied to one particular area, such as web programming. It is
known as a multipurpose programming language because it can be used for web development,
enterprise applications, 3D CAD, and more.
Python Arrays
• An array is defined as a collection of items that are stored at contiguous memory locations.
• It is a container which can hold a fixed number of items, and these items should be of the same
type.
Creation of Arrays
An array can be created in Python by importing the array module into the program:
from array import *
arrayName = array(typecode, [initializers])
Accessing array elements
Access the array elements using the respective indices of those elements.
import array as arr
a = arr.array('i', [2, 4, 6, 8])
print("First element:", a[0])         # 2
print("Second element:", a[1])        # 4
print("Second last element:", a[-1])  # 8
Length of an Array
The length of an array is the number of elements present in it. The len() function returns an
integer value that is equal to the total number of elements present in that array.
Syntax: len(array_name)
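For example, a minimal sketch using the same integer array created above:
from array import array

a = array('i', [2, 4, 6, 8])
print(len(a))   # 4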
Inserting Values
Insert new data elements at a specific position by using the insert() method and specifying the
index. In the example below, a new row is inserted at index position 2.
# T is a nested list (a list of rows); insert() adds a new row at index 2
T = [[11, 12, 5, 2], [15, 6, 10], [10, 8, 12, 5], [12, 15, 8, 6]]
T.insert(2, [0, 5, 11, 13, 6])
for r in T:
    for c in r:
        print(c, end=" ")
    print()
Updating Values
We can update the entire inner array or some specific data elements of the inner array by
reassigning the values using the array index.
T = [[11, 12, 5, 2], [15, 6, 10], [10, 8, 12, 5], [12, 15, 8, 6]]
T[2] = [11, 9]    # replace the entire third row
T[0][3] = 7       # update a single element of the first row
for r in T:
for c in r:
print(c,end = " ")
print()
NumPy
NumPy is a Python library used for working with arrays.
It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
NumPy was created in 2005 by Travis Oliphant.
It is an open source project and you can use it freely.
NumPy stands for Numerical Python. The array object in NumPy is called ndarray; it provides a
lot of supporting functions that make working with ndarray very easy.
Arrays are very frequently used in data science, where speed and resources are very important.
NumPy Faster Than Lists
NumPy arrays are stored at one continuous place in memory, unlike lists, so processes can
access and manipulate them very efficiently. This is the main reason why NumPy is faster than
lists. It is also optimized to work with the latest CPU architectures.
np.array([1, 2, 3])                 # 1-D array
np.array([(1, 2, 3), (4, 5, 6)])    # 2-D array
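As a rough illustration of the speed difference, the sketch below compares summing one million numbers with a plain Python list and with a NumPy array using the timeit module; the exact timings will vary by machine, and the array sizes chosen here are only for demonstration.
import timeit
import numpy as np

py_list = list(range(1_000_000))
np_arr = np.arange(1_000_000)

# Time summing one million numbers with a plain list vs. a NumPy array
list_time = timeit.timeit(lambda: sum(py_list), number=10)
numpy_time = timeit.timeit(lambda: np_arr.sum(), number=10)
print("list sum: ", list_time)
print("numpy sum:", numpy_time)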
Linspace
The linspace() function returns evenly spaced values over the given interval. The following
example returns 10 evenly spaced values over the interval 5-15.
Example
import numpy as np
a=np.linspace(5,15,10) #prints 10 values which are evenly spaced over the given interval 5-15
print(a)
Output:
[ 5. 6.11111111 7.22222222 8.33333333 9.44444444 10.55555556
11.66666667 12.77777778 13.88888889 15. ]
Finding the maximum, minimum, and sum of the array elements
NumPy provides the max(), min(), and sum() functions (also available as array methods), which are used to find the
maximum, minimum, and sum of the array elements respectively.
Example
import numpy as np
a = np.array([1,2,3,10,15,4])
print("The array:",a)
print("The maximum element:",a.max())
print("The minimum element:",a.min())
print("The sum of the elements:",a.sum())
Output:
The array: [ 1  2  3 10 15  4]
The maximum element: 15
The minimum element: 1
The sum of the elements: 35
To calculate the maximum element among each column, the minimum element among each
row, and the sum of all the row elements, consider the following example.
Example
import numpy as np
a = np.array([[1,2,30],[10,15,4]])
print("The array:",a)
print("The maximum elements of columns:",a.max(axis = 0)) print("The minimum element of
rows",a.min(axis = 1)) print("The sum of all rows",a.sum(axis = 1))
Output:
The array: [[ 1  2 30]
 [10 15  4]]
The maximum elements of columns: [10 15 30]
The minimum element of rows [1 4]
The sum of all rows [33 29]
Finding square root and standard deviation
The sqrt() and std() functions associated with the numpy array are used to find the square
root and standard deviation of the array elements respectively.
Standard deviation is a measure of how much the elements of the array vary from the mean value
of the numpy array.
Example
import numpy as np
a = np.array([[1,2,30],[10,15,4]])
print(np.sqrt(a))
print(np.std(a))
Output:
[[1. 1.41421356 5.47722558]
[3.16227766 3.87298335 2. ]]
10.044346115546242
Array Concatenation
NumPy provides vertical stacking (vstack) and horizontal stacking (hstack), which allow us to
concatenate two multi-dimensional arrays vertically or horizontally.
Example
import numpy as np
a = np.array([[1,2,30],[10,15,4]])
b = np.array([[1,2,3],[12, 19, 29]])
print("Arrays vertically concatenated\n",np.vstack((a,b))); print("Arrays horizontally
concatenated\n",np.hstack((a,b))) Output:
Arrays vertically concatenated [[ 1 2 30]
[10 15 4]
[ 1 2 3]
[12 19 29]]
Arrays horizontally concatenated [[ 1 2 30 1 2 3]
[10 15 4 12 19 29]]
Numpy.empty
As the name suggests, the empty routine is used to create an uninitialized array of the specified
shape and data type. The syntax is given below.
numpy.empty(shape, dtype = float, order = 'C')
It accepts the following parameters.
shape: The desired shape of the array.
dtype: The data type of the array items. The default is float.
order: The default order is the C-style row-major order. It can be set to 'F' for Fortran-style
column-major order.
Example
import numpy as np
arr = np.empty((3,2), dtype = int)
print(arr)
Output:
[[ 140482883954664 36917984]
[ 140482883954648 140482883954648]
[6497921830368665435 172026472699604272]]
NumPy.Zeros
This routine is used to create the numpy array with the specified shape where each numpy
array item is initialized to 0.
The syntax is given below.
numpy.zeros(shape, dtype = float, order = 'C')
shape: The desired shape of the array.
dtype: The data type of the array items. The default is float.
order: The default order is the C-style row-major order. It can be set to 'F' for Fortran-style
column-major order.
Example
import numpy as np
arr = np.zeros((3,2), dtype = int)
print(arr)
Output:
[[0 0]
[0 0]
[0 0]]
NumPy.ones
This routine is used to create the numpy array with the specified shape where each numpy
array item is initialized to 1.
The syntax to use this routine is given below.
numpy.ones(shape, dtype = None, order = 'C')
It accepts the following parameters.
shape: The desired shape of the array.
dtype: The data type of the array items. The default is float.
order: The default order is the C-style row-major order. It can be set to 'F' for Fortran-style
column-major order.
Example
import numpy as np
arr = np.ones((3,2), dtype = int)
print(arr)
Output:
[[1 1]
[1 1]
[1 1]]
Numpy array from existing data
NumPy provides ways to create an array from existing data.
numpy.asarray
This routine is used to create an array from existing data in the form of lists or tuples.
This routine is useful in the scenario where we need to convert a python sequence into the
numpy array object.
numpy.asarray(sequence, dtype = None, order = None)
It accepts the following parameters.
sequence: The Python sequence which is to be converted into an array.
dtype: The data type of each item of the array.
order: It can be set to 'C' or 'F'. The default is 'C'.
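A small sketch of asarray in use, converting a list and a tuple into ndarrays:
import numpy as np

lst = [1, 2, 3, 4]
tup = (5, 6, 7, 8)

a = np.asarray(lst)                # from a list
b = np.asarray(tup, dtype=float)   # from a tuple, forcing float items
print(a)   # [1 2 3 4]
print(b)   # [5. 6. 7. 8.]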
numpy.frombuffer
This function is used to create an array from the data in a specified buffer.
numpy.frombuffer(buffer, dtype = float, count = -1, offset = 0)
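A minimal example, assuming a raw bytes object as the buffer:
import numpy as np

buf = b'\x01\x02\x03\x04'                 # 4 raw bytes
arr = np.frombuffer(buf, dtype=np.uint8)  # interpret each byte as an unsigned integer
print(arr)   # [1 2 3 4]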
numpy.fromiter
This routine is used to create a ndarray by using an iterable object. It returns a one-dimensional
ndarray object.
Example
import numpy as np

lst = [0, 2, 4, 6]   # use lst to avoid shadowing the built-in name list
it = iter(lst)
x = np.fromiter(it, dtype = float)
print(x)
print(type(x))
Output:
[0. 2. 4. 6.]
<class 'numpy.ndarray'>
An array can have any number of dimensions. When the array is created, you can define the
number of dimensions by using the ndmin argument.
Example
Create an array with 5 dimensions and verify that it has 5 dimensions:
import numpy as np
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('number of dimensions :', arr.ndim)
Vector
In NumPy, a vector is a one-dimensional array, and it behaves much like a Python list.
The vector dot product is performed between two same-length vectors and
returns a single scalar value. The .dot() method is used to perform the dot product.
import numpy as np

list1 = [10, 20, 30, 40, 50]
list2 = [5, 2, 4, 3, 1]
vtr1 = np.array(list1)
vtr2 = np.array(list2)
print("We create vector from a list 1:")
print(vtr1)
print("We create a vector from a list 2:")
print(vtr2)
vtr_product = vtr1.dot(vtr2)
print("Dot product of two vectors: ", vtr_product)
Output:
We create vector from a list 1:
[10 20 30 40 50]
We create a vector from a list 2:
[5 2 4 3 1]
Dot product of two vectors:  380
Fancy Indexing
Fancy indexing allows you to use one array as an index into another array.
It can be used to pick values from an array, and also to implement table lookup algorithms.
Example:
a = np.array([-1.0, 2.0, -3.0, 4.0])
idx = np.array([2, 1, 1, 0, 3])
r = a[idx]
r contains:
[-3. 2. 2. -1. 4.]
When using fancy indexing, the shape of the result reflects the shape of the index arrays
rather than the shape of the array being indexed:
import numpy as np
rand = np.random.RandomState(42)
x = rand.randint(100, size=10)
print(x)
[51 92 14 71 60 20 82 86 74 74]
ind = np.array([[3, 7], [4, 5]])
x[ind]
array([[71, 86],
       [60, 20]])
Fancy indexing also works in multiple dimensions. Consider the following array:
X = np.arange(12).reshape((3, 4))
X
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
Like with standard indexing, the first index refers to the row, and the second to the column:
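The example that belongs here is not shown in the notes; a minimal sketch using index arrays row and col (names chosen here for illustration) would look like this:
import numpy as np

X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
# The first index array picks the row, the second picks the column
print(X[row, col])   # [ 2  5 11]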
Combined Indexing
For even more powerful operations, fancy indexing can be combined with the other indexing schemes.
X = np.arange(12).reshape((3, 4))
print(X)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
We can combine fancy and simple indices:
X[2, [2, 0, 1]]
array([10, 8, 9])
We can also combine fancy indexing with slicing:
X[1:, [2, 0, 1]]
array([[ 6, 4, 5],
[10, 8, 9]])
And we can combine fancy indexing with masking (row here is the index array defined in the sketch above):
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]
array([[ 0, 2],
[ 4, 6],
[ 8, 10]])
Transpose a Matrix
The .T attribute is used to find the transpose of a matrix.
Example
import numpy as np
# A matrix
mx = np.array([[5, 10], [15, 20]])
print("Matrix =\n",mx)
print ("\nThe Transpose =") print (mx.T)
Comparisons, Masks, and Boolean Logic
This section covers the use of Boolean masks to examine and manipulate values within
NumPy arrays. Masking comes up when you want to extract, modify, count, or otherwise
manipulate values in an array based on some criterion: for example, you might wish to
count all values greater than a certain value, or perhaps remove all outliers that are above
some threshold. In NumPy, Boolean masking is often the most efficient way to accomplish
these types of tasks.
Comparison Operators as ufuncs
In Computation on NumPy Arrays: Universal Functions we introduced ufuncs, and focused
in particular on arithmetic operators. We saw that using +, -, *, /, and others on arrays
leads to element-wise operations. NumPy also implements comparison operators such
as < (less than) and > (greater than) as element-wise ufuncs. The result of these
comparison operators is always an array with a Boolean data type. All six of the standard
comparison operations are available:
x = np.array([1, 2, 3, 4, 5])
x < 3 # less than
array([ True, True, False, False, False], dtype=bool)
x > 3 # greater than
array([False, False, False, True, True], dtype=bool)
x <= 3 # less than or equal
array([ True, True, True, False, False], dtype=bool)
x >= 3 # greater than or equal
array([False, False, True, True, True], dtype=bool)
x != 3 # not equal
array([ True, True, False, True, True], dtype=bool)
x == 3 # equal
array([False, False, True, False, False], dtype=bool)
It is also possible to do an element-wise comparison of two arrays, and to include
compound expressions:
(2 * x) == (x ** 2)
array([False, True, False, False, False], dtype=bool)
As in the case of arithmetic operators, the comparison operators are implemented as
ufuncs in NumPy; for example, when you write x < 3, internally NumPy uses np.less(x, 3).
A summary of the comparison operators and their equivalent ufunc is shown here:
Operator   Equivalent ufunc
==         np.equal
!=         np.not_equal
<          np.less
<=         np.less_equal
>          np.greater
>=         np.greater_equal
Just as in the case of arithmetic ufuncs, these will work on arrays of any size and shape.
Here is a two-dimensional example:
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x
array([[5, 0, 3, 3],
[7, 9, 3, 5],
[2, 4, 7, 6]])
x<6
array([[ True, True, True, True],
[False, False, True, True],
[ True, True, False, False]], dtype=bool)
In each case, the result is a Boolean array, and NumPy provides a number of
straightforward patterns for working with these Boolean results.
Working with Boolean Arrays
Given a Boolean array, there are a host of useful operations you can do. We'll work with x,
the two-dimensional array we created earlier.
print(x)
[[5 0 3 3]
[7 9 3 5]
[2 4 7 6]]
Counting entries
To count the number of True entries in a Boolean array, np.count_nonzero is useful:
# how many values less than 6?
np.count_nonzero(x < 6)
8
We see that there are eight array entries that are less than 6. Another way to get at this
information is to use np.sum; in this case, False is interpreted as 0, and True is interpreted
as 1:
np.sum(x < 6)
8
The benefit of sum() is that like with other NumPy aggregation functions, this summation
can be done along rows or columns as well:
# how many values less than 6 in each row?
np.sum(x < 6, axis=1)
array([4, 2, 2])
This counts the number of values less than 6 in each row of the matrix.
If we're interested in quickly checking whether any or all the values are true, we can use
(you guessed it) np.any or np.all:
# are there any values greater than 8?
np.any(x > 8)
True
# are there any values less than zero?
np.any(x < 0)
False
# are all values less than 10?
np.all(x < 10)
True
# are all values equal to 6?
np.all(x == 6)
False
np.all and np.any can be used along particular axes as well. For example:
# are all values in each row less than 8?
np.all(x < 8, axis=1)
array([ True, False, True], dtype=bool)
Here all the elements in the first and third rows are less than 8, while this is not the case
for the second row.
Finally, a quick warning: as mentioned in Aggregations: Min, Max, and Everything In
Between, Python has built-in sum(), any(), and all() functions. These have a different
syntax than the NumPy versions, and in particular will fail or produce unintended results
when used on multidimensional arrays. Be sure that you are using np.sum(), np.any(),
and np.all() for these examples!
Boolean Arrays as Masks
In the preceding section we looked at aggregates computed directly on Boolean arrays. A
more powerful pattern is to use Boolean arrays as masks, to select particular subsets of
the data themselves. Returning to our x array from before, suppose we want an array of all
values in the array that are less than, say, 5:
x
array([[5, 0, 3, 3],
[7, 9, 3, 5],
[2, 4, 7, 6]])
We can obtain a Boolean array for this condition easily, as we've already seen:
x<5
array([[False, True, True, True],
[False, False, True, False],
[ True, True, False, False]], dtype=bool)
Now to select these values from the array, we can simply index on this Boolean array; this
is known as a masking operation:
x[x < 5]
array([0, 3, 3, 3, 2, 4])
What is returned is a one-dimensional array filled with all the values that meet this
condition; in other words, all the values in positions at which the mask array is True.
We are then free to operate on these values as we wish. For example, we can compute
some relevant statistics on our Seattle rain data:
# construct a mask of all rainy days
rainy = (inches > 0)
# construct a mask of all summer days (June 21st is the 172nd day)
days = np.arange(365)
summer = (days > 172) & (days < 262)
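The inches array itself is not defined in these notes; it comes from a daily-precipitation dataset in the original example. A sketch of how the masks can be combined with aggregates, using randomly generated stand-in data rather than the real measurements:
import numpy as np

# Hypothetical stand-in for the real data: 365 daily rainfall values in inches
rng = np.random.RandomState(0)
inches = rng.exponential(scale=0.1, size=365)

rainy = (inches > 0)                   # mask of all rainy days
days = np.arange(365)
summer = (days > 172) & (days < 262)   # mask of all summer days

print("Median precip on rainy days: ", np.median(inches[rainy]))
print("Median precip on summer days:", np.median(inches[summer]))
print("Max precip on summer days:   ", np.max(inches[summer]))
print("Median precip on rainy non-summer days:",
      np.median(inches[rainy & ~summer]))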
Structured Arrays
NumPy also supports structured arrays, in which each record (element) is made up of named
fields, each with its own data type. This lets a single array hold table-like data, such as a
student record with a name, an age, and a weight.
Example 1:
# Python program to demonstrate
# Structured array
import numpy as np
a = np.array([('Sana', 2, 21.0), ('Mansi', 7, 29.0)],
dtype=[('name', (np.str_, 10)), ('age', np.int32), ('weight', np.float64)])
print(a)
Output:
[('Sana', 2, 21.0) ('Mansi', 7, 29.0)]
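Once created, whole fields can be accessed by name and individual records by position; a short sketch repeating the same array:
import numpy as np

a = np.array([('Sana', 2, 21.0), ('Mansi', 7, 29.0)],
             dtype=[('name', (np.str_, 10)), ('age', np.int32), ('weight', np.float64)])

print(a['name'])   # ['Sana' 'Mansi']
print(a['age'])    # [2 7]
print(a[0])        # ('Sana', 2, 21.)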
Data Manipulation with Pandas
Pandas is an open-source library used for everything from data manipulation to data analysis.
It is a very powerful, flexible, and easy-to-use tool, and it can be imported using import pandas
as pd. Pandas deals essentially with data in 1-D and 2-D arrays, although it handles the two
differently: in pandas, a 1-D array is called a Series, and a DataFrame is simply
a 2-D array (a table). The dataset used here is country_code.csv.
Below are various operations used to manipulate the dataframe:
First, import the library used for data manipulation, i.e. pandas, then read the dataset into a
dataframe:
# import module
import pandas as pd
# assign dataset
df = pd.read_csv("country_code.csv")
# display
print("Type-", type(df))
df
We can also inspect the dataframe by using the head() function, which takes an argument
(n), the number of rows to be displayed.
df.head(10)
Counting the rows and columns in a DataFrame using the shape attribute (not a method). It returns
the number of rows and columns enclosed in a tuple.
df.shape
Summary statistics of the DataFrame using the describe() method.
df.describe()
Dropping the missing values in a DataFrame can be done using the dropna() method; by default it
removes all rows containing NaN values.
df.dropna()
df.dropna(axis=1)- This will drop all the columns with any missing values.
Merging DataFrames using merge(), arguments passed are the dataframes to be merged
along with the column name.
df1 = pd.read_csv("country_code.csv")
merged_col = pd.merge(df, df1, on='Name')
merged_col
Renaming the columns of dataframe using rename(), arguments passed are the columns
to be renamed & inplace.
country_code = df.rename(columns={'Name': 'CountryName',
'Code': 'CountryCode'},
inplace=False)
country_code
Creating a dataframe manually (the Score values were omitted in the original notes, so the numbers below are illustrative):
student = pd.DataFrame({'Name': ['Rohan', 'Rahul', 'Gaurav',
                                 'Ananya', 'Vinay', 'Rohan',
                                 'Vivek', 'Vinay'],
                        'Score': [76, 69, 70, 88, 79, 64, 62, 57]})
Sorting the rows by Score, and then by Name and Score together:
student.sort_values(by=['Score'], ascending=True)
student.sort_values(by=['Name', 'Score'],
                    ascending=[True, False])
Creating another column in the DataFrame: here we create a column named Percentage,
which calculates each student's share of the total score using the aggregate function sum().
student['Percentage'] = (student['Score'] / student['Score'].sum()) * 100
student
Selecting DataFrame rows using logical operators:
# Selecting rows where score is
# greater than 70
print(student[student.Score>70])
# Selecting rows where score is greater than 60
# OR less than 70
print(student[(student.Score>60) | (student.Score<70)])
Indexing & Slicing :
Here .loc is label base & .iloc is integer position based methods used for slicing &
indexing of data.
# Printing five rows with name column only
# i.e. printing first 5 student names.
print(student.loc[0:4, 'Name'])
# Reading Dataframe
student
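A matching .iloc sketch using integer positions rather than labels (note that the end position of an .iloc slice is exclusive):
# Printing the first five rows with the first column (Name) only,
# using integer positions instead of labels
print(student.iloc[0:5, 0])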
Hierarchical indexing
To enhance the capabilities of data processing, we need indexing that helps to sort and select
data based on labels. This is where hierarchical indexing comes into the picture: it is an
essential feature of pandas that lets us use multiple index levels.
Creating multiple index
In Hierarchical indexing, we have to create multiple indexes for the data. This example
creates a series with multiple indexes.
import pandas as pd
info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],
index = [['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']])
info
We have taken two levels of index here, i.e. (x, y) and (obj1, ..., obj4), and can see the index
by using the 'index' attribute.
info.index
Output:
MultiIndex(levels=[['x', 'y'], ['obj1', 'obj2', 'obj3', 'obj4']],
labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]])
Partial indexing
Partial indexing can be defined as a way to choose a particular index level from a hierarchical
index.
The code below extracts the 'y' group from the data:
import pandas as pd
info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],
index = [['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']])
info['y']
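Output:
obj1    19
obj2    32
obj3    34
obj4    27
dtype: int64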
>>> climate_precip.head()
STATION ... DLY-SNOW-PCTALL-GE050TI
0 GHCND:USC00049099 ... -9999
1 GHCND:USC00049099 ... -9999
2 GHCND:USC00049099 ... -9999
3 GHCND:USC00049099 ... 0
4 GHCND:USC00049099 ... 0
Here, you used .head() to get the first five rows of each DataFrame. Make sure to try this on
your own, either with the interactive Jupyter Notebook or in your console, so that you can
explore the data in greater depth.
Next, take a quick look at the dimensions of the two DataFrames:
>>>
>>> climate_temp.shape
(127020, 21)
>>> climate_precip.shape
(151110, 29)
Note that .shape is a property of DataFrame objects that tells you the dimensions of the
DataFrame. For climate_temp, the output of .shape says that the DataFrame has 127,020
rows and 21 columns.
Inner Join
In this example, you’ll use merge() with its default arguments, which will result in an inner
join. Remember that in an inner join, you’ll lose rows that don’t have a match in the other
DataFrame’s key column.
With the two datasets loaded into DataFrame objects, you’ll select a small slice of the
precipitation dataset and then use a plain merge() call to do an inner join. This will result in
a smaller, more focused dataset:
>>>
>>> precip_one_station = climate_precip.query("STATION == 'GHCND:USC00045721'")
>>> precip_one_station.head()
STATION ... DLY-SNOW-PCTALL-GE050TI
1460 GHCND:USC00045721 ... -9999
1461 GHCND:USC00045721 ... -9999
1462 GHCND:USC00045721 ... -9999
1463 GHCND:USC00045721 ... -9999
1464 GHCND:USC00045721 ... -9999
Here you’ve created a new DataFrame called precip_one_station from
the climate_precip DataFrame, selecting only rows in which the STATION field
is "GHCND:USC00045721".
If you check the shape attribute, then you’ll see that it has 365 rows. When you do the
merge, how many rows do you think you’ll get in the merged DataFrame? Remember that
you’ll be doing an inner join:
>>>
>>> inner_merged = pd.merge(precip_one_station, climate_temp)
>>> inner_merged.head()
STATION STATION_NAME ... DLY-HTDD-BASE60 DLY-HTDD-NORMAL
0 GHCND:USC00045721 MITCHELL CAVERNS CA US ... 14 19
1 GHCND:USC00045721 MITCHELL CAVERNS CA US ... 14 19
2 GHCND:USC00045721 MITCHELL CAVERNS CA US ... 14 19
3 GHCND:USC00045721 MITCHELL CAVERNS CA US ... 14 19
4 GHCND:USC00045721 MITCHELL CAVERNS CA US ... 14 19
>>> inner_merged.shape
(365, 47)
If you guessed 365 rows, then you were correct! This is because merge() defaults to an
inner join, and an inner join will discard only those rows that don’t match. Because all of
your rows had a match, none were lost. You should also notice that there are many more
columns now: 47 to be exact.
With merge(), you also have control over which column(s) to join on. Let’s say that you
want to merge both entire datasets, but only on Station and Date since the combination of
the two will yield a unique value for each row. To do so, you can use the on parameter:
>>>
>>> inner_merged_total = pd.merge(
... climate_temp, climate_precip, on=["STATION", "DATE"]
... )
>>> inner_merged_total.shape
(123005, 48)
You can specify a single key column with a string or multiple key columns with a list. This
results in a DataFrame with 123,005 rows and 48 columns.
Why 48 columns instead of 47? Because you specified the key columns to join on, pandas
doesn’t try to merge all mergeable columns. This can result in “duplicate” column names,
which may or may not have different values.
“Duplicate” is in quotation marks because the column names will not be an exact match. By
default, they are appended with _x and _y. You can also use the suffixes parameter to
control what’s appended to the column names.
To prevent surprises, all the following examples will use the on parameter to specify the
column or columns on which to join.
Outer Join
Here, you’ll specify an outer join with the how parameter. In an outer join—also known as
a full outer join—all rows from both DataFrames will be present in the new DataFrame.
If a row doesn’t have a match in the other DataFrame based on the key column(s), then you
won’t lose the row like you would with an inner join. Instead, the row will be in the merged
DataFrame, with NaN values filled in where appropriate.
This is best illustrated in an example:
>>>
>>> outer_merged = pd.merge(
... precip_one_station, climate_temp, how="outer", on=["STATION", "DATE"]
... )
>>> outer_merged.shape
(127020, 48)
If you remember from when you checked the .shape attribute of climate_temp, then you’ll
see that the number of rows in outer_merged is the same. With an outer join, you can
expect to have the same number of rows as the larger DataFrame. That’s because no rows
are lost in an outer join, even when they don’t have a match in the other DataFrame.
Left Join
In this example, you’ll specify a left join—also known as a left outer join—with
the how parameter. Using a left outer join will leave your new merged DataFrame with all
rows from the left DataFrame, while discarding rows from the right DataFrame that don’t
have a match in the key column of the left DataFrame.
You can think of this as a half-outer, half-inner merge. The example below shows you this in
action:
>>>
>>> left_merged = pd.merge(
... climate_temp, precip_one_station, how="left", on=["STATION", "DATE"]
... )
>>> left_merged.shape
(127020, 48)
left_merged has 127,020 rows, matching the number of rows in the left
DataFrame, climate_temp. To prove that this only holds for the left DataFrame, run the
same code, but change the position of precip_one_station and climate_temp:
>>>
>>> left_merged_reversed = pd.merge(
... precip_one_station, climate_temp, how="left", on=["STATION", "DATE"]
... )
>>> left_merged_reversed.shape
(365, 48)
This results in a DataFrame with 365 rows, matching the number of rows
in precip_one_station.
Right Join
The right join, or right outer join, is the mirror-image version of the left join. With this
join, all rows from the right DataFrame will be retained, while rows in the left DataFrame
without a match in the key column of the right DataFrame will be discarded.
To demonstrate how right and left joins are mirror images of each other, in the example
below you’ll recreate the left_merged DataFrame from above, only this time using a right
join:
>>>
>>> right_merged = pd.merge(
... precip_one_station, climate_temp, how="right", on=["STATION", "DATE"]
... )
>>> right_merged.shape
(127020, 48)
Here, you simply flipped the positions of the input DataFrames and specified a right join.
When you inspect right_merged, you might notice that it’s not exactly the same
as left_merged. The only difference between the two is the order of the columns: the first
input’s columns will always be the first in the newly formed DataFrame.
merge() is the most complex of the pandas data combination tools. It’s also the foundation
on which the other tools are built. Its complexity is its greatest strength, allowing you to
combine datasets in every which way and to generate new insights into your data.
On the other hand, this complexity makes merge() difficult to use without an intuitive grasp
of set theory and database operations. In this section, you’ve learned about the various data
merging techniques, as well as many-to-one and many-to-many merges, which ultimately
come from set theory. For more information on set theory, check out Sets in Python.
Now, you’ll look at .join(), a simplified version of merge().
pandas .join(): Combining Data on a Column or Index
While merge() is a module function, .join() is an instance method that lives on your
DataFrame. This enables you to specify only one DataFrame, which will join the DataFrame
you call .join() on.
Under the hood, .join() uses merge(), but it provides a more efficient way to join
DataFrames than a fully specified merge() call. Before diving into the options available to
you, take a look at this short example:
>>>
>>> precip_one_station.join(
... climate_temp, lsuffix="_left", rsuffix="_right"
... ).shape
(365, 50)
With the indices visible, you can see a left join happening here,
with precip_one_station being the left DataFrame. You might notice that this example
provides the parameters lsuffix and rsuffix. Because .join() joins on indices and doesn’t
directly merge DataFrames, all columns—even those with matching names—are retained
in the resulting DataFrame.
Now flip the previous example around and instead call .join() on the larger DataFrame:
>>>
>>> climate_temp.join(
... precip_one_station, lsuffix="_left", rsuffix="_right"
... ).shape
(127020, 50)
Notice that the DataFrame is larger, but data that doesn’t exist in the smaller
DataFrame, precip_one_station, is filled in with NaN values.
How to Use .join()
By default, .join() will attempt to do a left join on indices. If you want to join on columns
like you would with merge(), then you’ll need to set the columns as indices.
Like merge(), .join() has a few parameters that give you more flexibility in your joins.
However, with .join(), the list of parameters is relatively short:
other is the only required parameter. It defines the other DataFrame to join. You
can also specify a list of DataFrames here, allowing you to combine a number of
datasets in a single .join() call.
on specifies an optional column or index name for the left DataFrame
(climate_temp in the previous example) to join the other DataFrame’s index. If it’s
set to None, which is the default, then you’ll get an index-on-index join.
how has the same options as how from merge(). The difference is that it’s index-
based unless you also specify columns with on.
lsuffix and rsuffix are similar to suffixes in merge(). They specify a suffix to add to
any overlapping columns but have no effect when passing a list
of other DataFrames.
sort can be enabled to sort the resulting DataFrame by the join key.
Examples
In this section, you’ll see examples showing a few different use cases for .join(). Some will
be simplifications of merge() calls. Others will be features that set .join() apart from the
more verbose merge() calls.
Since you already saw a short .join() call, in this first example you’ll attempt to recreate
a merge() call with .join(). What will this require? Take a second to think about a possible
solution, and then look at the proposed solution below:
>>>
>>> inner_merged_total = pd.merge(
... climate_temp, climate_precip, on=["STATION", "DATE"]
... )
>>> inner_merged_total.shape
(123005, 48)
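The notes repeat the merge() call to be recreated, but the .join() version itself is missing here. A sketch of how it could be done, setting the key columns as the index of the right DataFrame and joining on those columns (the suffix choices are an assumption, not the original code):
inner_joined_total = climate_temp.join(
    climate_precip.set_index(["STATION", "DATE"]),  # join against these key columns
    on=["STATION", "DATE"],
    how="inner",
    lsuffix="_x",
    rsuffix="_y",
)
This should produce the same rows as the merge() call above; the extra work of setting an index and supplying suffixes is exactly what merge() handles for you automatically.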
pandas concat(): Combining Data Across Rows or Columns
Another way to combine DataFrames is to concatenate them with concat(). To combine two
DataFrames side by side, you pass the axis parameter with a value of 1 or "columns":
concatenated = pandas.concat([df1, df2], axis="columns")
Note: This example assumes that your indices are the same between datasets. If they’re
different while concatenating along columns (axis 1), then by default the extra indices
(rows) will also be added, and NaN values will be filled in as applicable.
You’ll learn more about the parameters for concat() in the section below.
How to Use concat()
As you can see, concatenation is a simpler way to combine datasets. It’s often used to form
a single, larger set to do additional operations on.
Note: When you call concat(), a copy of all the data that you’re concatenating is made. You
should be careful with multiple concat() calls, as the many copies that are made may
negatively affect performance. Alternatively, you can set the optional copy parameter
to False
When you concatenate datasets, you can specify the axis along which you’ll concatenate.
But what happens with the other axis?
Nothing. By default, a concatenation results in a set union, where all data is preserved.
You’ve seen this with merge() and .join() as an outer join, and you can specify this with
the join parameter.
If you use this parameter, then the default is outer, but you also have the inner option,
which will perform an inner join, or set intersection.
As with the other inner joins you saw earlier, some data loss can occur when you do an
inner join with concat(). Only where the axis labels match will you preserve rows or
columns.
Note: Remember, the join parameter only specifies how to handle the axes that
you’re not concatenating along.
Since you learned about the join parameter, here are some of the other parameters
that concat() takes:
objs takes any sequence—typically a list—of Series or DataFrame objects to be
concatenated. You can also provide a dictionary. In this case, the keys will be used to
construct a hierarchical index.
axis represents the axis that you’ll concatenate along. The default value is 0, which
concatenates along the index, or row axis. Alternatively, a value of 1 will concatenate
horizontally, along columns. You can also use the string values "index" or "columns".
join is similar to the how parameter in the other techniques, but it only accepts the
values inner or outer. The default value is outer, which preserves data,
while inner would eliminate data that doesn’t have a match in the other dataset.
ignore_index takes a Boolean True or False value. It defaults to False. If True, then
the new combined dataset won’t preserve the original index values in the axis
specified in the axis parameter. This lets you have entirely new index values.
keys allows you to construct a hierarchical index. One common use case is to have a
new index while preserving the original indices so that you can tell which rows, for
example, come from which original dataset.
copy specifies whether you want to copy the source data. The default value is True.
If the value is set to False, then pandas won’t make copies of the source data.
This list isn’t exhaustive. You can find the complete, up-to-date list of parameters in
the pandas documentation.
Examples
First, you’ll do a basic concatenation along the default axis using the DataFrames that
you’ve been playing with throughout this tutorial:
>>>
>>> double_precip = pd.concat([precip_one_station, precip_one_station])
>>> double_precip.shape
(730, 29)
This one is very simple by design. Here, you created a DataFrame that is a double of a small
DataFrame that was made earlier. One thing to notice is that the indices repeat. If you want
a fresh, 0-based index, then you can use the ignore_index parameter:
>>>
>>> reindexed = pd.concat(
... [precip_one_station, precip_one_station], ignore_index=True
... )
>>> reindexed.index
RangeIndex(start=0, stop=730, step=1)
As noted before, if you concatenate along axis 0 (rows) but have labels in axis 1 (columns)
that don’t match, then those columns will be added and filled in with NaN values. This
results in an outer join:
>>>
>>> outer_joined = pd.concat([climate_precip, climate_temp])
>>> outer_joined.shape
(278130, 47)
With these two DataFrames, since you’re just concatenating along rows, very few columns
have the same name. That means you’ll see a lot of columns with NaN values.
To instead drop columns that have any missing data, use the join parameter with the
value "inner" to do an inner join:
>>>
>>> inner_joined = pd.concat([climate_temp, climate_precip], join="inner")
>>> inner_joined.shape
(278130, 3)
Using the inner join, you’ll be left with only those columns that the original DataFrames
have in common: STATION, STATION_NAME, and DATE.
You can also flip this by setting the axis parameter:
>>>
>>> inner_joined_cols = pd.concat(
... [climate_temp, climate_precip], axis="columns", join="inner"
... )
>>> inner_joined_cols.shape
(127020, 50)
Now you have only the rows that have data for all columns in both DataFrames. It’s no
coincidence that the number of rows corresponds with that of the smaller DataFrame.
Another useful trick for concatenation is using the keys parameter to create hierarchical
axis labels. This is useful if you want to preserve the indices or column names of the
original datasets but also want to add new ones:
>>>
>>> hierarchical_keys = pd.concat(
... [climate_temp, climate_precip], keys=["temp", "precip"]
... )
>>> hierarchical_keys.index
MultiIndex([( 'temp', 0),
( 'temp', 1),
...
('precip', 151108),
('precip', 151109)],
length=278130)
If you check on the original DataFrames, then you can verify whether the higher-level axis
labels temp and precip were added to the appropriate rows.
You’ve now learned the three most important techniques for combining data in pandas:
1. merge() for combining data on common columns or indices
2. .join() for combining data on a key column or an index
3. concat() for combining DataFrames across rows or columns
In addition to learning how to use these techniques, you also learned about set logic by
experimenting with the different ways to join your datasets. Additionally, you learned
about the most common parameters to each of the above techniques, and what arguments
you can pass to customize their output.
You saw these techniques in action on a real dataset obtained from the NOAA, which
showed you not only how to combine your data but also the benefits of doing so with
pandas' built-in techniques.
Grouping and Aggregating with Pandas
We are going to see grouping and aggregating using pandas. Grouping and aggregating
help us perform data analysis easily using various functions. These methods let us group
and summarize our data and make complex analysis comparatively easy.
Creating a sample dataset of marks of various subjects.
# import module
import pandas as pd
# Creating our dataset
df = pd.DataFrame([[9, 4, 8, 9],
[8, 10, 7, 6],
[7, 6, 8, 5]],
columns=['Maths', 'English',
'Science', 'History'])
# display dataset
print(df)
output:
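   Maths  English  Science  History
0      9        4        8        9
1      8       10        7        6
2      7        6        8        5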
Aggregation in Pandas
Aggregation in pandas provides various functions that perform a mathematical or logical
operation on our dataset and returns a summary of that function. Aggregation can be
used to get a summary of columns in our dataset like getting sum, minimum, maximum,
etc. from a particular column of our dataset. The function used for aggregation is agg(),
the parameter is the function we want to perform.
Some functions used in the aggregation are:
Function Description:
sum() :Compute sum of column values
min() :Compute min of column values
max() :Compute max of column values
mean() :Compute mean of column
size() :Compute column sizes
describe() :Generates descriptive statistics
first() :Compute first of group values
last() :Compute last of group values
count() :Compute count of column values
std() :Standard deviation of column
var() :Compute variance of column
sem() :Standard error of the mean of column
We use the agg() function to calculate the sum, min, and max of each column in our
dataset.
df.agg(['sum', 'min', 'max'])
Grouping in Pandas
Grouping is used to group data using some criteria from our dataset. It follows a split-
apply-combine strategy:
Splitting the data into groups based on some criteria.
Applying a function to each group independently.
Combining the results into a data structure.
Examples:
We use the groupby() function to group the data on the "Maths" column. It returns a
DataFrameGroupBy object as the result.
df.groupby(by=['Maths'])
Applying the groupby() function to group the data on the "Maths" column. To view the first row
of each formed group, use the first() function.
a = df.groupby('Maths')
a.first()
output:
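       English  Science  History
Maths
7            6        8        5
8           10        7        6
9            4        8        9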
First grouping based on "Maths", then within each group we are grouping based on "Science".
b = df.groupby(['Maths', 'Science'])
b.first()
output:
Implementation on a Dataset
Here we are using a dataset of diamond information.
# import module
import numpy as np
import pandas as pd
# reading csv file
dataset = pd.read_csv("diamonds.csv")
# printing first 5 rows
print(dataset.head(5))
Output:
dataset.groupby('cut').sum()
Output:
Here we are grouping using cut and color and getting the minimum value of all other columns
within each group.
dataset.groupby(['cut', 'color']).agg('min')
Output:
Here we are grouping using color and getting aggregate values like sum, mean, min,
etc. for the price column, as sketched below.
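The code for this step is missing from the notes; a plausible reconstruction, with the column and function names assumed from the surrounding text (including the prod column mentioned below), is:
# Group by color and aggregate the price column with several functions
dataset.groupby('color')['price'].agg(['sum', 'mean', 'min', 'max', 'prod'])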
Output:
We can see that in the prod (product, i.e. multiplication) column all values are inf; inf appears
when a numerical calculation exceeds the representable floating-point range, i.e. the result is
effectively infinite.
Pivot Table
Pandas.pivot_table()
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
fill_value=None, margins=False, dropna=True, margins_name='All') creates a spreadsheet-
style pivot table as a DataFrame.
Levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the
index and columns of the result DataFrame.
Parameters:
data : DataFrame
values : column to aggregate, optional
index: column, Grouper, array, or list of the previous
columns: column, Grouper, array, or list of the previous
aggfunc: function, list of functions, dict, default numpy.mean
-> If list of functions passed, the resulting pivot table will have hierarchical columns whose
top level are the function names.
-> If dict is passed, the key is column to aggregate and value is function or list of functions
fill_value[scalar, default None] : Value to replace missing values with
margins[boolean, default False] : Add all row / columns (e.g. for subtotal / grand totals)
dropna[boolean, default True] : Do not include columns whose entries are all NaN
margins_name[string, default 'All'] : Name of the row / column that will contain the
totals when margins is True.
Returns: DataFrame
# Create a simple dataframe
# importing pandas as pd
import pandas as pd
import numpy as np
# creating a dataframe
df = pd.DataFrame({'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'C': [27, 23, 21, 23, 24]})
df
# Simplest pivot table must have a dataframe
# and an index/list of index.
table = pd.pivot_table(df, index =['A', 'B'])
table
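A short follow-up sketch on the same df showing the values, columns, and aggfunc parameters; the particular choice of column C and of the mean/count aggregations here is only for illustration:
# Pivot on column B, aggregating the numeric column C with mean and count
table2 = pd.pivot_table(df, values='C', index=['A'], columns=['B'],
                        aggfunc=[np.mean, 'count'], fill_value=0)
print(table2)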
Question bank
Part - A