Prepared by
Senthil Kumar S
Assistant Professor,
Department of Information Technology,
Sri Ramakrishna Mission Vidyalaya
College of Arts and Science
(Autonomous), Coimbatore
Pandas:
- Introduction to pandas
- Data manipulation with pandas
- Operating on null values
- Hierarchical indexing
- Combining datasets
- Aggregation and grouping
- Manipulation of data with combined datasets using pandas
Pandas is a Python library used for working with and analyzing data sets.
A data frame is a structured representation of data.
The example below creates a data frame with 5 rows and 3 columns.

import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values
d = {'col1': [1, 2, 3, 4, 7],
     'col2': [4, 5, 6, 9, 5],
     'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print(df)
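A quick way to confirm the size and layout of a data frame is to inspect its shape and columns attributes. The sketch below reuses the same dictionary; the inspection calls are an illustration added here, not part of the original slide.

```python
import pandas as pd

d = {'col1': [1, 2, 3, 4, 7],
     'col2': [4, 5, 6, 9, 5],
     'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)

# shape is a (rows, columns) tuple
print(df.shape)

# Column labels come from the dictionary keys
print(list(df.columns))

# Selecting one column by label returns a Series
print(df['col2'].tolist())
```

Each list in the dictionary has five values, so the frame has five rows, and the three keys give three columns.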
import pandas as pd

df = pd.DataFrame({'name': ['Akshay', 'Mukesh', 'Deepak'],
                   'age': [22, 23, 21],
                   'country': ['india', 'india', 'us']})
print(df)

Output:
     name  age country
0  Akshay   22   india
1  Mukesh   23   india
2  Deepak   21      us
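import pandas

# A raw string (r"...") keeps the backslashes in the Windows path literal
df = pandas.read_csv(r"D:\SRMV CAS\ssk\csv\sdata.csv")
print(df)
# Select the row with index label 2 (all columns)
print(df.loc[2, :])

Output:
   regno     name  mark
0      1  senthil    76
1      2   jamuna    67
2      3    rakki    88
3      4     kavi    90
4      5   karthi    89
5      6    mahes    66
6      7   seetha    54

regno        3
name     rakki
mark        88
Name: 2, dtype: object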
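The file path above is specific to one machine. As a minimal sketch, the same table can be loaded from an in-memory string with io.StringIO (an assumption made here so the example runs anywhere), and rows can then be selected by label with loc or by position with iloc.

```python
import io
import pandas as pd

# These rows mirror the sample output shown above; the in-memory string
# stands in for the CSV file (an assumption, for portability).
csv_text = """regno,name,mark
1,senthil,76
2,jamuna,67
3,rakki,88
4,kavi,90
5,karthi,89
6,mahes,66
7,seetha,54
"""

df = pd.read_csv(io.StringIO(csv_text))

# Select one row by its index label; the result is a Series
row = df.loc[2, :]
print(row)

# Select a single cell: row label 2, column 'name'
print(df.loc[2, 'name'])

# Select the first three rows by position
first_three = df.iloc[0:3]
print(first_three)
```

loc works with index labels, while iloc works with integer positions; with the default index the two happen to coincide.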
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np

# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)
print(df)

# Remove rows with missing values using the dropna() function
df.dropna(axis=0, inplace=True)
print(df)

Output:
   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         80.0
3         95.0           NaN         98.0

   First Score  Second Score  Third Score
1         90.0          45.0         40.0
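dropna() can also be restricted to particular columns or applied along the column axis. The sketch below shows a few such variations on the same score data; the variable names are illustrative and not from the original slides.

```python
import numpy as np
import pandas as pd

scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(scores)

# Default: drop every row that has at least one missing value
complete_rows = df.dropna()

# Drop a row only when the chosen column is missing
has_first = df.dropna(subset=['First Score'])

# axis=1 drops columns (not rows) that contain any missing value
complete_cols = df.dropna(axis=1)

print(complete_rows)
print(has_first)
print(complete_cols.shape)
```

Without inplace=True these calls return new data frames and leave df unchanged. Here every column contains a missing value, so dropping along axis=1 leaves no columns at all.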
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np

# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)
print(df)

# filling missing values with zero
print(df.fillna(0))

Output:
   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         80.0
3         95.0           NaN         98.0

   First Score  Second Score  Third Score
0        100.0          30.0          0.0
1         90.0          45.0         40.0
2          0.0          56.0         80.0
3         95.0           0.0         98.0
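Before filling, it often helps to count how many values are missing in each column. The sketch below does that with isnull().sum() and then, as one possible alternative to zero, fills each gap with the mean of its column. Using the column mean is an assumption chosen for illustration, not a recommendation from the original slides.

```python
import numpy as np
import pandas as pd

scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(scores)

# isnull() marks missing cells as True; sum() counts them per column
missing_counts = df.isnull().sum()
print(missing_counts)

# df.mean() skips NaN by default; fillna() matches the means to columns by label
filled = df.fillna(df.mean())
print(filled)
```

fillna() returns a new data frame here, so the original df still contains its missing values.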
Use the describe() function to summarize data.

# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np

# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score': [100, 90, 87, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)
print(df)
print(df.describe())

Output:
   First Score  Second Score  Third Score
0          100          30.0          NaN
1           90          45.0         40.0
2           87          56.0         80.0
3           95           NaN         98.0

       First Score  Second Score  Third Score
count     4.000000      3.000000     3.000000
mean     93.000000     43.666667    72.666667
std       5.715476     13.051181    29.687259
min      87.000000     30.000000    40.000000
25%      89.250000     37.500000    60.000000
50%      92.500000     45.000000    80.000000
75%      96.250000     50.500000    89.000000
max     100.000000     56.000000    98.000000
We focus primarily on one-dimensional and two-dimensional data.
Often, it is useful to go beyond this and store higher-dimensional data, that is, data indexed by more than one or two keys.
To handle three-dimensional and four-dimensional data, a common pattern in practice is to make use of hierarchical indexing (also known as multi-indexing).
Hierarchical indexing means setting more than one column as the index.
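As a minimal sketch of hierarchical indexing, the example below sets two columns as the index with set_index(). The sample data (marks per student per semester) is hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical sample data: one mark per student per semester
df = pd.DataFrame({
    'name': ['jamuna', 'jamuna', 'senthil', 'senthil'],
    'semester': [1, 2, 1, 2],
    'mark': [67, 72, 76, 81],
})

# Passing two column names creates a two-level (hierarchical) index;
# sort_index() keeps the levels ordered for efficient lookups
indexed = df.set_index(['name', 'semester']).sort_index()
print(indexed)

# Select all rows for one outer-level key
senthil_rows = indexed.loc['senthil']
print(senthil_rows)

# Select a single value with a (name, semester) tuple
jamuna_sem2 = indexed.loc[('jamuna', 2), 'mark']
print(jamuna_sem2)
```

Each row is now identified by a pair of keys, which is how two-dimensional storage can represent data with more than two dimensions.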
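# importing pandas as pd
import pandas as pd
# display() is built into Jupyter notebooks; import it when running as a script
from IPython.display import display

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display(df1, df2)

# merge() joins the two data frames on their shared 'employee' column
df3 = pd.merge(df1, df2)
display(df3)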
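merge() also accepts the on and how parameters, which make the join key and the join type explicit. The sketch below uses a hypothetical variation of df2 in which 'Sue' is missing and 'Tom' is new, so that the difference between an inner join and a left join is visible.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
# Hypothetical variation of df2: 'Sue' is missing and 'Tom' is new
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Tom'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# Inner join (the default): keep only employees present in both frames
inner = pd.merge(df1, df2, on='employee', how='inner')

# Left join: keep every row of df1; unmatched hire dates become NaN
left = pd.merge(df1, df2, on='employee', how='left')

print(inner)
print(left)
```

The inner join drops both 'Sue' and 'Tom', while the left join keeps 'Sue' with a missing hire_date.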
The groupby() method allows you to group your data and execute functions on these groups.
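A minimal sketch of groupby() follows. It reuses the employee and group columns from the merge example; the salary column is hypothetical and added only so that there is a number to aggregate.

```python
import pandas as pd

df = pd.DataFrame({
    'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
    'group': ['Accounting', 'Engineering', 'Engineering', 'HR'],
    # Hypothetical salary figures, added only for illustration
    'salary': [50000, 65000, 70000, 48000],
})

# Group the rows by department, then compute the mean salary of each group
mean_salary = df.groupby('group')['salary'].mean()
print(mean_salary)

# agg() applies several aggregation functions to each group at once
summary = df.groupby('group')['salary'].agg(['count', 'min', 'max'])
print(summary)
```

The result is indexed by the group labels, so a single group's value can be read with, for example, mean_salary['Engineering'].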