Exercise No. 12 : To study Pandas library.
In [1]: #Import the library
import numpy as np
import pandas as pd
1. Read csv
In [2]: # Reading data from a CSV file
data = pd.read_csv('F:\\MP\\ML\\ML Program Notebooks\\titanic_train.csv')
data.drop('Name', axis = 1, inplace = True)
In [ ]:
2. Display first and last five elements of
data
In [3]: # display first five elements of data
data.head()
Out[3]: PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
A/5
0 1 0 3 male 22.0 1 0 7.2500 NaN S
21171
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
STON/O2.
2 3 1 3 female 26.0 0 0 7.9250 NaN S
3101282
3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
In [4]: # display first ten elements of data
data.head(10)
Out[4]: PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
A/5
0 1 0 3 male 22.0 1 0 7.2500 NaN S
21171
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
STON/O2.
2 3 1 3 female 26.0 0 0 7.9250 NaN S
3101282
3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 female 14.0 1 0 237736 30.0708 NaN C
In [5]: # display last five elements of data
data.tail()
Out[5]: PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 male 27.0 0 0 211536 13.00 NaN S
887 888 1 1 female 19.0 0 0 112053 30.00 B42 S
W./C.
888 889 0 3 female NaN 1 2 23.45 NaN S
6607
889 890 1 1 male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 male 32.0 0 0 370376 7.75 NaN Q
In [6]: # display last ten elements of data
data.tail(10)
Out[6]: PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Emba
881 882 0 3 male 33.0 0 0 349257 7.8958 NaN
882 883 0 3 female 22.0 0 0 7552 10.5167 NaN
C.A./SOTON
883 884 0 2 male 28.0 0 0 10.5000 NaN
34068
SOTON/OQ
884 885 0 3 male 25.0 0 0 7.0500 NaN
392076
885 886 0 3 female 39.0 0 5 382652 29.1250 NaN
886 887 0 2 male 27.0 0 0 211536 13.0000 NaN
887 888 1 1 female 19.0 0 0 112053 30.0000 B42
888 889 0 3 female NaN 1 2 W./C. 6607 23.4500 NaN
889 890 1 1 male 26.0 0 0 111369 30.0000 C148
890 891 0 3 male 32.0 0 0 370376 7.7500 NaN
In [7]: data
Out[7]: PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embark
A/5
0 1 0 3 male 22.0 1 0 7.2500 NaN
21171
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85
STON/O2.
2 3 1 3 female 26.0 0 0 7.9250 NaN
3101282
3 4 1 1 female 35.0 1 0 113803 53.1000 C123
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN
... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 male 27.0 0 0 211536 13.0000 NaN
887 888 1 1 female 19.0 0 0 112053 30.0000 B42
W./C.
888 889 0 3 female NaN 1 2 23.4500 NaN
6607
889 890 1 1 male 26.0 0 0 111369 30.0000 C148
890 891 0 3 male 32.0 0 0 370376 7.7500 NaN
891 rows × 11 columns
In [ ]:
3. Describe a summary of data statistics
In [8]: data.describe()
Out[8]: PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [9]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Sex 891 non-null object
4 Age 714 non-null float64
5 SibSp 891 non-null int64
6 Parch 891 non-null int64
7 Ticket 891 non-null object
8 Fare 891 non-null float64
9 Cabin 204 non-null object
10 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB
In [ ]:
4. Insert a new column to data
In [10]: data['new_column'] = 1
data.head()
Out[10]: PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
A/5
0 1 0 3 male 22.0 1 0 7.2500 NaN S
21171
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
STON/O2.
2 3 1 3 female 26.0 0 0 7.9250 NaN S
3101282
3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
In [ ]:
5. Delete a column from a dataframe
In [11]: data.drop('new_column', axis = 1).head()
Out[11]: PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
A/5
0 1 0 3 male 22.0 1 0 7.2500 NaN S
21171
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
STON/O2.
2 3 1 3 female 26.0 0 0 7.9250 NaN S
3101282
3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
In [12]: data
Out[12]: PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embark
A/5
0 1 0 3 male 22.0 1 0 7.2500 NaN
21171
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85
STON/O2.
2 3 1 3 female 26.0 0 0 7.9250 NaN
3101282
3 4 1 1 female 35.0 1 0 113803 53.1000 C123
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN
... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 male 27.0 0 0 211536 13.0000 NaN
887 888 1 1 female 19.0 0 0 112053 30.0000 B42
W./C.
888 889 0 3 female NaN 1 2 23.4500 NaN
6607
889 890 1 1 male 26.0 0 0 111369 30.0000 C148
890 891 0 3 male 32.0 0 0 370376 7.7500 NaN
891 rows × 12 columns
In [13]: data.drop('new_column', axis = 1, inplace = True)
In [14]: data.head()
Out[14]: PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
A/5
0 1 0 3 male 22.0 1 0 7.2500 NaN S
21171
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
STON/O2.
2 3 1 3 female 26.0 0 0 7.9250 NaN S
3101282
3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
In [ ]:
6. Set column as index
In [15]: # set column 'PassengerId' as index
data.set_index('PassengerId')
Out[15]: Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
A/5
1 0 3 male 22.0 1 0 7.2500 NaN S
21171
2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
STON/O2.
3 1 3 female 26.0 0 0 7.9250 NaN S
3101282
4 1 1 female 35.0 1 0 113803 53.1000 C123 S
5 0 3 male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ...
887 0 2 male 27.0 0 0 211536 13.0000 NaN S
888 1 1 female 19.0 0 0 112053 30.0000 B42 S
W./C.
889 0 3 female NaN 1 2 23.4500 NaN S
6607
890 1 1 male 26.0 0 0 111369 30.0000 C148 C
891 0 3 male 32.0 0 0 370376 7.7500 NaN Q
891 rows × 10 columns
In [16]: data.head()
Out[16]: PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
A/5
0 1 0 3 male 22.0 1 0 7.2500 NaN S
21171
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
STON/O2.
2 3 1 3 female 26.0 0 0 7.9250 NaN S
3101282
3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
In [ ]:
7. Sort values along an axis (ascending
order)
In [17]: data.sort_values(by='Pclass')
Out[17]: PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embark
445 446 1 1 male 4.0 0 2 33638 81.8583 A34
310 311 1 1 female 24.0 0 0 11767 83.1583 C54
PC
309 310 1 1 female 30.0 0 0 56.9292 E36
17485
PC
307 308 1 1 female 17.0 1 0 108.9000 C65
17758
306 307 1 1 female NaN 0 0 17421 110.8833 NaN
... ... ... ... ... ... ... ... ... ... ...
379 380 0 3 male 19.0 0 0 347069 7.7750 NaN
381 382 1 3 female 1.0 0 2 2653 15.7417 NaN
STON/O
382 383 0 3 male 32.0 0 0 2. 7.9250 NaN
3101293
371 372 0 3 male 18.0 1 0 3101267 6.4958 NaN
890 891 0 3 male 32.0 0 0 370376 7.7500 NaN
891 rows × 11 columns
In [18]: # sort values in a descending order
data.sort_values(by='Pclass', ascending = False)
Out[18]: PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Emba
0 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN
SOTON/OQ
511 512 0 3 male NaN 0 0 8.0500 NaN
3101316
500 501 0 3 male 17.0 0 0 315086 8.6625 NaN
501 502 0 3 female 21.0 0 0 364846 7.7500 NaN
502 503 0 3 female NaN 0 0 330909 7.6292 NaN
... ... ... ... ... ... ... ... ... ... ...
102 103 0 1 male 21.0 0 1 35281 77.2875 D26
710 711 1 1 female 24.0 0 0 PC 17482 49.5042 C90
711 712 0 1 male NaN 0 0 113028 26.5500 C124
712 713 1 1 male 48.0 1 0 19996 52.0000 C126
445 446 1 1 male 4.0 0 2 33638 81.8583 A34
891 rows × 11 columns
In [19]: data.head()
Out[19]: PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
A/5
0 1 0 3 male 22.0 1 0 7.2500 NaN S
21171
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
STON/O2.
2 3 1 3 female 26.0 0 0 7.9250 NaN S
3101282
3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
In [ ]:
8. Description of Rows and Columns
In [20]: # Retrieve rows and columns count
data.shape
(891, 11)
Out[20]:
In [21]: # Display list of column names
data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
Out[21]:
'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
In [22]: # Display list of row names (indexes)
data.index
RangeIndex(start=0, stop=891, step=1)
Out[22]:
In [ ]:
9. Retrieve number of non-NA and missing
values
In [23]: # display the count of non-NA values in each column
data.count()
PassengerId 891
Out[23]:
Survived 891
Pclass 891
Sex 891
Age 714
SibSp 891
Parch 891
Ticket 891
Fare 891
Cabin 204
Embarked 889
dtype: int64
In [24]: # Display missing value status for each data point
data.isna()
Out[24]: PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 False False False False False False False False False True False
1 False False False False False False False False False False False
2 False False False False False False False False False True False
3 False False False False False False False False False False False
4 False False False False False False False False False True False
... ... ... ... ... ... ... ... ... ... ... ...
886 False False False False False False False False False True False
887 False False False False False False False False False False False
888 False False False False True False False False False True False
889 False False False False False False False False False False False
890 False False False False False False False False False True False
891 rows × 11 columns
In [25]: # display the count of missing values in each column
data.isna().sum()
PassengerId 0
Out[25]:
Survived 0
Pclass 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
In [ ]:
10. Display datatype of each column
In [26]: data.dtypes
PassengerId int64
Out[26]:
Survived int64
Pclass int64
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
In [ ]: