Pandas Exercises
Learn pandas with exercises
Let's create a DataFrame
import pandas as pd
iris = pd.read_csv("iris-write-from-docker.csv")
Now let's look at the type of the iris object.
print(type(iris))
What columns does the DataFrame consist of?
iris.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'], dtype='object')
Let's look at the first 10 rows of iris.
iris.head(10)
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
5           5.4          3.9           1.7          0.4  Iris-setosa
6           4.6          3.4           1.4          0.3  Iris-setosa
7           5.0          3.4           1.5          0.2  Iris-setosa
8           4.4          2.9           1.4          0.2  Iris-setosa
9           4.9          3.1           1.5          0.1  Iris-setosa
Let's look at the last 10 rows of iris.
iris.tail(10)
     sepal_length  sepal_width  petal_length  petal_width           class
140           6.7          3.1           5.6          2.4  Iris-virginica
141           6.9          3.1           5.1          2.3  Iris-virginica
142           5.8          2.7           5.1          1.9  Iris-virginica
143           6.8          3.2           5.9          2.3  Iris-virginica
144           6.7          3.3           5.7          2.5  Iris-virginica
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica
Let's find out the size of the data
iris.shape
(150, 5)
Another way to get information about the DataFrame is the .info() method.
This method reports the data type of each column, the number of rows in the
DataFrame, and the number of non-null values in each column.
iris.info()
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
We can use the copy() method to copy a DataFrame into a new, independent DataFrame.
iris_new = iris.copy()
iris_new.head(5)
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
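To see why copy() matters, the following minimal sketch contrasts it with plain assignment. The small DataFrame and the names alias and clone are hypothetical stand-ins for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris DataFrame.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "class": ["Iris-setosa", "Iris-setosa"],
})

alias = df          # plain assignment: both names refer to the same object
clone = df.copy()   # copy(): a new DataFrame with its own data (deep copy by default)

clone.loc[0, "sepal_length"] = 99.0   # modify only the copy

print(alias is df)                     # True
print(df.loc[0, "sepal_length"])       # 5.1, the original is untouched
print(clone.loc[0, "sepal_length"])    # 99.0
```

Changes made through alias would show up in df, because both names point at the same object; changes made to clone do not.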
We can apply NumPy functions to pandas DataFrames as well. For example,
we can apply the log() function from the NumPy package to find the natural logarithm of the values
in a DataFrame consisting of numeric values.
# let's first select the numeric columns of the iris DataFrame
import numpy as np
arr = iris.iloc[:, [0, 1, 2, 3]]
arr_log = np.log(arr)
arr_log
     sepal_length  sepal_width  petal_length  petal_width
0        1.629241     1.252763      0.336472    -1.609438
1        1.589235     1.098612      0.336472    -1.609438
2        1.547563     1.163151      0.262364    -1.609438
3        1.526056     1.131402      0.405465    -1.609438
4        1.609438     1.280934      0.336472    -1.609438
..            ...          ...           ...          ...
145      1.902108     1.098612      1.648659     0.832909
146      1.840550     0.916291      1.609438     0.641854
147      1.871802     1.098612      1.648659     0.693147
148      1.824549     1.223775      1.686399     0.832909
149      1.774952     1.098612      1.629241     0.587787
150 rows x 4 columns
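As an alternative to picking numeric columns by position, the sketch below selects them by dtype with select_dtypes() before applying np.log(). The small DataFrame is a hypothetical stand-in; the point is only that NumPy functions applied to a DataFrame return a DataFrame with the same labels.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the iris DataFrame.
small = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "petal_width": [0.2, 0.2],
    "class": ["Iris-setosa", "Iris-setosa"],
})

# Keep only numeric columns, so the text column never reaches np.log.
numeric = small.select_dtypes(include="number")
logged = np.log(numeric)

print(type(logged))             # still a pandas DataFrame
print(logged.columns.tolist())  # column labels are preserved
```

Selecting by dtype avoids hard-coding column positions, which can be convenient when the column order is not known in advance.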
SELECTING FROM A DATAFRAME
iris["sepal_length"]
0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ...
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64
If we use single square brackets on the data set, the data in the column is returned as
one-dimensional. You can also think of it as a list. As a matter of fact, when we examine the
type of the data selected in this way, we see that it is a one-dimensional Series.
type(iris["sepal_length"])
pandas.core.series.Series
If we want the selected column to behave as a DataFrame, we need to write the column name in
two square brackets.
print(type(iris[["sepal_length"]]))
iris[["sepal_length"]]
     sepal_length
0             5.1
1             4.9
2             4.7
3             4.6
4             5.0
..            ...
145           6.7
146           6.3
147           6.5
148           6.2
149           5.9
150 rows x 1 columns
It is possible to select more than one column.
iris[["sepal_length", "sepal_width"]]
     sepal_length  sepal_width
0             5.1          3.5
1             4.9          3.0
2             4.7          3.2
3             4.6          3.1
4             5.0          3.6
..            ...          ...
145           6.7          3.0
146           6.3          2.5
147           6.5          3.0
148           6.2          3.4
149           5.9          3.0
150 rows x 2 columns
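The difference between single and double square brackets can be checked directly by comparing types and shapes. This is a minimal sketch on a hypothetical three-row DataFrame rather than the full iris data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris DataFrame.
small = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "sepal_width": [3.5, 3.0, 3.2],
})

as_series = small["sepal_length"]     # single brackets: one-dimensional Series
as_frame = small[["sepal_length"]]    # double brackets: DataFrame with one column

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```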
We have seen selecting columns; now let's see selecting rows.
iris[2:5]
   sepal_length  sepal_width  petal_length  petal_width        class
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
Now let's look at the row with index label 5. For this, we use the .loc[] indexer, and
the desired row's label is given as a parameter.
# as a Series
iris.loc[5]
sepal_length            5.4
sepal_width             3.9
petal_length            1.7
petal_width             0.4
class           Iris-setosa
Name: 5, dtype: object
# as a DataFrame
iris.loc[[5]]
   sepal_length  sepal_width  petal_length  petal_width        class
5           5.4          3.9           1.7          0.4  Iris-setosa
# multiple row selection
iris.loc[[5, 6]]
   sepal_length  sepal_width  petal_length  petal_width        class
5           5.4          3.9           1.7          0.4  Iris-setosa
6           4.6          3.4           1.4          0.3  Iris-setosa
iris["petal_length"][5]
1.7
# different spelling
iris.petal_length[5]
1.7
iris.loc[5, "class"]
'Iris-setosa'
iris.loc[[1, 2, 3, 4, 5], ["petal_length", "petal_width", "class"]]
   petal_length  petal_width        class
1           1.4          0.2  Iris-setosa
2           1.3          0.2  Iris-setosa
3           1.5          0.2  Iris-setosa
4           1.4          0.2  Iris-setosa
5           1.7          0.4  Iris-setosa
iris.loc[:, ["sepal_length", "class"]]
     sepal_length           class
0             5.1     Iris-setosa
1             4.9     Iris-setosa
2             4.7     Iris-setosa
3             4.6     Iris-setosa
4             5.0     Iris-setosa
..            ...             ...
145           6.7  Iris-virginica
146           6.3  Iris-virginica
147           6.5  Iris-virginica
148           6.2  Iris-virginica
149           5.9  Iris-virginica
150 rows x 2 columns
The .iloc[] indexer is used to select by integer position instead of by row and column names.
iris.iloc[1]
sepal_length            4.9
sepal_width             3.0
petal_length            1.4
petal_width             0.2
class           Iris-setosa
Name: 1, dtype: object
iris.iloc[[1]]
   sepal_length  sepal_width  petal_length  petal_width        class
1           4.9          3.0           1.4          0.2  Iris-setosa
iris.iloc[6, 1]
3.4
iris.iloc[[1, 2, 3, 6]]
   sepal_length  sepal_width  petal_length  petal_width        class
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
6           4.6          3.4           1.4          0.3  Iris-setosa
Selecting multiple rows and columns works just like with .loc[], but this time with row and
column index numbers.
iris.iloc[[1, 2, 3], [2, 3]]
   petal_length  petal_width
1           1.4          0.2
2           1.3          0.2
3           1.5          0.2
It is possible to use column names and index numbers together.
iris["class"][0:5]
0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: class, dtype: object
iris[["sepal_length", "sepal_width"]][0:5]
   sepal_length  sepal_width
0           5.1          3.5
1           4.9          3.0
2           4.7          3.2
3           4.6          3.1
4           5.0          3.6
iris.loc[5:10, "sepal_length":"petal_length"]
    sepal_length  sepal_width  petal_length
5            5.4          3.9           1.7
6            4.6          3.4           1.4
7            5.0          3.4           1.5
8            4.4          2.9           1.4
9            4.9          3.1           1.5
10           5.4          3.7           1.5
iris["sepal_length"] = takes the column as a one-dimensional Series.
iris[["sepal_length"]] = takes the column as a pandas DataFrame.
iris[2:5] = retrieves the rows from row 3 (index number 2) to row 5 (index number 4).
iris.loc[5] = takes the row labeled 5 as a one-dimensional Series.
iris.loc[[5]] = takes the row labeled 5 as a pandas DataFrame.
iris.iloc[[1]] = retrieves the row at integer position 1 as a pandas DataFrame.
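In the iris data the row labels happen to equal the row positions, which hides the difference between .loc[] and .iloc[]. The sketch below uses a hypothetical DataFrame with labels 10, 20, 30, 40 to make the distinction visible, including the fact that label slices include their end point while position slices do not.

```python
import pandas as pd

# Hypothetical data whose index labels differ from the row positions.
small = pd.DataFrame(
    {"petal_length": [1.4, 1.4, 1.3, 1.5]},
    index=[10, 20, 30, 40],
)

by_label = small.loc[20, "petal_length"]   # row labeled 20 (the second row)
by_position = small.iloc[2, 0]             # third row, first column

label_slice = small.loc[10:30]      # label slice: end label 30 is included
position_slice = small.iloc[0:2]    # position slice: end position 2 is excluded

print(by_label, by_position)                  # 1.4 1.3
print(len(label_slice), len(position_slice))  # 3 2
```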
PANDAS DATA ANALYSIS
Values such as the minimum, maximum, mean, standard deviation, median, and quartiles are available
through the .describe() method of a pandas DataFrame.
iris.describe()
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
iris["class"].describe()
count             150
unique              3
top       Iris-setosa
freq               50
Name: class, dtype: object
We can use the .unique() method to see which distinct categories occur in a column.
iris["class"].unique()
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
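Alongside .unique(), pandas also offers .nunique() and .value_counts(), which report how many distinct values there are and how often each occurs. A minimal sketch on a hypothetical four-element class column:

```python
import pandas as pd

# Hypothetical miniature class column; the real one has 150 entries.
classes = pd.Series(
    ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    name="class",
)

distinct = classes.unique()          # distinct values, in order of first appearance
n_distinct = classes.nunique()       # number of distinct values
per_class = classes.value_counts()   # occurrences per value, most frequent first

print(distinct)
print(n_distinct)
print(per_class)
```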
Separate methods are also available to calculate individual summary values, such as the count, mean, and
standard deviation of numeric columns. For example, .count() returns the number of non-null values in each column.
iris.count()
sepal_length    150
sepal_width     150
petal_length    150
petal_width     150
class           150
dtype: int64
data = ["petal_length", "petal_width"]
iris[data].count()
petal_length    150
petal_width     150
dtype: int64
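Because the iris data has no missing values, every count above equals the number of rows. The following sketch, on a hypothetical DataFrame with one missing entry, shows that .count() reports non-null values per column rather than the row count.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing petal_length value.
small = pd.DataFrame({
    "petal_length": [1.4, np.nan, 1.3],
    "petal_width": [0.2, 0.2, 0.2],
})

counts = small.count()   # non-null values per column
n_rows = len(small)      # total number of rows, missing or not

print(counts)
print(n_rows)  # 3
```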
Mean: .mean()
Standard deviation: .std()
Median: .median()
Percentile: .quantile(y), where y is the quantile as a fraction between 0 and 1
In order to make row-based calculations, we need to specify the axis='columns' argument in the method.
The class column is not numeric, so we also pass numeric_only=True; older pandas versions dropped such
columns with a FutureWarning, and pandas 2.0 and later raise a TypeError instead.
iris.mean(axis='columns', numeric_only=True)
0      2.550
1      2.375
2      2.350
3      2.350
4      2.550
       ...
145    4.300
146    3.925
147    4.175
148    4.325
149    3.950
Length: 150, dtype: float64
# axis=1 is the integer equivalent of axis='columns'
iris.mean(axis=1, numeric_only=True)
0      2.550
1      2.375
2      2.350
3      2.350
4      2.550
       ...
145    4.300
146    3.925
147    4.175
148    4.325
149    3.950
Length: 150, dtype: float64
# axis='rows' (equivalently axis=0) computes the mean of each column
iris.mean(axis='rows', numeric_only=True)
sepal_length    5.843333
sepal_width     3.054000
petal_length    3.758667
petal_width     1.198667
dtype: float64
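The summary methods (.mean(), .std(), .median(), .quantile()) and the axis argument can be tried on a small example where the results are easy to verify by hand. The DataFrame below is hypothetical, and numeric_only=True is passed so that the text column is skipped.

```python
import pandas as pd

# Hypothetical data: two numeric columns and one text column.
small = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 10.0],
    "b": [2.0, 2.0, 2.0, 2.0, 2.0],
    "label": ["x", "y", "x", "y", "x"],
})

# Column-wise summaries (the default axis): one value per column.
col_mean = small.mean(numeric_only=True)            # a: 4.0, b: 2.0
col_median = small.median(numeric_only=True)        # a: 3.0, b: 2.0
col_std = small.std(numeric_only=True)              # sample standard deviation (ddof=1)
col_q25 = small.quantile(0.25, numeric_only=True)   # 25th percentile per column

# Row-wise summary: one value per row, averaging across the columns.
row_mean = small.mean(axis="columns", numeric_only=True)

print(col_mean)
print(row_mean.tolist())  # [1.5, 2.0, 2.5, 3.0, 6.0]
```

Note that .std() uses the sample formula (dividing by n - 1) by default, which is why column a gives the square root of 12.5 rather than the square root of 10.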
The way to make conditional selections in text columns is to use the .str accessor.
iris2 = pd.read_csv("iris-write-from-docker.csv")
cond = iris2["class"].str.contains("Iris-setosa")
setosa = iris2[cond]
iris2.head()
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
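The .str accessor produces a boolean Series that can be used as a row filter. The sketch below shows this on a hypothetical four-row class column, using str.contains() with regex=False for a plain substring match and str.startswith() as a second example.

```python
import pandas as pd

# Hypothetical miniature class column.
small = pd.DataFrame({
    "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
})

# Boolean mask: True where the text contains the substring.
mask = small["class"].str.contains("setosa", regex=False)
setosa_rows = small[mask]

# Another string method available through the same accessor.
starts_with_v = small["class"].str.startswith("Iris-v")

print(mask.tolist())            # [True, False, False, True]
print(len(setosa_rows))         # 2
print(starts_with_v.tolist())   # [False, True, True, False]
```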
CONDITIONAL SELECTION
iris[iris.sepal_length > 7.5]
     sepal_length  sepal_width  petal_length  petal_width           class
105           7.6          3.0           6.6          2.1  Iris-virginica
117           7.7          3.8           6.7          2.2  Iris-virginica
118           7.7          2.6           6.9          2.3  Iris-virginica
122           7.7          2.8           6.7          2.0  Iris-virginica
131           7.9          3.8           6.4          2.0  Iris-virginica
135           7.7          3.0           6.1          2.3  Iris-virginica
iris[(iris.sepal_length > 6.5) & (iris.petal_length < 4.5)]
    sepal_length  sepal_width  petal_length  petal_width            class
65           6.7          3.1           4.4          1.4  Iris-versicolor
75           6.6          3.0           4.4          1.4  Iris-versicolor
iris[(iris.sepal_length > 7.5) | (iris.petal_length > 6.5)]
     sepal_length  sepal_width  petal_length  petal_width           class
105           7.6          3.0           6.6          2.1  Iris-virginica
117           7.7          3.8           6.7          2.2  Iris-virginica
118           7.7          2.6           6.9          2.3  Iris-virginica
122           7.7          2.8           6.7          2.0  Iris-virginica
131           7.9          3.8           6.4          2.0  Iris-virginica
135           7.7          3.0           6.1          2.3  Iris-virginica
It is also possible to pull data from another column by applying a condition to the data of one
column. For example, let's see the petal length values of the rows with a sepal length greater than
7.5.
iris.petal_length[iris.sepal_length > 7.5]
105    6.6
117    6.7
118    6.9
122    6.7
131    6.4
135    6.1
Name: petal_length, dtype: float64
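The conditional selections above all rest on boolean masks, which can be stored in variables and combined with & (and) and | (or). The sketch below uses a hypothetical four-row DataFrame and also shows .loc[] taking a mask and a column name together, as one way to pull a single column for the matching rows.

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris DataFrame.
small = pd.DataFrame({
    "sepal_length": [7.6, 6.7, 7.9, 5.0],
    "petal_length": [6.6, 4.4, 6.4, 1.4],
})

long_sepal = small["sepal_length"] > 7.5   # [True, False, True, False]

# Each condition needs its own parentheses when combined with & or |.
both = small[long_sepal & (small["petal_length"] < 6.5)]
either = small[long_sepal | (small["petal_length"] > 6.5)]

# Mask for the rows, column name for the column.
petals = small.loc[long_sepal, "petal_length"]

print(both.index.tolist())    # [2]
print(either.index.tolist())  # [0, 2]
print(petals.tolist())        # [6.6, 6.4]
```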
Finally, let's take a look at the reshaping and manipulation operations that can be performed
on pandas DataFrames. Let's create the following df DataFrame to use in our examples.
variable = np.repeat(['A', 'B', 'C', 'D'], [3, 3, 3, 3], axis=0)
val = np.random.random(12)
df_dict = {'variable': variable, 'value': val}
df = pd.DataFrame(df_dict)
df = df[['variable', 'value']]
df
   variable     value
0         A  0.658697
1         A  0.986387
2         A  0.980378
3         B  0.848280
4         B  0.600380
5         B  0.131574
6         C  0.188967
7         C  0.202935
8         C  0.431972
9         D  0.960973
10        D  0.839120
11        D  0.903636
Now let's reshape this DataFrame so that it has the variable names A, B, C, D as columns. There is a pivot()
method for this.
df2 = df.pivot(columns='variable', values='value')
df2
variable         A         B         C         D
0         0.658697       NaN       NaN       NaN
1         0.986387       NaN       NaN       NaN
2         0.980378       NaN       NaN       NaN
3              NaN  0.848280       NaN       NaN
4              NaN  0.600380       NaN       NaN
5              NaN  0.131574       NaN       NaN
6              NaN       NaN  0.188967       NaN
7              NaN       NaN  0.202935       NaN
8              NaN       NaN  0.431972       NaN
9              NaN       NaN       NaN  0.960973
10             NaN       NaN       NaN  0.839120
11             NaN       NaN       NaN  0.903636
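The pivoted result contains many NaN values because no index was given, so each original row keeps its own row label. If the data carries a column that identifies the observation within each group, passing it as index yields a compact table. The sketch below assumes such a column, here called obs, which is hypothetical and not part of the original df.

```python
import pandas as pd

# Hypothetical long-format data with an explicit observation number per variable.
long_df = pd.DataFrame({
    "obs": [0, 1, 0, 1],
    "variable": ["A", "A", "B", "B"],
    "value": [0.1, 0.2, 0.3, 0.4],
})

# index -> row labels, columns -> new column names, values -> cell contents.
wide = long_df.pivot(index="obs", columns="variable", values="value")

print(wide)
```

Each (obs, variable) pair must be unique for pivot() to work; duplicates raise a ValueError.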
There is a melt() method to turn columns back into rows.
df3 = df2.melt(value_vars=["A", "B", "C", "D"], value_name="value").dropna()
df3
   variable     value
0         A  0.658697
1         A  0.986387
2         A  0.980378
15        B  0.848280
16        B  0.600380
17        B  0.131574
30        C  0.188967
31        C  0.202935
32        C  0.431972
45        D  0.960973
46        D  0.839120
47        D  0.903636
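The gaps in the index of the melted result (0, 1, 2, 15, ...) come from dropna() removing the NaN rows that melt() produced. A minimal sketch on a hypothetical two-column wide table shows the same effect at a smaller scale; var_name is set explicitly here for clarity.

```python
import pandas as pd

# Hypothetical wide table with missing cells, similar in shape to a pivot result.
wide = pd.DataFrame({"A": [0.1, None], "B": [None, 0.3]})

# melt() stacks the columns: one row per (column, original row) pair.
long_df = wide.melt(value_vars=["A", "B"], var_name="variable", value_name="value")

# dropna() removes the rows whose value is missing; surviving rows keep their labels.
clean = long_df.dropna()

print(len(long_df))           # 4
print(clean.index.tolist())   # [0, 3]
```

Calling reset_index(drop=True) on the result would renumber the rows from zero if contiguous labels are preferred.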
The merge() method is used to merge two DataFrames using a key column.
dict2 = {"X": ["Ali", "Baran", "Mehmet"], "Y1": [97, 85, 76]}
dict3 = {"X": ["Ali", "Baran", "Umut"], "Y2": [75, 94, 96]}
df4 = pd.DataFrame(dict2)
df5 = pd.DataFrame(dict3)
print(df4)
print(df5)
        X  Y1
0     Ali  97
1   Baran  85
2  Mehmet  76
       X  Y2
0    Ali  75
1  Baran  94
2   Umut  96
df6 = pd.merge(df4, df5, how='left', on='X')
df6
        X  Y1    Y2
0     Ali  97  75.0
1   Baran  85  94.0
2  Mehmet  76   NaN
df7 = pd.merge(df4, df5, how='right', on='X')
df7
       X    Y1  Y2
0    Ali  97.0  75
1  Baran  85.0  94
2   Umut   NaN  96
df8 = pd.merge(df4, df5, how='inner', on='X')
df8
       X  Y1  Y2
0    Ali  97  75
1  Baran  85  94
df9 = pd.merge(df4, df5, how='outer', on='X')
df9
        X    Y1    Y2
0     Ali  97.0  75.0
1   Baran  85.0  94.0
2  Mehmet  76.0   NaN
3    Umut   NaN  96.0
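When comparing the four join types, it can help to see which table each row came from. The sketch below repeats the outer merge with indicator=True, which adds a _merge column; the names left and right are chosen here for illustration and hold the same data as df4 and df5.

```python
import pandas as pd

# Same data as df4 and df5 above, under illustrative names.
left = pd.DataFrame({"X": ["Ali", "Baran", "Mehmet"], "Y1": [97, 85, 76]})
right = pd.DataFrame({"X": ["Ali", "Baran", "Umut"], "Y2": [75, 94, 96]})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only".
merged = pd.merge(left, right, how="outer", on="X", indicator=True)

# Map each key to the origin of its row.
origin = dict(zip(merged["X"], merged["_merge"].astype(str)))

print(merged)
print(origin)
```

Rows marked left_only are the ones a right or inner join would drop, and rows marked right_only are the ones a left or inner join would drop.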