Data analysis in Python
Dr. Santosh Prasad Gupta
Assistant Professor
Department of Physics
Patna University, Patna
6/27/2021 Department of Physics, PU: SP Gupta 1
In this document, we learn about:
File handling: reading and writing files, along with other file-handling options.
Statistics of a data set: mean, median, variance, standard deviation and
other parameters using numpy and pandas
Ordinary data written directly in the script
Imported ordinary data
Statistics of a data set: mean and standard deviation using numpy and
pandas, along with data visualization
Imported frequency data
Imported grouped frequency data
Creating and handling a file using Python
File handling is an important concept for any programmer. It can be used for
creating, deleting and moving files, or for storing application data, user configurations, videos,
images, etc. Python supports file handling and allows users to read and write files,
along with many other file-handling operations. A file is opened in one of the following modes:
Read Only (‘r’): Open the file for reading. The file must already exist.
Write Only (‘w’): Open the file for writing. For an existing file, the data is truncated
and over-written.
Write and Read (‘w+’): Open the file for reading and writing. For an existing file,
data is truncated and over-written.
Append Only (‘a’): Open the file for writing. The data being written will be inserted at
the end, after the existing data.
Append and Read (‘a+’): Open the file for reading and writing. The data being written
will be inserted at the end, after the existing data.
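A minimal sketch of how the modes above behave. It assumes a scratch file in a temporary directory; the file name modes_demo.txt and the variable names are hypothetical and not part of the original slides.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory (not the path used in the slides)
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# 'w' creates the file, or truncates it if it already exists
with open(path, "w") as f:
    f.write("first\n")

# Opening with 'w' again discards the previous content
with open(path, "w") as f:
    f.write("second\n")

# 'a' keeps the existing content and writes at the end
with open(path, "a") as f:
    f.write("third\n")

# 'a+' also appends, and allows reading after moving back to the start
with open(path, "a+") as f:
    f.write("fourth\n")
    f.seek(0)
    content_after_append = f.read()

# 'r' opens the file for reading only
with open(path, "r") as f:
    lines = f.read().splitlines()

print(lines)  # ['second', 'third', 'fourth']
```

The line "first" never appears in the result because the second open in 'w' mode truncated the file.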
Python code
# script file for creating a file and writing
data = open("D:\\PWC\\data_analysis\\test.txt", "w")
data.write("ram \t shyam \n 1 \t 2 \n 3 \t 4")
data.close()
# reading the file after writing
data = open("D:\\PWC\\data_analysis\\test.txt", "r")
print(data.read())
data.close()
Output
ram      shyam
 1       2
 3       4
Python code
# script file for creating a file and writing
data = open("D:\\PWC\\data_analysis\\test.txt", "w")
data.write("ram \t shyam \n 1 \t 2 \n 3 \t 4")
data.close()
# script file for opening a file in appending mode
data = open("D:\\PWC\\data_analysis\\test.txt", "a")
data.write("""\n Hello! I have added….
\n this is one way of
\n multi-line writing""")
data.close()
# reading the file after writing and appending
data = open("D:\\PWC\\data_analysis\\test.txt", "r")
print(data.read())
data.close()
Output
ram      shyam
 1       2
 3       4
 Hello! I have added….

 this is one way of

 multi-line writing
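The scripts above call close() explicitly. A common alternative is the with statement, which closes the file automatically when the block ends, even if an error occurs. The sketch below assumes a temporary file in place of the D:\ path, and the variable names are illustrative.

```python
import os
import tempfile

# Assumption: a temporary file stands in for the test.txt file used in the slides
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Write a small tab-separated table; the file is closed when the block ends
with open(path, "w") as f:
    f.write("ram\tshyam\n1\t2\n3\t4\n")

# Append one more row
with open(path, "a") as f:
    f.write("5\t6\n")

# Read the file line by line and split each line into its columns
rows = []
with open(path, "r") as f:
    for line in f:
        rows.append(line.rstrip("\n").split("\t"))

print(rows)      # [['ram', 'shyam'], ['1', '2'], ['3', '4'], ['5', '6']]
print(f.closed)  # True
```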
Statistics of a data set using numpy
Mean or Average: An average is a number expressing the central or typical value in a set of
data, in particular the mode, median, or (most commonly) the mean, which is calculated
by dividing the sum of the values in the set by their number. The basic formula for the
average of n numbers x1, x2, ..., xn is
$x_{mean} = \frac{x_1 + x_2 + \dots + x_n}{n}$
Use: np.average
Python code
# Python program to get average of a list
# Importing the NumPy module
import numpy as np
# Taking a list of elements
values = [2, 4, 4, 4, 5, 5, 7, 9]
# Calculating average using average()
print('mean is:', np.average(values))
Output
mean is: 5.0
Median: The median is the value that separates the higher half of a data sample or
probability distribution from the lower half. For an odd number of elements, the median
is the middle value of the sorted data. For an even number of elements, the median is the
mean of the two middle values.
Use: np.median
Python code
# Python program to get median of a list
# Importing the NumPy module
import numpy as np
# Taking a list of elements
values = [2, 4, 4, 4, 5, 5, 7, 9]
# Calculating median using median()
print('median is:', np.median(values))
Output
median is: 4.5
Variance
Variance is the mean of the squared differences between each value and the mean of the data.
The mathematical formula for variance is as follows,
$\text{variance} = \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - x_{mean})^2$
Use: np.var
Python code
# Python program to get variance of a list
# Importing the NumPy module
import numpy as np
# Taking a list of elements
values = [212, 231, 234, 564, 235]
# Calculating variance using var()
print('variance is:', np.var(values))
Output
variance is: 18133.359999999997
Standard Deviation
Standard deviation is the square root of the variance. It is a measure of the extent to which
the data vary from the mean. The mathematical formula for calculating the standard deviation
is as follows,
$\text{Standard deviation} = \sigma = \sqrt{\text{variance}}$
Use: np.std
Python code
# Python program to get standard deviation of a list
import numpy as np
# Taking a list of elements
values = [2, 4, 4, 4, 5, 5, 7, 9]
# Calculating standard deviation using std()
print('std. dev. is:', np.std(values))
Output
std. dev. is: 2.0
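As a check on the formulas above, the variance and standard deviation can be computed step by step and compared with np.var and np.std. These NumPy functions divide by N by default (the population form); passing ddof=1 divides by N - 1 instead (the sample form). The variable names below are illustrative.

```python
import numpy as np

values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Step-by-step population variance: mean of the squared deviations from the mean
mean = values.sum() / values.size
variance_manual = ((values - mean) ** 2).sum() / values.size
std_manual = np.sqrt(variance_manual)

print(mean, variance_manual, std_manual)  # 5.0 4.0 2.0

# np.var and np.std use the same divisor N by default (ddof=0)
print(np.var(values), np.std(values))     # 4.0 2.0

# ddof=1 gives the sample variance, dividing by N - 1 (here 32 / 7)
sample_variance = np.var(values, ddof=1)
print(sample_variance)
```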
Statistics of a data set using pandas and numpy
We learn how to import an ordinary data file and then how to calculate statistical parameters
such as the mean, median, variance and standard deviation.
Multicolumn data
Suppose we have multicolumn (y1, y2) data saved at some location on your laptop. Our
objective is to calculate the mean, median, variance and standard deviation of each
column of the data, and the same parameters along each row.
Here, I have saved the data at the location D:\PWC\data_analysis\set1.csv, that is, in a file
named set1 with the csv extension. Let us first display the data.
Python script for displaying the file
# reading and printing the file
data1 = open("D:\\PWC\\data_analysis\\set1.csv", "r")
print(data1.read())
Output
y1,y2
0,100
10,200
20,400
30,700
40,1200
50,1500
60,1800
70,2000
80,2200
Displaying some information about the data using pandas
Python script
import pandas as pd
import numpy as np
import statistics as st
# Load the data or importing a data file
data1 = pd.read_csv("D:\\PWC\\data_analysis\\set1.csv")
# print some information of data
print(data1.shape)
print(data1.info())
# print few lines of data
print(data1.head())
Output
(9, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   y1      9 non-null      int64
 1   y2      9 non-null      int64
dtypes: int64(2)
memory usage: 272.0 bytes
None
   y1    y2
0   0   100
1  10   200
2  20   400
3  30   700
4  40  1200
data1.info() prints a summary of the data (index range, column names, non-null counts, data types
and memory usage); because it returns None, print(data1.info()) also prints None.
data1.head() gives the first five rows of the data.
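If the file set1.csv is not at hand, the same table can be built from the text displayed earlier. The sketch below assumes the CSV content is pasted into a string and read through io.StringIO, which pd.read_csv accepts in place of a file path; the name csv_text is illustrative.

```python
import io
import pandas as pd

# The content of set1.csv as displayed earlier, held in a string
csv_text = """y1,y2
0,100
10,200
20,400
30,700
40,1200
50,1500
60,1800
70,2000
80,2200
"""

data1 = pd.read_csv(io.StringIO(csv_text))

print(data1.shape)          # (9, 2)
print(list(data1.columns))  # ['y1', 'y2']
print(data1.tail(3))        # the last three rows
```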
Statistics of a data set using pandas and numpy: mean
We learn how to import an ordinary data file and then how to calculate statistical
parameters such as the mean, variance and standard deviation.
Python script
import pandas as pd
import numpy as np
import statistics as st
# Load the data or importing a data file
data1 = pd.read_csv("D:\\PWC\\data_analysis\\set1.csv")
# calculating mean of y1 and y2 column-wise using mean
print('mean of y1.:', data1.loc[:,'y1'].mean())
print('mean of y2.:', data1.loc[:,'y2'].mean())
# calculating mean of y1 and y2 row-wise
print('row-wise mean\n', data1.mean(axis = 1)[0:7])
Output
mean of y1.: 40.0
mean of y2.: 1122.22222
row-wise mean
0     50.0
1    105.0
2    210.0
3    365.0
4    620.0
5    775.0
6    930.0
dtype: float64
The mean of each row is calculated by specifying the axis = 1 argument. The slice
[0:7] restricts the output to the first seven rows.
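To make the meaning of the axis argument concrete, here is a small sketch on the first three rows of the data, typed in directly rather than read from the file. axis=0 (the default) averages down each column, while axis=1 averages across each row. The names small, col_means and row_means are illustrative.

```python
import pandas as pd

# First three rows of the data, entered by hand for illustration
small = pd.DataFrame({"y1": [0, 10, 20], "y2": [100, 200, 400]})

col_means = small.mean(axis=0)  # one value per column
row_means = small.mean(axis=1)  # one value per row

print(col_means["y1"])     # 10.0
print(row_means.tolist())  # [50.0, 105.0, 210.0]
```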
Statistics of a data set using pandas and numpy: median, variance and standard deviation using the
methods median, var, std
Python script
import pandas as pd
import numpy as np
import statistics as st
# Load the data or importing a data file
data1 = pd.read_csv("D:\\PWC\\data_analysis\\set1.csv")
# calculating median of y1 and y2 column-wise using median
print('median of y1.:', data1.loc[:,'y1'].median())
print('median of y2.:', data1.loc[:,'y2'].median())
# calculating variance of y1 and y2 column-wise using var
print('variance of y1:', data1.loc[:,'y1'].var())
print('variance of y2:', data1.loc[:,'y2'].var())
# calculating std. dev. of y1 and y2 column-wise using std
print('std. dev. of y1:', data1.loc[:,'y1'].std())
print('std. dev. of y2:', data1.loc[:,'y2'].std())
# calculating median of y1 and y2 row-wise, first two rows
print('row-wise median:\n', data1.median(axis = 1)[0:2])
# calculating variance of y1 and y2 row-wise, first two rows
print('row-wise variance:\n', data1.var(axis = 1)[0:2])
# calculating std. dev. of y1 and y2 row-wise, first two rows
print('row-wise std. dev.:\n', data1.std(axis = 1)[0:2])
Output
median of y1.: 40.0
median of y2.: 1200.0
variance of y1: 750.0
variance of y2: 641944.4444444445
std. dev. of y1: 27.386127875258307
std. dev. of y2: 801.2143561147943
row-wise median:
0     50.0
1    105.0
dtype: float64
row-wise variance:
0     5000.0
1    18050.0
dtype: float64
row-wise std. dev.:
0     70.710678
1    134.350288
dtype: float64
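One point worth noting: the pandas methods var() and std() divide by N - 1 by default (ddof=1, the sample form), whereas np.var and np.std divide by N (ddof=0). This is why the variance of y1 comes out as 750.0 rather than the population value. A short sketch, with the y1 column typed in directly and illustrative variable names:

```python
import numpy as np
import pandas as pd

y1 = pd.Series([0, 10, 20, 30, 40, 50, 60, 70, 80], name="y1")

sample_var = y1.var()              # pandas default, ddof=1: 6000 / 8
population_var = y1.var(ddof=0)    # same divisor as np.var: 6000 / 9
numpy_var = np.var(y1.to_numpy())  # NumPy default, ddof=0

print(sample_var)      # 750.0
print(population_var)  # about 666.67
print(numpy_var)       # about 666.67
```

To make the two libraries agree, pass the same ddof to both.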
Statistics of a data set using pandas and numpy: all values together using describe
import pandas as pd
import numpy as np
import statistics as st
# Load the data or importing a data file
data1 = pd.read_csv("D:\\PWC\\data_analysis\\set1.csv")
# calculating the important statistical parameters at once
print(data1.describe())
Output
              y1           y2
count   9.000000     9.000000
mean   40.000000  1122.222222
std    27.386128   801.214356
min     0.000000   100.000000
25%    20.000000   400.000000
50%    40.000000  1200.000000
75%    60.000000  1800.000000
max    80.000000  2200.000000
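The 25%, 50% and 75% rows of describe() are the quartiles of each column (the 50% value is the median), and the std row uses the sample form with division by N - 1. As a small sketch, the quartiles of y2 can be reproduced with quantile(), here with the column typed in directly; the variable names are illustrative.

```python
import pandas as pd

y2 = pd.Series([100, 200, 400, 700, 1200, 1500, 1800, 2000, 2200], name="y2")

# Quartiles: the values below which 25%, 50% and 75% of the data lie
quartiles = y2.quantile([0.25, 0.5, 0.75])

print(quartiles.tolist())  # [400.0, 1200.0, 1800.0]
print(y2.median())         # 1200.0
```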
Calculation of mean and standard deviation of frequency data
Suppose we have X-ray diffraction data: the variation of intensity with angle, as shown in the table
below. We want to calculate the mean and standard deviation of the angle, weighted by the intensity.
Theta (T) (in degree) Intensity (I) (in counts)
20 1
30 5
40 10
50 15
60 11
70 9
80 2
To calculate the mean and standard deviation, we follow these steps.
First calculate (theta x intensity), that is, (T I)
Calculate the mean: $T_{mean} = T_m = \frac{\sum T I}{\sum I}$
Calculate $T - T_m$ and then calculate $(T - T_m)^2 I$
Calculate the standard deviation: $\sigma = \sqrt{\frac{\sum (T - T_m)^2 I}{\sum I}}$
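The steps above amount to an intensity-weighted mean and standard deviation. A compact sketch using np.average with its weights argument is given here; the array names are illustrative, and the values are those of the table.

```python
import numpy as np

theta = np.array([20, 30, 40, 50, 60, 70, 80], dtype=float)  # T, in degrees
intensity = np.array([1, 5, 10, 15, 11, 9, 2], dtype=float)  # I, in counts

# Weighted mean: sum(T * I) / sum(I)
t_mean = np.average(theta, weights=intensity)

# Weighted standard deviation: sqrt(sum((T - Tm)^2 * I) / sum(I))
sigma = np.sqrt(np.average((theta - t_mean) ** 2, weights=intensity))

print(round(t_mean, 2), round(sigma, 2))  # 52.26 13.82
```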
Table for calculating the various terms
T     I     TI     T - Tm     (T - Tm)^2 I
20    1     20     -32.26     1040.7076
30    5     150    -22.26     2477.538
40    10    400    -12.26     1503.076
50    15    750    -2.26      76.614
60    11    660    7.74       658.9836
70    9     630    17.74      2832.3684
80    2     160    27.74      1539.0152
Sum of I = 53, sum of TI = 2770, sum of (T - Tm)^2 I = 10128.3028
$T_{mean} = T_m = \frac{\sum T I}{\sum I} = \frac{2770}{53} \approx 52.26$
$\sigma = \sqrt{\frac{\sum (T - T_m)^2 I}{\sum I}} = \sqrt{\frac{10128.3028}{53}} = \sqrt{191.10} \approx 13.82$
Calculation of mean and standard deviation of the same data using Python
Let us first display and visualize the data.
Python script for displaying the data file
# reading and printing the file
da2 = open("D:\\PWC\\data_analysis\\set2.csv", "r")
print(da2.read())
Output
T,I
20,1
30,5
40,10
50,15
60,11
70,9
80,2
Python script for displaying the data using a bar plot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load the data or importing a data file
da2 = pd.read_csv("D:\\PWC\\data_analysis\\set2.csv")
# Creating the bar plot
plt.bar(da2['T'], da2['I'], color='orange', width=1)
# Labeling the X and Y axis
plt.xlabel("Theta (in degree)")
plt.ylabel("Intensity (in counts)")
plt.title("Variation of Intensity (I) with Theta (T)")
plt.show()
Calculation of the different components shown in the previous calculation table, and
calculation of the mean and standard deviation using Python
Python script
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
# Load the data or importing a data file
da2 = pd.read_csv("D:\\PWC\\data_analysis\\set2.csv")
# calculation of multiplication of theta (T) and Intensity (I): T I
da2['TI'] = da2['T']*da2['I']
# calculation of mean Tm: sum of da2['TI']/sum of da2['I'] and printing
Tm = da2['TI'].sum()/da2['I'].sum()
print('mean (Tm) is:', Tm)
# calculation of T-Tm and printing
da2['T-Tm'] = da2['T']-Tm
print('The column T-Tm is:\n', da2['T-Tm'])
# calculation of (T-Tm)^2I and printing
da2['(T-Tm)^2I'] = (da2['T-Tm'])*(da2['T-Tm'])*da2['I']
print('The column (T-Tm)^2I is:\n', da2['(T-Tm)^2I'])
# calculation of sigma: sqrt(sum of (T-Tm)^2I / sum of da2['I']) and printing
sigma = sqrt(da2['(T-Tm)^2I'].sum()/da2['I'].sum())
print('Standard deviation (sigma) is:', sigma)
Output
mean (Tm) is: 52.2642
The column T-Tm is:
0   -32.264151
1   -22.264151
2   -12.264151
3    -2.264151
4     7.735849
5    17.735849
6    27.735849
Name: T-Tm, dtype: float64
The column (T-Tm)^2I is:
0    1040.975436
1    2478.462086
2    1504.093984
3      76.895692
4     658.276967
5    2831.043076
6    1538.554646
Name: (T-Tm)^2I, dtype: float64
Standard deviation (sigma) is: 13.8238
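As a cross-check, frequency data can be expanded into individual observations with np.repeat, after which the ordinary np.mean and np.std apply. This is a sketch of an alternative route, not the method used in the script above, and it only works when the intensities are whole-number counts. The array names are illustrative.

```python
import numpy as np

theta = np.array([20, 30, 40, 50, 60, 70, 80])
counts = np.array([1, 5, 10, 15, 11, 9, 2])

# Repeat each angle as many times as it was counted
observations = np.repeat(theta, counts)

mean_expanded = np.mean(observations)
std_expanded = np.std(observations)  # population form, divides by N

print(observations.size)        # 53
print(round(mean_expanded, 2))  # 52.26
print(round(std_expanded, 2))   # 13.82
```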
Calculation of standard deviation for grouped data
Problem: In a class of students, 9 students scored 50 to 60, 7 students scored 61 to 70,
9 students scored 71 to 85, 12 students scored 86 to 95 and 8 students scored 96 to 100
in the subject of mathematics. Estimate the standard deviation.
Solution: The variation of the number of students with their score is summarized in the
following table.
Score (M) No. of students (S)
50-60 9
61-70 7
71-85 9
86-95 12
96-100 8
We will estimate the standard deviation using the following steps.
Step 1: find the mid-point (Md) of each group or range of scores.
Step 2: calculate the number of samples in the data set by summing the numbers of students (sum of S).
Step 3: find the mean of the grouped data (Mm) by dividing the sum of the products of each
group mid-point and its number of students by the number of samples.
Step 4: calculate the variance by dividing the sum of (Md - Mm)^2 S over all groups by the number of samples.
Step 5: estimate the standard deviation for the frequency table by taking the square root of the variance:
$\sigma = \sqrt{\frac{\sum (M_d - M_m)^2 S}{\sum S}}$
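The steps above can be sketched directly with NumPy arrays. The class limits and student counts are taken from the problem; the array and variable names are illustrative.

```python
import numpy as np

lower = np.array([50, 61, 71, 86, 96], dtype=float)   # lower limit of each group
upper = np.array([60, 70, 85, 95, 100], dtype=float)  # upper limit of each group
students = np.array([9, 7, 9, 12, 8], dtype=float)    # S, number of students

midpoints = (lower + upper) / 2                        # Md
n_samples = students.sum()                             # sum of S
mean = (midpoints * students).sum() / n_samples        # Mm
variance = ((midpoints - mean) ** 2 * students).sum() / n_samples
sigma = np.sqrt(variance)

print(midpoints.tolist())               # [55.0, 65.5, 78.0, 90.5, 98.0]
print(round(mean, 2), round(sigma, 2))  # 78.34 15.58
```

Because every student in a group is represented by the group mid-point, the result is an estimate rather than the exact standard deviation of the individual scores.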
Table for calculating the various terms
M        S     Md     Md S     Md - Mm   (Md - Mm)^2 S
50-60    9     55.0   495.0    -23.34    4904.67
61-70    7     65.5   458.5    -12.84    1154.86
71-85    9     78.0   702.0    -0.34     1.07
86-95    12    90.5   1086.0   12.16     1773.09
96-100   8     98.0   784.0    19.66     3090.73
Sum of S = 45, sum of Md S = 3525.5, sum of (Md - Mm)^2 S = 10924.41
$M_m = \frac{\sum M_d S}{\sum S} = \frac{3525.5}{45} \approx 78.34$
$\sigma = \sqrt{\frac{\sum (M_d - M_m)^2 S}{\sum S}} = \sqrt{\frac{10924.41}{45}} = \sqrt{242.76} \approx 15.58$
Calculation of mean and standard deviation of the same data using Python
Let us first display and visualize the data.
Python script for displaying the data file
# reading and printing the file
da3 = open("D:\\PWC\\data_analysis\\set3.csv", "r")
print(da3.read())
Output
M,S
50-60,9
61-70,7
71-85,9
86-95,12
96-100,8
Python script for displaying the data using a scatter plot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load the data or importing a data file
da3 = pd.read_csv("D:\\PWC\\data_analysis\\set3.csv")
# Creating the bar plot (you can use a bar plot instead)
#plt.bar(da3['M'], da3['S'], color='blue', width=1)
# Creating the scatter plot
plt.scatter(da3['M'], da3['S'], color='blue')
# Labeling the X and Y axis
plt.xlabel("Score obtained by student")
plt.ylabel("Number of students")
plt.title("Variation of number of students with their score")
plt.show()
Calculation of the different components shown in the previous calculation table, and
calculation of the mean and standard deviation using Python
Python script
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
# Load the data or importing a data file
da3 = pd.read_csv("D:\\PWC\\data_analysis\\set3.csv")
# creating the column corresponding to the mid-point of each range and printing
# the text before '-' is the lower limit (L), the text after it is the upper limit (U)
da3[['L','U']] = da3['M'].str.split('-', expand=True)
da3['Md'] = (da3['L'].astype(float) + da3['U'].astype(float))/2
print('The column Md is:\n', da3['Md'])
# calculation of multiplication of score (Md) and students (S): Md S
da3['MdS'] = da3['Md']*da3['S']
print('The column MdS is:\n', da3['MdS'])
# calculation of mean Mm: sum of da3['MdS']/sum of da3['S'] and printing
Mm = da3['MdS'].sum()/da3['S'].sum()
print('mean (Mm) is:', Mm)
# calculation of Md-Mm and printing
da3['Md-Mm'] = da3['Md']-Mm
print('The column Md-Mm is:\n', da3['Md-Mm'])
# calculation of (Md-Mm)^2S and printing
da3['(Md-Mm)^2S'] = (da3['Md-Mm'])*(da3['Md-Mm'])*da3['S']
print('The column (Md-Mm)^2S is:\n', da3['(Md-Mm)^2S'])
# calculation of sigma: sqrt(sum of (Md-Mm)^2S / sum of da3['S']) and printing
sigma = sqrt(da3['(Md-Mm)^2S'].sum()/da3['S'].sum())
print('Standard deviation (sigma) is:', sigma)
Output
The column Md is:
0    55.0
1    65.5
2    78.0
3    90.5
4    98.0
Name: Md, dtype: float64
The column MdS is:
0     495.0
1     458.5
2     702.0
3    1086.0
4     784.0
Name: MdS, dtype: float64
mean (Mm) is: 78.3444
The column Md-Mm is:
0   -23.344444
1   -12.844444
2    -0.344444
3    12.155556
4    19.655556
Name: Md-Mm, dtype: float64
The column (Md-Mm)^2S is:
0    4904.667778
1    1154.858272
2       1.067778
3    1773.090370
4    3090.726914
Name: (Md-Mm)^2S, dtype: float64
Standard deviation (sigma) is: 15.5809
Assignments
Question 1: The distribution of marks obtained by M.Sc. students is given below.
40, 65, 45, 50, 80, 55, 76, 72, 62, 82, 59, 51, 61. Find the mean, median, variance and
standard deviation of this distribution.
Question 2: Estimate the standard deviation for the single-slit pattern data given
below.
Theta (in degree) Intensity (in counts)
-50 1
-30 5
-10 10
0 15
10 11
30 9
50 2
Question 3: In a village, 200 people are in the age group (in years) 20 to 30, 300 people
are in the age group 31 to 40, 600 people are in the age group 41 to 60, and only 100
people are in the age group 61 to 90. Estimate the standard deviation.