Data Science Cs3362 Lab Record
LABORATORY RECORD
Register No……………………………………………………………
Department……………………………………………………………
Submitted to the Anna University, Chennai B.E. Examination for ………….. Semester,
conducted on ………………..
Sl. No    DATE    NAME OF THE EXPERIMENT    PAGE NO    MARKS (20)    Signature
AIM:
To learn how to download and install the different packages of NumPy, SciPy,
Jupyter, Statsmodels and Pandas.
ALGORITHM:
Python Installation
Jupyter Installation
● Open a command prompt and enter “python --version” to check whether Python was installed properly.
● If the installation is correct, the command returns the Python version.
● Enter “pip --version” to check whether the Python package manager was installed properly.
● If the installation is correct, the command returns the version of the package manager.
● Enter the command “pip install jupyterlab”.
● Enter the command “pip install jupyter notebook”.
● If pip reports that a newer version of pip is available, copy the upgrade command it prints, paste it and run it.
● Create a folder and give it a suitable name.
● Open a command prompt and change into that folder. Enter “jupyter notebook” and press Enter.
● A new Jupyter Notebook session now opens for use.
pip Installation
Installation of NumPy
● pip install numpy
Installation of SciPy
● pip install scipy
Installation of Statsmodels
● pip install statsmodels
Installation of Pandas
● pip install pandas
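The installation can also be checked from Python itself. The following is a minimal sketch that uses only the standard library; the package list and the helper name installed_versions are assumptions made for this illustration, not part of the prescribed procedure.

```python
# Report the installed version of each package named in this experiment.
from importlib import metadata

# Distribution names as used with pip (assumed list for this sketch)
PACKAGES = ["numpy", "scipy", "jupyterlab", "notebook", "statsmodels", "pandas"]

def installed_versions(packages):
    """Map each package name to its version string, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

versions = installed_versions(PACKAGES)
for name, version in versions.items():
    print(f"{name:12s} {version if version else 'NOT INSTALLED'}")
```

A package reported as NOT INSTALLED can be installed with the matching pip command listed above.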
Sample Output
RESULT:
NumPy, SciPy, Jupyter, Statsmodels and Pandas packages were installed
properly and their execution was verified.
WORKING WITH NUMPY ARRAYS
Aim:
To implement array objects using the NumPy module in Python programming.
Algorithm
Step 1: Start the program
Step 2: Import the required packages
Step 3: Read the elements through list/tuple/dictionary
Step 4: Convert List/tuple/dictionary into array using built-in methods.
Step 5: Check the number of dimensions in an array
Step 6: Compute the shape of an array or if it’s required reshape an array
Step 7: Do the required operations like slicing, iterating, searching, concatenating and
splitting an array element.
Step 8: Stop the program
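The steps above can be sketched in one short script. The values are made up for illustration, and only list and tuple input is shown; the variable names are assumptions of this sketch.

```python
import numpy as np

# Steps 3-4: read elements from a list and a tuple, then convert them to arrays
from_list = np.array([1, 2, 3, 4, 5, 6])
from_tuple = np.array((7, 8, 9, 10, 11, 12))

# Step 5: number of dimensions
print(from_list.ndim)

# Step 6: shape and reshape (6 elements -> 2 rows x 3 columns)
matrix = from_list.reshape(2, 3)
print(matrix.shape)

# Step 7: slicing, iterating, searching, concatenating and splitting
middle = from_list[1:4]                            # elements at index 1, 2 and 3
row_sums = [int(row.sum()) for row in matrix]      # iterate over the rows
even_positions = np.where(from_list % 2 == 0)[0]   # indices holding even values
joined = np.concatenate((from_list, from_tuple))   # one array of 12 elements
parts = np.array_split(joined, 3)                  # three arrays of 4 elements

print(middle, row_sums, even_positions, parts)
```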
2-D Arrays
Program
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
Output
[[1 2 3]
 [4 5 6]]
3-D arrays
Program
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
Output
[[[1 2 3]
  [4 5 6]]

 [[1 2 3]
  [4 5 6]]]
Program
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[2] + arr[3])
Output
7
Program
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)
Output
(2, 4)
Program
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)
Output
[array([1, 2]), array([3, 4]), array([5, 6])]
RESULT:
Thus Array objects have been explored using the Numpy module in Python
programming successfully.
WORKING WITH PANDAS DATA FRAMES
Aim:
To work with Data Frame object using Pandas module in Python Programming
Algorithm:
Step 1: Start the program
Step 2: Import the required packages
Step 3: Create a Data Frame using built in methods.
Step 4: Load data into a DataFrame object, or load files (Excel/CSV) into a DataFrame
Step 5: Display the rows and describe the data set using the built-in method
Step 6: Display the last 5 rows of the DataFrame
Step 7: Check the maximum number of returned rows
Step 8: Stop the program
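The steps above can be sketched on a small in-memory DataFrame, so that no CSV file is needed. The column values are made up for illustration.

```python
import pandas as pd

# Step 3: build a DataFrame from a dictionary of columns (made-up values)
df = pd.DataFrame({
    "calories": [420, 380, 390, 410, 400, 395, 405],
    "duration": [50, 40, 45, 55, 48, 42, 47],
})

# Step 5: first rows and summary statistics
first_rows = df.head(3)
summary = df.describe()

# Step 6: last 5 rows (tail() returns 5 rows by default)
last_rows = df.tail()

# Step 7: maximum number of rows pandas prints before truncating the display
max_rows = pd.options.display.max_rows

print(first_rows)
print(summary)
print(last_rows)
print(max_rows)
```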
Note: When using [], the result is a Pandas DataFrame.
(v) Named Indexes
Program
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
Output
      calories  duration
day1       420        50
day2       380        40
day3       390        45
(vi) Locate Named Indexes
print(df.loc["day2"])
Output
calories    380
duration     40
Name: day2, dtype: int64
(vii) Load Files Into a DataFrame
Program
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Output
     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.4
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4
[169 rows x 4 columns]
(viii) Check the number of maximum returned rows:
Program
import pandas as pd
print(pd.options.display.max_rows)
On this system the number is 60, which means that if the DataFrame
contains more than 60 rows, the print(df) statement returns only the
headers and the first and last 5 rows.
Increase the maximum number of rows to display the entire DataFrame:
import pandas as pd
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)
(ix) Viewing the Data
Program
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(4))
Output
   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
Print the last 5 rows of the DataFrame:
print(df.tail())
Print information about the DataFrame:
print(df.info())
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
 0   Duration  169 non-null    int64
 1   Pulse     169 non-null    int64
 2   Maxpulse  169 non-null    int64
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None
RESULT:
Thus Data Frame object using Pandas module in Python Programming has been
successfully explored.
READING DATA FROM IRIS DATA SET AND DOING
DESCRIPTIVE ANALYTICS ON THE IRIS DATA SET
Aim:
To perform descriptive analytics on Iris dataset using Python programming.
Algorithm:
Step 1: Start the program
Step 2: Import the required packages
Step 3: Load files (Excel/CSV/text) into a DataFrame from the Iris data set
Step 4: Display the rows and describe the data set using built-in methods
Step 5: Compare Sepal Length and Sepal Width
Step 6: Visualize the data set using histograms with distplot, heatmaps and box plots
Step 7: Check for missing values and duplicates, and remove outliers
Step 8: Stop the program
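The checks for missing values, duplicates and outliers named in the algorithm can be sketched as follows. The rows are made up (they are not taken from Iris.csv), and the 1.5 * IQR rule is one common way to flag outliers, assumed here for illustration.

```python
import pandas as pd

# Made-up measurements with Iris-style column names (not rows from Iris.csv)
df = pd.DataFrame({
    "SepalWidthCm": [3.0, 3.1, 3.0, 2.9, 3.2, 9.5, None, 3.1],
    "Species": ["setosa", "setosa", "setosa", "virginica",
                "virginica", "virginica", "setosa", "setosa"],
})

# Missing values per column
missing = df.isnull().sum()

# Rows that repeat an earlier row in every column
duplicates = int(df.duplicated().sum())

# Drop missing and duplicated rows, then remove outliers with the 1.5 * IQR rule
clean = df.dropna().drop_duplicates()
q1 = clean["SepalWidthCm"].quantile(0.25)
q3 = clean["SepalWidthCm"].quantile(0.75)
iqr = q3 - q1
inside = clean["SepalWidthCm"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
no_outliers = clean[inside]

print(missing)
print(duplicates)
print(no_outliers)
```

Note that df.duplicated() compares whole rows, whereas drop_duplicates(subset="Species") keeps only the first row of each species.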
Program
import pandas as pd
# Reading the CSV file
df = pd.read_csv("Iris.csv")
# Printing top 5 rows
df.head()
Output:
df.describe()
Output
Checking Duplicates
data = df.drop_duplicates(subset="Species")
data
Output
df.value_counts("Species")
Data Visualization
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='Species', data=df)
plt.show()
Comparing Sepal Length and Sepal Width
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm', hue='Species', data=df)
# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()
Output:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df.drop(['Id'], axis=1), hue='Species', height=2)
Histograms
Program
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(10,10))
axes[0,0].set_title("Sepal Length")
axes[0,0].hist(df['SepalLengthCm'], bins=7)
axes[0,1].set_title("Sepal Width")
axes[0,1].hist(df['SepalWidthCm'], bins=5)
axes[1,0].set_title("Petal Length")
axes[1,0].hist(df['PetalLengthCm'], bins=6)
axes[1,1].set_title("Petal Width")
axes[1,1].hist(df['PetalWidthCm'], bins=6)
Output:
Program
# importing packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv('Iris.csv')
sns.boxplot(x='SepalWidthCm', data=df)
plt.show()
Output:
RESULT:
Thus Iris dataset has been explored and descriptively analysed using Python
programming.
PERFORM UNIVARIATE ANALYSIS ON THE DIABETES DATA SET
AIM:
To use the diabetes data set from UCI and the Pima Indians Diabetes data set for
univariate analysis.
ALGORITHM:
1. Download diabetes data set from UCI and Pima Indians Diabetes data set.
2. Load the above data files into google colab.
3. Perform analysis like Frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis.
Univariate analysis
● The term univariate analysis refers to the analysis of one variable.
● This experiment performs univariate analysis using summary statistics, which measure the center and spread of the values, and a frequency table:
1. Central tendency: mean, median, mode
2. Dispersion: variance, standard deviation, range, interquartile range (IQR)
3. Skewness: asymmetry of the distribution about its mean
4. Kurtosis: heaviness of the tails of the distribution
5. Frequency table: how often different values occur
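The measures of central tendency and dispersion listed above can be tried on a small made-up series before loading the diabetes files. This is a sketch only; the values are not taken from either data set.

```python
import pandas as pd

# Made-up values standing in for one numeric column of a data set
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Central tendency
mean = values.mean()              # arithmetic average
median = values.median()          # middle value of the sorted data
mode = values.mode().tolist()     # most frequent value(s)

# Dispersion
variance = values.var()           # sample variance (divides by n - 1)
std_dev = values.std()            # square root of the sample variance
value_range = values.max() - values.min()
iqr = values.quantile(0.75) - values.quantile(0.25)

print(mean, median, mode)
print(variance, std_dev, value_range, iqr)
```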
File Importing:
# Reading the UCI file
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Printing top 5 rows
df.head()
# Reading the Pima file
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Printing top 5 rows
df.head()
1. Central Tendency
We can use the following syntax to calculate summary statistics such as the mean,
median and mode.
Mean:
It is the average of the given numeric values.
● Mean of UCI data
import pandas as pd
# Reading the UCI file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Mean of UCI data (numeric columns only, since the file also holds text columns such as Time)
df.mean(axis=0, numeric_only=True)
● Mean of Pima data
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Mean of Pima data
df.mean(axis=0)
Median:
It is the middle value of the given values when they are sorted.
● Median of UCI data
import pandas as pd
# Reading the UCI file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Median of UCI data (numeric columns only)
df.median(axis=0, numeric_only=True)
● Median of Pima data
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Median of Pima data
df.median(axis=0)
Mode:
It is the most frequently occurring value of a given variable.
● Mode of UCI data
import pandas as pd
# Reading the UCI file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Mode of UCI data
df.mode(axis=0)
● Mode of Pima data
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Mode of Pima data
df.mode(axis=0)
2. Dispersion
Variance
Variance measures how far the values spread from their mean: it is the average of the
squared deviations from the mean. The pandas var() method computes the sample
variance, which divides by n - 1.
Example
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Variance of the BMI column
df.loc[:, "BMI"].var()
Standard deviation
Standard deviation is a measure of how spread out the numbers are. A large
standard deviation indicates that the data is spread out; a small standard deviation
indicates that the data is clustered closely around the mean.
Example
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Standard deviation of the BMI column
df.loc[:, "BMI"].std()
Range
Range is the simplest measure of dispersion but is limited in its use. It is
calculated by subtracting the smallest value of the data set from the largest value;
in other words, it is the difference between the maximum and minimum values of a
dataset.
Example
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
print("Range is:", df.BloodPressure.max() - df.BloodPressure.min())
Interquartile range
The interquartile range, often denoted “IQR”, is a way to measure the spread of the
middle 50% of a dataset. It is calculated as the difference between the third quartile
(the 75th percentile) and the first quartile (the 25th percentile) of a dataset:
IQR = Q3 - Q1.
Example
# Importing the required libraries
import numpy as np
import pandas as pd
data = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
# Removing the outliers
def removeOutliers(data, col):
    Q3 = np.quantile(data[col], 0.75)
    Q1 = np.quantile(data[col], 0.25)
    IQR = Q3 - Q1
    print("IQR value for column %s is: %s" % (col, IQR))
    global outlier_free_list
    global filtered_data
    # Values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are treated as outliers
    lower_range = Q1 - 1.5 * IQR
    upper_range = Q3 + 1.5 * IQR
    outlier_free_list = [x for x in data[col] if ((x > lower_range) & (x < upper_range))]
    filtered_data = data.loc[data[col].isin(outlier_free_list)]
# Filter column by column, each time starting from the rows kept so far
for i in data.columns:
    if i == data.columns[0]:
        removeOutliers(data, i)
    else:
        removeOutliers(filtered_data, i)
# Assigning filtered data back to our original variable
data = filtered_data
print("Shape of data after outlier removal is: ", data.shape)
3. Skewness
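Skewness measures the asymmetry of a distribution about its mean: it is close to zero for symmetric data, positive when the longer tail lies toward large values and negative when it lies toward small values. A minimal sketch on two made-up series (not taken from the diabetes files):

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])        # mirror image about its mean
right_skewed = pd.Series([1, 1, 2, 2, 3, 4, 15])    # one large value stretches the right tail

skew_symmetric = symmetric.skew()
skew_right = right_skewed.skew()

print(skew_symmetric)   # approximately zero
print(skew_right)       # positive
```

Assuming the column names used elsewhere in this experiment, the same call on the Pima file would be df['BloodPressure'].skew().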
4. Kurtosis
Kurtosis describes the heaviness of the tails of the distribution.
Example
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
df['BloodPressure'].kurtosis()
5. Frequency
Frequency is the number of times a particular value occurs in the data. A frequency
table displays a set of values along with the frequency with which each one appears.
Frequency tables help us see which data values are common and which are
uncommon.
Example
# import packages
import pandas as pd
import numpy as np
# reading csv file
data = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
# one-way frequency table for the Age column ('count' is only the label of the result column)
freq_table = pd.crosstab(data['Age'], 'count')
# frequency table as a proportion of all rows
freq_table = freq_table / len(data)
freq_table
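A one-way frequency table can also be built with value_counts(). The sketch below uses made-up ages so that it runs without the CSV file; the column labels "count" and "proportion" are choices made for this illustration.

```python
import pandas as pd

# Made-up ages standing in for the Age column
ages = pd.Series([21, 22, 21, 30, 22, 21], name="Age")

counts = ages.value_counts().sort_index()                      # absolute frequency
proportions = ages.value_counts(normalize=True).sort_index()   # relative frequency

# Combine both into one table indexed by the distinct ages
freq_table = pd.DataFrame({"count": counts, "proportion": proportions})
print(freq_table)
```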
Sample Output
RESULT
Thus the Univariate analysis on the Diabetes data of UCI and Pima was performed
successfully
PERFORM BIVARIATE ANALYSIS ON THE DIABETES DATA SET
AIM:
To use the UCI and Pima Indians Diabetes data set for Bivariate analysis.
ALGORITHM:
1. Download diabetes data set from UCI and Pima Indians Diabetes data set.
2. Load the above data files into google colab.
3. Perform the various methods of bivariate analysis.
Bivariate analysis
The term bivariate analysis refers to the analysis of two variables. The purpose of
bivariate analysis is to understand the relationship between two variables.
1. Scatterplots
Example
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Rows with a positive diabetes outcome
g1 = data.loc[data.Outcome == 1, :]
# Relation between Pregnancies and Glucose for the diabetic group
g1.plot.scatter('Pregnancies', 'Glucose');
2. Correlation Coefficients
The correlation coefficient is a statistical measure of the strength of the relationship
between the relative movements of two variables. The values range between -1.0 and
1.0. Correlation of -1.0 shows a perfect negative correlation, while a correlation of 1.0
shows a perfect positive correlation. A correlation of 0.0 shows no linear relationship
between the movement of the two variables.
Example
# Import the libraries
import pandas as pd
from scipy.stats import pearsonr
# Import your data into Python
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Select the two columns as Series
list1 = df['BloodPressure']
list2 = df['SkinThickness']
# Apply the pearsonr()
corr, _ = pearsonr(list1, list2)
print('Pearsons correlation: %.3f' % corr)
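To see what the coefficient measures, it can be computed directly from its definition: the sum of the products of the deviations from the means, divided by the square root of the product of the sums of squared deviations. The sketch below uses made-up values; the helper name pearson_r is an assumption of this illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient computed from deviations about the means."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r = pearson_r(x, y)                        # moderate positive relationship
r_negative = pearson_r(x, -2.0 * x + 1.0)  # exact decreasing line

print(round(r, 4))
print(round(r_negative, 4))
```

The hand-computed value can be compared with np.corrcoef(x, y)[0, 1], which returns the same coefficient.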
3. Simple Linear Regression
Example
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
X = dataset[['Glucose']].values  # predictor: a single column, kept 2-D for scikit-learn
y = dataset['BloodPressure'].values  # response: a different column from the predictor
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Sample Output
RESULT:
Thus the Bivariate analysis on the diabetes data set was executed successfully.
PERFORM MULTIPLE REGRESSION ANALYSIS ON THE DIABETES DATA SET
AIM:
To use UCI and Pima Indians Diabetes data set for Multiple Regression Analysis.
ALGORITHM:
1. Download diabetes data set from UCI and Pima Indians Diabetes data set.
2. Load the above data files into google colab.
3. Perform multiple regression analysis on data sets.
Example
# Pima_diabetes
import pandas
from sklearn import linear_model
df = pandas.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Two predictors (note the double brackets) and one response
X = df[['Pregnancies', 'Glucose']]
y = df['BloodPressure']
regr = linear_model.LinearRegression()
regr.fit(X, y)
# predict the Blood Pressure based on Pregnancies and Glucose level:
predictedBP = regr.predict([[4, 120]])
print(predictedBP)
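# UCI-Diabetes
import pandas
from sklearn import linear_model
df = pandas.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Assumption: Time is stored as "HH:MM" text, so convert it to minutes since midnight
hm = df['Time'].str.split(':', expand=True).astype(int)
df['Time'] = hm[0] * 60 + hm[1]
X = df[['Time', 'Code']]
y = df['Value']
regr = linear_model.LinearRegression()
regr.fit(X, y)
# predict the Value based on Time (13:23) and Code (46):
predictedValue = regr.predict([[13 * 60 + 23, 46]])
print(predictedValue)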
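Multiple regression fits y = b0 + b1*x1 + b2*x2 by ordinary least squares. The sketch below is not the scikit-learn call used above: it solves the same least-squares problem with np.linalg.lstsq on synthetic data whose true coefficients are known, so the fit can be checked. All values and names are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors and a response built from known coefficients plus small noise
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.1, size=200)

# Prepend a column of ones so that the first fitted coefficient is the intercept
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, b1, b2 = coef

# Prediction for a new observation with x1 = 4 and x2 = 6
prediction = intercept + b1 * 4 + b2 * 6

print(coef)
print(prediction)
```

The recovered coefficients should be close to 3.0, 2.0 and -1.5, which is the kind of check that is not possible on real data where the true relationship is unknown.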
Sample Output
RESULT :
Thus the Multiple Regression analysis on the Diabetes data of UCI and Pima was
performed successfully.
APPLY AND EXPLORE NORMAL CURVES & HISTOGRAMS PLOTTING
FUNCTIONS ON UCI-IRIS DATA SETS
AIM:
To apply and explore Normal curves & Histograms plotting functions on UCI-Iris
data sets.
ALGORITHM:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the normal curve and Histograms for Iris data set.
Normal Curves
The normal curve is the graph of the normal (Gaussian) probability density function,
which describes how data values are distributed around their mean. It is the most
widely used distribution in statistics because many real-world measurements are
approximately normally distributed.
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
# import dataset
df = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
# Plot between -20 and 20 with 0.01 steps.
x_axis = np.arange(-20, 20, 0.01)
# Calculating mean and standard deviation of the sepal length
mean = df["sepal.length"].mean()
sd = df.loc[:, "sepal.length"].std()
plt.plot(x_axis, norm.pdf(x_axis, mean, sd))
plt.show()
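Since the experiment also calls for histograms, the following sketch shows how a histogram relates to the normal curve: when the bar heights are scaled as densities, they follow the curve fitted with the sample mean and standard deviation. The sample is randomly generated (it is not the Iris data), and normal_pdf is a helper written for this illustration.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of the normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return np.exp(-0.5 * z ** 2) / (sd * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.8, scale=0.8, size=5000)   # made-up values, not Iris measurements

# Histogram scaled so that the bar areas sum to 1, which makes it comparable to a density
heights, edges = np.histogram(sample, bins=30, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Normal curve fitted with the sample mean and sample standard deviation
curve = normal_pdf(centers, sample.mean(), sample.std(ddof=1))

# Largest gap between a histogram bar and the fitted curve
max_gap = np.abs(heights - curve).max()
print(max_gap)
```

With matplotlib, plt.hist(sample, bins=30, density=True) followed by plt.plot(centers, curve) draws the histogram with the curve on top of it.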
RESULT :
Thus the normal curve and histogram plotting functions were applied to the UCI Iris
data set successfully.
DENSITY AND CONTOUR PLOTTING FUNCTIONS ON UCI-IRIS DATA SETS
AIM:
To apply and explore Density & Contour plotting functions on UCI-Iris data sets.
ALGORITHM:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the density and contour plotting for Iris data sets.
Density Plotting
A density plot is a data visualization that acts as a smoothed version of the
histogram. It uses kernel smoothing, so the result is a continuous curve inferred
from the data. Density plots are built with Kernel Density Estimation (KDE), a method
for estimating the probability density function of a variable, which is why they are
also known as KDE plots. The region of the plot with the highest peak is the region
where the most data points lie.
Contour plotting
Contour plots, also called level plots, are a tool for multivariate analysis and for
visualizing 3-D surfaces in 2-D space. If X and Y are the variables to plot, the
response Z is drawn as slices on the X-Y plane, which is why contours are sometimes
referred to as Z-slices or iso-response values. Contour plots are widely used to
visualize density and terrain altitude, and are common in meteorology.
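Both ideas can be sketched with NumPy alone: a kernel density estimate places a small Gaussian bump on every data point and averages the bumps, and a contour plot draws the level lines of the resulting 2-D density. The data below are randomly generated stand-ins for two Iris columns, the bandwidth rule std * n^(-1/5) is an assumption of this sketch, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kde_1d(data, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each data point."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(data) * bandwidth * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
length = rng.normal(5.8, 0.8, size=150)   # made-up stand-in for sepal length
width = rng.normal(3.0, 0.4, size=150)    # made-up stand-in for sepal width

# Rule-of-thumb bandwidths (an assumption of this sketch)
h_len = length.std(ddof=1) * len(length) ** (-1 / 5)
h_wid = width.std(ddof=1) * len(width) ** (-1 / 5)

len_grid = np.linspace(length.min() - 3 * h_len, length.max() + 3 * h_len, 200)
wid_grid = np.linspace(width.min() - 3 * h_wid, width.max() + 3 * h_wid, 200)

# 1-D density curve for the first variable
density_len = gaussian_kde_1d(length, len_grid, h_len)

# 2-D density from a product kernel: Z[i, j] is the density at (len_grid[j], wid_grid[i])
k_len = np.exp(-0.5 * ((len_grid[:, None] - length[None, :]) / h_len) ** 2)
k_wid = np.exp(-0.5 * ((wid_grid[:, None] - width[None, :]) / h_wid) ** 2)
Z = (k_wid @ k_len.T) / (len(length) * 2 * np.pi * h_len * h_wid)

def plot_density_and_contour():
    """Draw the density curve and the contour plot side by side and return the figure."""
    import matplotlib.pyplot as plt
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(len_grid, density_len)
    ax1.set_title("Density plot")
    ax2.contour(len_grid, wid_grid, Z)
    ax2.set_title("Contour plot")
    return fig
```

Calling plot_density_and_contour() in a notebook draws both plots; with real data, the two generated arrays would be replaced by the corresponding DataFrame columns.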
Example
Sample Output
RESULT
Thus the density and contour plotting functions were applied to the UCI Iris data set
successfully.
CORRELATION AND SCATTER PLOTTING FUNCTIONS ON UCI DATA SETS
AIM:
To apply correlation & Scatter plotting functions on UCI-Iris data sets.
ALGORITHM:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the correlation and scatter plotting for Iris data sets.
Example
# Correlation Matrix Plot
import matplotlib.pyplot as plt
import pandas
import numpy
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
correlations = data.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()
Scatter Plotting
A scatterplot shows the relationship between two variables as dots in two
dimensions, one axis for each attribute. You can create a scatterplot for each pair of
attributes in your data. Drawing all these scatterplots together is called a scatterplot
matrix. Scatter plots are useful for spotting structured relationships between
variables, like whether you could summarize the relationship between two variables
with a line. Attributes with structured relationships may also be correlated and good
candidates for removal from your dataset.
Example
# Scatterplot Matrix
import matplotlib.pyplot as plt
import pandas
from pandas.plotting import scatter_matrix
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
scatter_matrix(data)
plt.show()
Sample Output
RESULT :
Thus the correlation and scatter plotting functions were applied to the data set
successfully.
VISUALIZING GEOGRAPHIC DATA WITH BASEMAP
Aim:
To visualize Geographic Data using BaseMap module in Python Programming.
Algorithm:
Step 1: Start the program
Step 2: Import the required packages
Step 3: Visualize Geographic Data with Basemap
Step 4: Display the base map using a built-in method such as Basemap,
along with latitude and longitude parameters
Step 5: Display the coastlines and country boundaries
using built-in methods
Step 6: Fill the coastlines and country boundaries with
suitable colours
Step 7: Create a global map with a Cylindrical Equidistant
Projection, Orthographic Projection and Robinson Projection
Step 8: Stop the program
Create a global map with an Orthographic Projection
Program
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);
Output
Program
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None, width=8E6, height=8E6, lat_0=45, lon_0=-100)
m.etopo(scale=0.5, alpha=0.5)
# Map (long, lat) to (x, y) for plotting; (-122.3, 47.6) is Seattle
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);
Output
Create a global map with Coastlines
Program
fig = plt.figure(figsize = (12,12))
m= Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='dashed', color='red')
plt.title("Coastlines", fontsize=20)
plt.show()
Output
180,urcrnrlon=180)
m.drawcoastlines()
m.fillcontinents(color='tan', lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k')
m.drawmapboundary(fill_color='lightblue')
plt.title(" Cylindrical Equidistant Projection", fontsize=20)
Output
RESULT :
Thus Geographic Data has been visualized using the Basemap module in
Python programming successfully.