CS3362 Data Science Laboratory Manual 2022-23

The document provides information on performing univariate and bivariate analysis on diabetes datasets. For univariate analysis, descriptive statistics like frequency, mean, median, mode, variance, standard deviation, skewness and kurtosis are calculated on the Pima Indian Diabetes dataset. For bivariate analysis, linear and logistic regression models are applied to analyze the relationship between variables and predict outcomes. The algorithms and Python programs for both analyses are included. Relevant packages like Pandas, NumPy, Scikit-learn, and Matplotlib are imported and used.

CS3362

DATA SCIENCE
LABORATORY
MANUAL
2022-23
1) Download, Install and Explore
the Features of NumPy, Jupyter,
SciPy, pandas and statsmodels

Aim:
To download, install and explore the features of
the NumPy, SciPy, Jupyter, statsmodels and pandas
packages.

Procedure:
1. Select the version of Python to install.
2. Download the Python executable installer.
3. Run the executable installer.
4. Verify that Python was installed.
5. Verify that pip was installed.
6. Install all the packages through pip.
7. Install Jupyter Notebook.
Code:
# Open a command prompt and execute the following commands
>> pip install numpy
>> pip install jupyter
>> pip install scipy
>> pip install statsmodels
>> pip install pandas
>> pip install matplotlib
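
The verification steps in the procedure can also be done from Python. The following is a minimal sketch that uses only the standard library; the helper name installed_version and the PACKAGES list are illustrative choices, not part of the manual.

```python
import sys
from importlib import metadata

# Distribution names as installed with the pip commands above
PACKAGES = ["numpy", "jupyter", "scipy", "statsmodels", "pandas", "matplotlib"]

def installed_version(name):
    """Return the installed version string of a pip package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

print("Python", sys.version.split()[0])
for pkg in PACKAGES:
    version = installed_version(pkg)
    print(f"{pkg:12s}", version if version else "not installed")
```

A package reported as "not installed" can be installed with the matching pip command and the script run again.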

Features:
NumPy:
❖ It is open-source software.
❖ NumPy arrays are faster and more compact than
Python lists.
❖ NumPy uses much less memory to store data.
❖ It can be used to perform mathematical operations.
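
The memory claim can be checked directly. The sketch below compares a Python list with a NumPy array holding the same 1000 integers; the sizes printed depend on the Python build, so they are not quoted here.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects, so count both parts
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)
# An array stores the raw 8-byte values in one contiguous buffer
array_bytes = np_arr.nbytes

print("list bytes :", list_bytes)
print("array bytes:", array_bytes)

# Element-wise arithmetic needs no explicit loop with an array
doubled = np_arr * 2
print(doubled[:5])
```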
SciPy:
❖ SciPy contains a variety of sub-packages.
❖ It is used for scientific computation.
❖ It is easy to use and understand, and
computationally powerful.
❖ It operates on NumPy arrays.
Jupyter:
❖ Jupyter notebooks run locally as a web
application.
❖ Python is the default programming language for Jupyter.
❖ Jupyter notebooks support many programming
languages.
Statsmodels:
❖ statsmodels is a Python module that provides
estimation of statistical models.
❖ It is an open-source library.
❖ As its name implies, statsmodels is a Python
library for statistical models.
Pandas:
❖ It is an open-source library.
❖ Fast and efficient DataFrame object.
❖ Reshaping and pivoting of data sets.
❖ Time series functionality.
Result:
Thus all the packages were successfully downloaded,
installed and verified.
2) Working with Numpy arrays

Aim:
To work with NumPy arrays and inspect their properties.
Algorithm 1:
1. Import the numpy packages.
2. Create a NumPy array.
3. Assign the array to a variable.
4. Print the type of the array.
5. Print the array.
Algorithm 2:
1. Start the program.
2. Import the numpy package.
3. Create an array.
4. Print the type of the array.
5. Print the shape of the array.
6. Print the size of the array.
7. Print the type of the elements.
8. Stop the program.
Program:
import numpy as np
arr = np.array([1, 4, 3, 2, 5])
print("The type of array is:", type(arr))
print("The array is:", arr)

Program:
import numpy as np
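# A 2-D array needs rows of equal length; the last value of the
# second row is assumed here because the original row was incomplete
arra = np.array([[1, 2, 3], [4, 3, 2]])
print("Array is of type:", type(arra))
print("Shape of array:", arra.shape)
print("Size of array:", arra.size)
print("No. of dimensions:", arra.ndim)
print("Array stores elements of type:", arra.dtype)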
Result:
Thus the program for working with NumPy arrays was
executed successfully.
3) Working with pandas Data frame

Aim:
To work with DataFrames using the pandas library
in Python.

Algorithm 1:
1. Start the program.
2. Import the pandas package as pd.
3. Store a list of strings in a variable.
4. Create a DataFrame from the list.
5. Print the DataFrame.
6. Stop the program.

Algorithm 2:
1. Start the program.
2. Import the pandas package as pd.
3. Define a dictionary containing the data.
4. Convert the dictionary into a DataFrame.
5. Print the specific columns.
6. Stop the program.
Program:
import pandas as pd
lst = ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter',
       'Saturn', 'Uranus', 'Neptune']
df = pd.DataFrame(lst)
print(df)

Program:
import pandas as pd
data = {
    'Name': ['Virat', 'Johnny Depp', 'Hemsworth', 'Vijay'],
    'Age': [34, 59, 39, 48],
    'Nation': ['India', 'America', 'Australia', 'India'],
    'Profession': ['Cricketer', 'Actor', 'Actor', 'Actor']
}
# Convert the dictionary into a DataFrame
df = pd.DataFrame(data)
print(df)
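
Step 5 of Algorithm 2 asks for specific columns. A minimal sketch of column selection, reusing the same data as the program above:

```python
import pandas as pd

# Same layout as the dictionary in the program above
data = {
    "Name": ["Virat", "Johnny Depp", "Hemsworth", "Vijay"],
    "Age": [34, 59, 39, 48],
    "Nation": ["India", "America", "Australia", "India"],
    "Profession": ["Cricketer", "Actor", "Actor", "Actor"],
}
df = pd.DataFrame(data)

# A list of labels selects a sub-DataFrame
subset = df[["Name", "Profession"]]
print(subset)

# A single label selects one column as a Series
ages = df["Age"]
print("Mean age:", ages.mean())
```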
Result:
Thus the Python program to work with the pandas
library was coded and executed successfully.
4) Descriptive Analytics on the Iris
Data set

Aim:
To read data from text files, Excel and the web,
and to explore various commands for descriptive
analytics on the Iris data set.

Algorithm:
1. Start the program.
2. Import the NumPy, pandas, Matplotlib and Seaborn
packages.
3. Download and import the Iris dataset from the UCI
website.
4. Store the file path in a variable.
5. Read the file with the read_csv method of pandas.
6. Assign a column heading to each column.
7. Replace and simplify the values in the target column.
8. Print the unique values in the target column.
9. Print the first five rows.
10. Print the information about the dataset.
11. Start the analytics with exploratory data
analysis.
12. Analyze the dataset and display the result of
every operation.
13. Using the Seaborn and Matplotlib packages, display
the graphs, pie charts and histograms.
14. Stop the program.

Program:
# Import necessary packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

filepath = "dataset/iris.csv"
# header=None assumes the file has no header row, as in the UCI download
df = pd.read_csv(filepath, header=None)
df.columns = ["sepal_length", "sepal_width",
              "petal_length", "petal_width", "target"]
df["target"] = df["target"].replace({"Iris-setosa": "setosa",
                                     "Iris-versicolor": "versicolor",
                                     "Iris-virginica": "virginica"})
print(df["target"].unique())

# Exploratory Data Analysis
print(df.nunique())                 # number of unique values per column
print(df.head())
df.info()
print(df.describe())
print(df.corr(numeric_only=True))   # skip the text column "target"
print(df["target"].value_counts())

# Graphs and plots
(sns.FacetGrid(df, hue="target", height=6)
    .map(plt.scatter, "sepal_length", "sepal_width")
    .add_legend())
(sns.FacetGrid(df, hue="target", height=6)
    .map(plt.scatter, "petal_length", "petal_width")
    .add_legend())
plt.figure()
plt.hist(df["sepal_length"], bins=25)
# histplot with kde=True replaces the deprecated distplot
(sns.FacetGrid(df, hue="target", height=6)
    .map(sns.histplot, "petal_width", kde=True)
    .add_legend())
(sns.FacetGrid(df, hue="target", height=6)
    .map(sns.histplot, "petal_length", kde=True)
    .add_legend())
plt.figure()
# The original y column was written as "petal"; petal_length is assumed
sns.boxplot(x="target", y="petal_length", data=df)
plt.show()
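
The aim also mentions Excel files and the web, while the program above reads only a local CSV file. The sketch below shows how the same pandas readers could cover all three sources. The function load_table and its kind argument are hypothetical names used only for this illustration; only the text branch is exercised, using two sample rows in the Iris layout.

```python
import io
import pandas as pd

def load_table(source, kind="text", **kwargs):
    """Load a table with the pandas reader that matches the source kind.

    kind is a hypothetical label used only by this sketch.
    """
    if kind in ("text", "web"):
        # read_csv accepts a local path, a file-like object, or an http(s) URL
        return pd.read_csv(source, **kwargs)
    if kind == "excel":
        # read_excel needs an engine such as openpyxl to be installed
        return pd.read_excel(source, **kwargs)
    raise ValueError(f"unknown kind: {kind}")

# Two sample rows in the comma-separated layout of the Iris file
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)
cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
iris = load_table(sample, kind="text", header=None, names=cols)
print(iris)
```

For an Excel workbook or a web address, the call would pass the file path or URL as source with kind="excel" or kind="web".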
Result:
Thus the Python program to explore the Iris dataset
and analyse it with various graphs was coded and
executed successfully.
5a) Using Diabetes Data Set perform
Univariate Analysis

Aim:
Using the Pima Indian Diabetes dataset, compute the
frequency, mean, median, mode, variance, standard
deviation, skewness and kurtosis.

Algorithm:
1. Start the program.
2. Import the numpy and pandas packages.
3. Download and import the Pima Indian Diabetes
dataset from UCI or any other website.
4. Read the file with the read_csv method of pandas.
5. Gather information about the dataset.
6. Find the frequency, mean, median, mode,
variance, standard deviation, skewness and kurtosis
with the corresponding pandas methods.
7. Display the result.
8. Stop the program.
Program:
import numpy as np
import pandas as pd

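# Column names follow the widely used Kaggle diabetes.csv
# (capitalised, e.g. 'Outcome'); adjust them if your file differs
df = pd.read_csv('diabetes.csv')
print(df.head().T)
print(df.shape)
print(df.isnull().values.any())
print(df.dtypes)
df['Outcome'] = df['Outcome'].astype('bool')
df.info()
print(df.describe().T)
print("frequency of each outcome:")
print(df['Outcome'].value_counts())
print("mean values:")
print(df.mean())
print("median values:")
print(df.median())
print("mode values:")
print(df.mode())
print("variance:")
print(df.var())
print("standard deviation:")
print(df.std())
print("skewness:")
print(df.skew(axis=0, skipna=True))
print("kurtosis:")
print(df.kurtosis())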
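
To see what each statistic reports, it helps to run the same pandas methods on a series small enough to check by hand. The values below are made up for illustration and are not taken from the diabetes dataset.

```python
import pandas as pd

# Made-up values, symmetric around 3
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5])

freq = s.value_counts().sort_index()   # how often each value occurs
print(freq)
print("mean    :", s.mean())
print("median  :", s.median())
print("mode    :", s.mode().tolist())
print("variance:", s.var())    # sample variance (divides by n - 1)
print("std dev :", s.std())
print("skewness:", s.skew())   # close to 0 for a symmetric distribution
print("kurtosis:", s.kurt())   # excess kurtosis; a normal distribution gives 0
```

Because the values are symmetric, the mean, median and mode coincide and the skewness is zero up to rounding error.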
Result:
Thus the Python program to perform univariate analysis
on the Pima Indian Diabetes dataset was coded and
executed successfully.
5b) Using Diabetes Dataset perform
Bivariate Analysis

Aim:
Using the Pima Indian Diabetes dataset, build linear
and logistic regression models and analyze their
predictions.

Algorithm:
1. Start the program.
2. Import the pandas, matplotlib, numpy, seaborn and
sklearn packages.
3. Load the diabetes dataset into the df variable.
4. Analyze the diabetes dataset.
5. Start the linear regression modelling.
6. Split the dataset into training and test sets.
7. Using the sklearn LinearRegression() model, train
on the training set.
8. Predict on the test set.
9. Show the prediction score.
10. Show the linear regression graph.
11. Start the logistic regression.
12. Split the dataset into training and test sets.
13. Standardize the x variables of the training and test sets.
14. Create the function for the ROC curve.
15. Train the model with sklearn
LogisticRegression().
16. Show the logistic regression graph.
17. Stop the program.

Program:
# Import necessary packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Column names follow the widely used Kaggle diabetes.csv
# (capitalised, e.g. 'Outcome'); adjust them if your file differs
df = pd.read_csv('diabetes.csv')
print(df.head())
print(df.keys())
print("Shape of the DataFrame:")
print(df.shape)
print("Information about the dataset:")
df.info()

# Modelling
# Import packages and models
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, f1_score, roc_curve, r2_score,
                             mean_squared_error)

# Linear Regression
x = df[['Age']]
y = df['Pregnancies']
sc_x = StandardScaler()
x_std = sc_x.fit_transform(x)   # standardize the predictor
y_std = y                       # the target is left unscaled

x_std_train, x_std_test, y_std_train, y_std_test = train_test_split(
    x_std, y_std, test_size=0.25)

regr = LinearRegression()
regr.fit(x_std_train, y_std_train)
print("Regression score:", regr.score(x_std_test, y_std_test))
y_std_pred = regr.predict(x_std_test)
plt.scatter(x_std_test, y_std_test, color='b')
# Draw the fitted line through the predictions
plt.plot(x_std_test, y_std_pred, color="black", linewidth=3)
plt.show()

# The coefficients
print("Coefficients:\n", regr.coef_)

# Logistic Regression
x = df[['Pregnancies', 'Glucose', 'BMI', 'Age']]
y = df['Outcome']
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], "k--")   # diagonal of a random classifier
    plt.xlabel("False Positive Rate", fontname="monospace",
               fontsize=15, weight="semibold")
    plt.ylabel("True Positive Rate (Recall)", fontname="monospace",
               fontsize=15, weight="semibold")
    plt.title("ROC Curve", fontname="monospace", fontsize=17, weight="bold")
    plt.axis([0, 1, 0, 1])
    plt.show()

models, auc_scores = [], []

log_reg_clf = LogisticRegression(random_state=42, max_iter=500)
log_reg_pred = cross_val_predict(log_reg_clf, x_train, y_train, cv=5)
log_reg_scores = cross_val_predict(log_reg_clf, x_train, y_train, cv=3,
                                   method="decision_function")
log_reg_fpr, log_reg_tpr, _ = roc_curve(y_train, log_reg_scores)
plot_roc_curve(log_reg_fpr, log_reg_tpr)
log_reg_auc = roc_auc_score(y_train, log_reg_scores)
print("roc_score:", log_reg_auc)
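
The ROC score printed at the end has a simple reading: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The NumPy-only sketch below computes that probability directly on made-up labels and scores. The function auc_by_pairs is a hypothetical helper written for this illustration and is not part of sklearn.

```python
import numpy as np

def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

# Made-up labels and classifier scores
y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(y, s))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

A score of 0.5 corresponds to the dashed diagonal in the ROC plot, and 1.0 to a classifier that ranks every positive case above every negative one.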
Result:
Thus the Python program to perform bivariate
analysis on the Pima Indian Diabetes dataset was coded
and executed successfully.
5c) Using Diabetes Dataset Perform
Multiple Regression

Aim:
Using the Pima Indian Diabetes dataset, perform
multiple regression and predict with the model.

Algorithm:
1. Start the program.
2. Import pandas, sklearn and the other essential packages.
3. Load the dataset into a variable.
4. Cleanse the dataset.
5. Analyse the basic information about the
dataset.
6. Split the dataset into the independent variables x
and the dependent variable y.
7. Print the initial shape of x and y, then split them into
training and test sets and print the shape of the training x and y.
8. Using sklearn LinearRegression(), fit the model
on the training set.
9. Print the intercept, coefficients, mean squared
error and R2 score.
10. Create a DataFrame with the actual and predicted
values.
11. Visualize the actual and predicted values with
the Matplotlib package.
12. Stop the Program.

Program:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model

# Column names follow the widely used Kaggle diabetes.csv;
# adjust them if your file differs
df = pd.read_csv('diabetes.csv')
print(df.head())

import math
me_ins = math.floor(df.Insulin.median())
print("Median of Insulin column:", me_ins)

x = df[['Pregnancies', 'Glucose', 'BloodPressure',
        'SkinThickness', 'Insulin', 'BMI',
        'DiabetesPedigreeFunction', 'Age']]
y = df['Outcome']

print("Shape of x and y before train:", x.shape, y.shape)

from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.25, random_state=99)
print("Shape of x and y after train:", train_x.shape, train_y.shape)

le = linear_model.LinearRegression()
le.fit(train_x, train_y)
print("Intercept:", le.intercept_)
print("Coefficients:\n", le.coef_)
y_pred = le.predict(test_x)

from sklearn.metrics import mean_squared_error, r2_score
print("Mean squared error:", mean_squared_error(test_y, y_pred))
print("R2 score:", r2_score(test_y, y_pred))

plt.figure(figsize=(7, 7))
plt.scatter(test_y, y_pred)
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Actual vs. Predicted")
plt.show()

result = pd.DataFrame({'Actual': test_y, 'Predict': y_pred})
print("Actual vs. Predict")
print(result.head(10))
print("Prediction value of any one row:",
      le.predict([[6, 148, 72, 35, 0, 33.6, 0.627, 50]]))
Result:
Thus the Python program to perform multiple
regression on the Pima Indian Diabetes dataset was coded
and executed successfully.
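
As a cross-check on what the intercept, coefficients, mean squared error and R2 score in this experiment mean, the sketch below fits a multiple regression with plain NumPy least squares. The data are made up to follow y = 1 + 2*x1 + 3*x2 exactly, so the fit should recover those numbers. This illustrates the quantities only and makes no claim about how sklearn computes them internally.

```python
import numpy as np

# Made-up data that follows y = 1 + 2*x1 + 3*x2 exactly
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

# Prepend a column of ones so the first fitted weight is the intercept
A = np.column_stack([np.ones(len(X)), X])
weights, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coef = weights[0], weights[1:]

y_pred = A @ weights
mse = np.mean((y - y_pred) ** 2)                    # mean squared error
r2 = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

print("Intercept   :", intercept)
print("Coefficients:", coef)
print("MSE:", mse, "R2:", r2)
```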
5d) Compare the Results of the
Analysis of Two Datasets

Aim:
To compare the results of the analysis of two
data sets.

Algorithm:
1. Start the Program.
2. Install the datacompy package with pip install
datacompy.
3. Import the pandas and datacompy packages.
4. Load the diabetes dataset.
5. Compare the two datasets with the datacompy
package and store the comparison in a variable.
6. Print the report of the comparison.
7. Stop the program.
Program:
import pandas as pd
import datacompy as dc
df1 = pd.read_csv('diabetes.csv')
df2 = pd.read_csv ('diabetes.csv')
compare = dc.Compare(df1, df2, join_columns='Outcome',
                     abs_tol=0.0001, rel_tol=0,
                     df1_name='olddiabetes', df2_name='newdiabetes')
print(compare.report())
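
The program above compares a file with itself, so every value matches. To see what a tolerance such as abs_tol=0.0001 with rel_tol=0 does when values differ, the sketch below applies the same two tolerances with numpy.isclose on two made-up frames. It uses only pandas and NumPy; whether datacompy computes its matches in exactly this way is an assumption, not something stated in this manual.

```python
import numpy as np
import pandas as pd

# Made-up frames standing in for an old and a new version of a dataset
old = pd.DataFrame({"id": [1, 2, 3], "BMI": [33.6, 26.6, 23.3]})
new = pd.DataFrame({"id": [1, 2, 3], "BMI": [33.60005, 26.6, 25.0]})

# Align rows on the key column, then compare the shared column
merged = old.merge(new, on="id", suffixes=("_old", "_new"))
merged["match"] = np.isclose(merged["BMI_old"], merged["BMI_new"],
                             rtol=0, atol=0.0001)

print(merged)
print("Rows that differ:", merged.loc[~merged["match"], "id"].tolist())
```

The first row differs by 0.00005, which is within the absolute tolerance and so counts as a match; the third row differs by 1.7 and does not.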
Result:
Thus the python program to compare the results of
the analysis of two datasets was coded and executed
successfully.
6a) Normal Curves

Aim:
To show normal curves using UCI data sets.

Algorithm:
1. Start the program.
2. Import numpy, matplotlib and the other essential
packages.
3. Create a NumPy series of data in the range
1 to 50.
4. Create a function for the normal curve.
5. Calculate the mean and standard deviation.
6. Apply the function to the data to get the values to plot.
7. Plot the results with the matplotlib package.
8. Stop the Program.
Program:
import numpy as np
import matplotlib.pyplot as plt
x=np.linspace(1,50,200)
def normal_dist(x, mean, sd):
    # Normal density: 1 / (sd * sqrt(2*pi)) * exp(-0.5 * ((x - mean) / sd)**2)
    prob_density = (1 / (sd * np.sqrt(2 * np.pi))) * np.exp(
        -0.5 * ((x - mean) / sd) ** 2)
    return prob_density

mean = np.mean(x)
sd = np.std(x)
pdf = normal_dist(x, mean, sd)
plt.plot(x, pdf, color='red')
plt.xlabel('Data Points')
plt.ylabel('Probability Density')
plt.show()
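
A quick way to check a normal density function is to confirm that the area under the curve is 1. The sketch below does this numerically with a Riemann sum; the mean of 25 and standard deviation of 5 are illustrative values, and normal_pdf is a stand-alone helper written for this check.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Normal density: 1 / (sd * sqrt(2*pi)) * exp(-0.5 * ((x - mean) / sd)**2)."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

mean, sd = 25.0, 5.0                                  # illustrative parameters
x = np.linspace(mean - 8 * sd, mean + 8 * sd, 4001)   # covers +/- 8 standard deviations
dx = x[1] - x[0]
area = np.sum(normal_pdf(x, mean, sd)) * dx           # Riemann-sum approximation

print("peak height:", normal_pdf(mean, mean, sd))     # equals 1 / (sd * sqrt(2*pi))
print("area under curve:", area)
```

If the normalising factor 1 / (sd * sqrt(2*pi)) is wrong, the curve keeps its bell shape but the area no longer comes out close to 1.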
Result:
Thus the Python Program to plot the Normal
Curves was coded and executed Successfully.
6b) Density and Contour Plots

Aim:
To plot density and contour plots using the UCI
datasets.

Algorithm:
1. Start the Program.
2. Import numpy, matplotlib and the other essential
packages.
3. Create the two variables feature_x and
feature_y.
4. Create the NumPy arrays with the np.arange function.
5. Create the two-dimensional grid of features with
the np.meshgrid function.
6. Create the subplots.
7. Set the title, x label and y label of the
contour plot.
8. Draw the contour plot and show it.
9. Stop the program.
Program:
import matplotlib.pyplot as plt
import numpy as np
feature_x = np.arange(0, 50, 2)
feature_y = np.arange(0, 50, 3)
# Build the 2-D grid of coordinates
[x, y] = np.meshgrid(feature_x, feature_y)
fig, ax = plt.subplots(1, 1)
z = np.cos(x / 2) + np.sin(y / 4)
ax.contour(x, y, z)
ax.set_title('Contour Plot')
ax.set_xlabel('feature_x')
ax.set_ylabel('feature_y')
plt.show()
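
The program above draws contours of a formula. For a density plot of sample data, one common approach is to bin the points on a grid first and then contour the binned values. The sketch below builds such a grid with np.histogram2d; the sample points are randomly generated for illustration and do not come from a UCI dataset.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the sketch is repeatable
px = rng.normal(25, 5, size=2000)        # made-up sample points
py = rng.normal(25, 8, size=2000)

# density=True scales the counts so that sum(density * bin_area) == 1
density, x_edges, y_edges = np.histogram2d(px, py, bins=20, density=True)

# Bin centres give the grid coordinates for a contour plot
x_mid = 0.5 * (x_edges[:-1] + x_edges[1:])
y_mid = 0.5 * (y_edges[:-1] + y_edges[1:])
gx, gy = np.meshgrid(x_mid, y_mid)

bin_area = np.diff(x_edges)[:, None] * np.diff(y_edges)[None, :]
print("total probability:", np.sum(density * bin_area))
# To draw it: ax.contourf(gx, gy, density.T), because histogram2d returns
# an array indexed [x_bin, y_bin] while contourf expects [row = y, column = x]
```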
Result:
Thus the Python program to plot the density and
contour plots was coded and executed successfully.
6c) Correlation and Scatter plots.

Aim:
To plot the correlation and scatter plots using the
Concrete data set.

Algorithm:
1. Start the program.
2. Import the pandas and Seaborn packages.
3. Download and import the 'concrete.csv' dataset
from the GitHub website.
4. Read the csv file and load it into a variable.
5. Gather information about the dataset.
6. Change the type of the cement column to category.
7. Create the scatter plot using Seaborn.
8. Create the lmplot using Seaborn.
9. Stop the program.
Program:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

con = pd.read_csv('concrete.csv')
# Lowercase column names are assumed, matching "water" and "coarseagg"
con['cement'] = con['cement'].astype('category')
sns.scatterplot(x="water", y="coarseagg", data=con)
plt.figure()
ax = sns.scatterplot(x="water", y="coarseagg", data=con)
# Labels describe the columns actually plotted
ax.set_title("Coarse Aggregate vs. Water")
ax.set_xlabel("Water")
ax.set_ylabel("Coarse aggregate")
sns.lmplot(x="water", y="coarseagg", data=con)
plt.show()
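
The scatter plot and lmplot show the relationship visually; the correlation itself can be read from a number. The sketch below computes the Pearson correlation with pandas on a few made-up values. The column names are borrowed from the program above, but the numbers are not from concrete.csv.

```python
import pandas as pd

# Made-up values using two column names from the program above
demo = pd.DataFrame({
    "water": [150.0, 160.0, 170.0, 180.0, 190.0],
    "coarseagg": [1050.0, 1020.0, 1000.0, 970.0, 940.0],
})

corr = demo.corr()                          # Pearson correlation matrix
r = demo["water"].corr(demo["coarseagg"])   # single pair of columns
print(corr)
print("r =", round(r, 3))
```

A value of r close to -1 means the points fall near a downward-sloping line, which is what the regression line in an lmplot would show for such data.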
Result:
Thus the Python program to create the correlation and
scatter plots with the Concrete dataset was coded and
executed successfully.
6d) Histograms

Aim:
Create the numpy arrays and plot the histograms
using matplotlib.

Algorithm:

1. Start the program.
2. Import the numpy and matplotlib packages.
3. Create the NumPy array.
4. Create the figure for the histogram using matplotlib.
5. Pass the NumPy array values.
6. Plot the histogram using the hist function.
7. Create a dataset using np.random.seed.
8. Then create the distribution with the x and y
variables.
9. Plot and show the histogram.
10. Stop the program.
Program:
import numpy as np
import matplotlib.pyplot as plt
# The first value is read as 22; the leading "1" in the original
# is taken to be a misprinted bracket
a = np.array([22, 87, 5, 43, 56, 73, 55, 53, 8, 20, 51, 5,
              79, 31, 21])
fig, ax = plt.subplots(figsize=(10, 7))
ax.hist(a, bins=[0, 25, 50, 75, 100])
plt.show()
from matplotlib import colors
from matplotlib.ticker import PercentFormatter
np.random.seed(23685752)
N_points = 10000
n_bins = 20
x = np.random.randn(N_points) + 25
fig, axs = plt.subplots(1, 1, figsize=(10, 7), tight_layout=True)
axs.hist(x, bins=n_bins)
plt.show()
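
A histogram is only a picture of bin counts, and those counts can be computed without plotting. The sketch below uses np.histogram on a small illustrative array with the bin edges 0, 25, 50, 75 and 100.

```python
import numpy as np

# A small illustrative array of values between 0 and 100
a = np.array([22, 87, 5, 43, 56, 73, 55, 53, 8, 20, 51, 5, 79, 31, 21])
edges = [0, 25, 50, 75, 100]

# Every bin is half-open [left, right) except the last, which includes 100
counts, bin_edges = np.histogram(a, bins=edges)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:3d} to {right:3d}: {count}")
```

The four counts are 6, 2, 5 and 2, which add up to the 15 values in the array; these are the bar heights that a histogram of the same array and bins would draw.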
Result:
Thus the Python program to create histograms
using numpy and matplotlib was coded and
executed successfully.
6e) Three Dimensional Plotting.

Aim:
To implement three-dimensional plotting
using the matplotlib package.

Algorithm:
1. Start the Program.
2. Import the numpy and matplotlib packages.
3. Define the values for each of the x, y and z axes.
4. Store the values of each axis in a variable.
5. Pass the variables to the scatter plot function.
6. Show the 3D scatter plot.
7. Stop the Program.
Program:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
fig = plt.figure()
ax = plt.axes(projection='3d')
z = np.linspace(0, 1, 100)
x = 2 * np.sin(25 * z)
y = 2 * np.cos(25 * z)
c = x + y   # colour each point by x + y
ax.scatter(x, y, z, c=c)
ax.set_title('3d Scatter Plot')
plt.show()
Result:
Thus the Python program to plot the three
Dimensional plot is coded and executed Successfully.
7) Visualizing Geographic Data with
Basemap

Aim:
To visualize geographic data with
Basemap.

Algorithm:
1. Start the program.
2. Install the basemap package using pip install
basemap.
3. Import the numpy, matplotlib and basemap packages.
4. Plot the globe with the orthographic projection.
5. Show a particular location by its latitude and longitude.
6. Show the coastlines of the world map.
7. Install and import the shapefile and geopandas
packages.
8. Download the shapefile of the Indian political map from
the GitHub website and plot it.
9. Stop the program.
Program:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
# Globe centred on India, orthographic projection
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None,
            lat_0=20.5937, lon_0=78.9629)
m.bluemarble(scale=0.5)

# Regional map with one marked location
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
            width=8E6, height=8E6, lat_0=8.5562,
            lon_0=77.9710)
m.etopo(scale=0.5, alpha=0.5)
x, y = m(77.9710, 8.5562)   # convert (longitude, latitude) to map coordinates
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, 'Nazareth', fontsize=12)

# World coastlines
fig = plt.figure(figsize=(12, 12))
m = Basemap()
m.drawcoastlines()
m.drawcoastlines(linewidth=1.0, linestyle='dashed',
                 color='red')
plt.title("Coast lines", fontsize=20)
plt.show()

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import shapefile as shp
from shapely.geometry import Point

sns.set_style("whitegrid")
fp = r'india-Polygon.shp'
map_df = gpd.read_file(fp)
map_df.plot()
Result:
Thus the Python program to visualize geographic
data with Basemap was coded and executed
successfully.
