CS3361 - Data Science
ARJUN COLLEGE OF TECHNOLOGY
(Approved by AICTE, New Delhi & Affiliated to Anna University, Chennai)
Coimbatore to Pollachi Highway, Thamaraikulam post, Coimbatore - 642120
NAME :
DEGREE / BRANCH :
UNIVERSITY REG NO :
YEAR : II YEAR
REGULATION : R 2021
ARJUN COLLEGE OF TECHNOLOGY
(Approved by AICTE, New Delhi & Affiliated to Anna University, Chennai)
Coimbatore to Pollachi Highway, Thamaraikulam post, Coimbatore - 642120
BONAFIDE CERTIFICATE
Year : ………………………………………………………………
Semester : ……………………………………………………………....
Branch : ……………………………………………………...……….
Certified that this is a bonafide record of work done by the above student in the
“CS3352 - Foundations of Data Science” during the Academic year 2024 – 2025 ODD /
EVEN semester.
4 Reading data from text files, Excel and the web and exploring
various commands for doing descriptive analytics on the Iris
data set.
5 Use the diabetes data set from UCI and Pima Indians
Diabetes data set for performing the following.
a. Univariate analysis: Frequency, Mean, Median, Mode,
Variance, Standard Deviation, Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling.
c. Multiple Regression analysis.
d. Also compare the results of the above analysis for the two
data sets.
6 Apply and explore various plotting functions on UCI data sets.
a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three-dimensional plotting.
Anaconda:
Anaconda is an open-source distribution of Python for scientific computing. It bundles the
conda package manager, which makes installing packages such as NumPy, SciPy, Pandas and
Jupyter straightforward.
Jupyter NoteBook:
The Jupyter Notebook is an open-source web application that allows you to create and
share documents that contain live code, equations, visualizations, and narrative text. Its uses
include data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more.
NumPy:
NumPy is a Python library used for working with arrays. It also has functions for
working in the domains of linear algebra, Fourier transforms, and matrices.
SciPy:
SciPy is a scientific computation library that uses NumPy underneath. SciPy stands
for Scientific Python. It provides more utility functions for optimization, statistics and signal
processing. Like NumPy, SciPy is open source, so we can use it freely.
NumPy, which stands for Numerical Python, is used for the manipulation of elements of numerical
array data. SciPy, which stands for Scientific Python, is used for numerical computations in Python.
Both packages provide extended functionality to work with Python.
Statsmodels :
Statsmodels is a popular library in Python that enables us to estimate and analyze various
statistical models. It is built on numeric and scientific libraries like NumPy and SciPy. It
includes various models of linear regression, such as ordinary least squares, generalized least
squares and weighted least squares.
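As a quick illustration (not part of the original record), a minimal ordinary least squares fit with statsmodels might look like the following; the data here is synthetic:
import numpy as np
import statsmodels.api as sm

x = np.arange(10)
X = sm.add_constant(x)                # add an intercept column
y = 3 * x + 2 + np.random.randn(10)   # synthetic data (assumption)
model = sm.OLS(y, X).fit()            # ordinary least squares
print(model.params)                   # estimated intercept and slope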
Pandas
Pandas is really powerful. It provides a huge set of important commands and
features that are used to easily analyze your data. We can use Pandas to perform various tasks
like filtering your data according to certain conditions, or segmenting and segregating the data
according to preference.
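For example, a minimal filtering sketch (the data here is illustrative, not from the exercises):
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C'], 'score': [35, 80, 65]})
passed = df[df['score'] >= 50]   # keep only rows meeting a condition
print(passed)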
1. Type “Anaconda Download” in Google Chrome.
7. Click “New”.
8. Click the Run button to get the output.
9. Sample programs
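For instance, a small sample program (a sketch; version numbers will vary by machine) that confirms the installed packages import correctly:
import numpy, scipy, pandas, statsmodels

print("NumPy:", numpy.__version__)
print("SciPy:", scipy.__version__)
print("Pandas:", pandas.__version__)
print("Statsmodels:", statsmodels.__version__)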
Result :
Thus the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages are downloaded and
installed successfully.
Ex.No:2 Working with Numpy arrays
AIM:
To work with Numpy array using Jupyter Notebook.
Program 1:
# Python program to demonstrate basic array characteristics
import numpy as np
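A minimal body for this program (an assumed 2x3 integer array, consistent with the output shown below):
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Array is of type:", type(arr))
print("No. of dimensions:", arr.ndim)
print("Shape of array:", arr.shape)
print("Size of array:", arr.size)
print("Array stores elements of type:", arr.dtype)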
Output:
Array is of type: <class 'numpy.ndarray'>
No. of dimensions: 2
Shape of array: (2, 3)
Size of array: 6
Array stores elements of type: int64
PROGRAM 2:
# Python program to demonstrate array creation techniques
import numpy as np
# A random 2x2 array (matches the output below)
b = np.random.random((2, 2))
print ("A random array:\n", b)
# Reshaping a 3x4 array into a 2x2x3 array
arr = np.array([[1, 2, 3, 4], [5, 2, 4, 2], [1, 2, 0, 1]])
print ("Original array:\n", arr)
newarr = arr.reshape(2, 2, 3)
print ("Reshaped array:\n", newarr)
# Flatten array
arr = np.array([[1, 2, 3], [4, 5, 6]])
print ("Original array:\n", arr)
flarr = arr.flatten()
print ("Flattened array:\n", flarr)
A random array:
[[ 0.46829566 0.67079389]
[ 0.09079849 0.95410464]]
Original array:
[[1 2 3 4]
[5 2 4 2]
[1 2 0 1]]
Reshaped array:
[[[1 2 3]
[4 5 2]]
[[4 2 1]
[2 0 1]]]
Original array:
[[1 2 3]
[4 5 6]]
Flattened array:
[1 2 3 4 5 6]
PROGRAM3:
# Python program to demonstrate
# indexing in numpy
import numpy as np
# An exemplar array
arr = np.array([[-1, 2, 0, 4],
[4, -0.5, 6, 0],
[2.6, 0, 7, 8],
[3, -7, 4, 2.0]])
# Slicing array
temp = arr[:2, ::2]
print ("Array with first 2 rows and alternate"
"columns(0 and 2):\n", temp)
OUTPUT:
Array with first 2 rows and alternate columns (0 and 2):
 [[-1.  0.]
 [ 4.  6.]]
PROGRAM 4:
# Python program to demonstrate basic operations
import numpy as np
a = np.array([1, 2, 5, 3])
# transpose of array
a = np.array([[1, 2, 3], [3, 4, 5], [9, 6, 0]])
print ("Original array:\n", a)
print ("Transpose of array:\n", a.T)
Original array:
[[1 2 3]
[3 4 5]
[9 6 0]]
Transpose of array:
[[1 3 9]
[2 4 6]
[3 5 0]]
a = np.array([[1, 4, 2],
              [3, 4, 6],
              [0, -1, 5]])
# sorted (flattened) array
print ("Array elements in sorted order:\n", np.sort(a, axis = None))
# sort each row
print ("Row-wise sorted array:\n", np.sort(a, axis = 1))
# sort each column with merge sort
print ("Column wise sort by applying merge-sort:\n", np.sort(a, axis = 0, kind = 'mergesort'))
# Creating a structured array and sorting it by name
# (example records; any values with a 'name' field work the same way)
dtypes = [('name', 'S10'), ('grad_year', int), ('cgpa', float)]
values = [('Hrithik', 2009, 8.5), ('Ajay', 2008, 8.7), ('Pankaj', 2008, 7.9)]
arr = np.array(values, dtype = dtypes)
print ("\nArray sorted by names:\n", np.sort(arr, order = 'name'))
Output:
Array elements in sorted order:
[-1 0 1 2 3 4 4 5 6]
Row-wise sorted array:
[[ 1 2 4]
[ 3 4 6]
[-1 0 5]]
Column wise sort by applying merge-sort:
[[ 0 -1 2]
[ 1 4 5]
[ 3 4 6]]
RESULT:
Thus the programs using NumPy were executed successfully.
Ex.No:3 Working with Pandas Data Frames
AIM:
To work with Pandas DataFrames using Jupyter Notebook.
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was
created by Wes McKinney in 2008.
Pandas allows us to analyze big data and make conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
import pandas

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
Create Labels
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])  # example labels (assumption)
print(myvar)
import pandas as pd
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
A simple way to store big data sets is to use CSV files (comma separated values).
CSV files contain plain text in a well-known format that can be read by everyone, including
Pandas.
import pandas as pd
df = pd.read_csv(r'C:\Users\New\Desktop\AD8302\data.csv')
print(df.to_string())
RESULT:
Thus the programs using DataFrames were executed successfully.
Ex.No:4 Reading data from text files, Excel and the web and
exploring various commands for doing descriptive
analysis on the Iris data set.
AIM:
To read data from text files, Excel and the web, and to explore various commands for doing
descriptive analysis on the Iris data set.
Iris Dataset
The Iris dataset is considered the Hello World of data science. It contains five columns, namely
Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a flowering
plant; researchers have measured various features of the different iris flowers and recorded
them digitally.
import pandas as pd
df = pd.read_csv("iris_csv.csv")
1. Shape: the shape attribute gives the dimensions of the dataframe.
df.shape
Output:
(150, 6)
The dataframe contains 6 columns and 150 rows.
2. Info(): info() lists the columns and their data types.
df.info()
Output:
3. Describe(): The describe() function applies basic statistical computations to the dataset, such as
extreme values, count of data points, standard deviation, etc. Any missing value or NaN value
is automatically skipped. describe() gives a good picture of the distribution of the data.
df.describe()
Output:
2. Checking Duplicates:
The pandas drop_duplicates() method helps in removing duplicates from the data frame.
data = df.drop_duplicates(subset ="class")
Output:
3. Count:
The Series.value_counts() function returns a Series containing counts of
unique values.
df.value_counts("class")
Output:
III. Data Visualization: We will use the Matplotlib and Seaborn libraries for data visualization.
Matplotlib is an easy-to-use and amazing visualization library in Python. It is built on NumPy
arrays, designed to work with the broader SciPy stack, and supports several plot types such as
line, bar, scatter, histogram, etc.
Seaborn is a library mostly used for statistical plotting in Python. It is built on top of
Matplotlib and provides beautiful default styles and color palettes that make statistical plots
more attractive.
Hue: The hue parameter denotes which column decides the colors of the points.
Legend(): A legend() is an area describing the elements of the graph.
Bounding Box: bbox_to_anchor=[x0, y0] creates a bounding box with its lower-left corner at
position [x0, y0]. The legend is then placed inside this box according to
the specified loc parameter.
Loc: The loc attribute in legend() specifies the location of the legend. The default value,
loc="best", picks a location automatically.
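A small sketch of these legend options (assuming the Iris dataframe df loaded above):
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='sepallength', y='sepalwidth', hue='class', data=df)
plt.legend(bbox_to_anchor=[1, 1], loc='upper left')  # place the legend relative to the bounding box
plt.show()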
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x='sepallength', y='sepalwidth', hue='class', data=df)
plt.show()
Output:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x='petallength', y='petalwidth', hue='class', data=df)
plt.show()
Output:
Histograms
Histograms show the distribution of data for the various columns. They can be used for
univariate as well as bivariate analysis.
Example:
import seaborn as sns
import matplotlib.pyplot as plt

# a 2x2 grid of axes (restored; implied by the axes[row, col] indexing below)
fig, axes = plt.subplots(2, 2, figsize=(10, 10))

axes[0,0].set_title("Sepal Length")
axes[0,0].hist(df['sepallength'], bins=7)

axes[0,1].set_title("Sepal Width")
axes[0,1].hist(df['sepalwidth'], bins=5)

axes[1,0].set_title("Petal Length")
axes[1,0].hist(df['petallength'], bins=6)

axes[1,1].set_title("Petal Width")
axes[1,1].hist(df['petalwidth'], bins=6)

plt.show()
OUTPUT:
Handling Correlation
Pandas dataframe.corr() is used to find the pairwise correlation of all columns in the
dataframe. Any NA values are automatically excluded, and non-numeric columns in the
dataframe are ignored.
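A minimal call (the numeric_only keyword is an assumption for newer pandas versions, which no longer skip non-numeric columns silently):
df.corr(numeric_only=True)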
OUTPUT:
Box Plots
We can use boxplots to see how a categorical value is distributed against other numerical
values.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
def graph(y):
    sns.boxplot(x="class", y=y, data=df)

plt.figure(figsize=(10,10))

plt.subplot(221)
graph('sepallength')   # restored; implied by the 2x2 subplot pattern below

plt.subplot(222)
graph('sepalwidth')

plt.subplot(223)
graph('petallength')

plt.subplot(224)
graph('petalwidth')

plt.show()
Handling Outliers
An outlier is a data item/object that deviates significantly from the rest of the (so-called
normal) objects. Outliers can be caused by measurement or execution errors. The analysis for outlier
detection is referred to as outlier mining. There are many ways to detect outliers, and the
removal process is the same as removing any data item from a pandas dataframe.
Let's consider the iris dataset and plot the boxplot for the SepalWidthCm column.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x='sepalwidth', data=df)
OUTPUT:
Removing Outliers
For removing the outlier, one must follow the same process of removing an entry from the
dataset using its exact position in the dataset because in all the above methods of detecting the
outliers end result is the list of all those data items that satisfy the outlier definition according
to the method used.
Example: We will detect the outliers using IQR and then we will remove them. We will also
draw the boxplot to see if the outliers are removed or not.
import pandas as pd
import seaborn as sns
import numpy as np
# IQR
Q1 = np.percentile(df['sepalwidth'], 25, interpolation = 'midpoint')
Q3 = np.percentile(df['sepalwidth'], 75, interpolation = 'midpoint')
IQR = Q3 - Q1
# Upper bound
upper = np.where(df['sepalwidth'] >= (Q3+1.5*IQR))
# Lower bound
lower = np.where(df['sepalwidth'] <= (Q1-1.5*IQR))
sns.boxplot(x='sepalwidth', data=df)
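The detection above only finds the outlier positions; a sketch of the actual removal step (assuming df is the Iris dataframe loaded earlier):
# keep only rows inside the IQR fences
df = df[(df['sepalwidth'] < Q3 + 1.5*IQR) & (df['sepalwidth'] > Q1 - 1.5*IQR)]

# re-draw the boxplot to confirm the outliers are removed
sns.boxplot(x='sepalwidth', data=df)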
Output:
Various Commands in data frame:
1. pandas.DataFrame
pandas.DataFrame() is used to create a DataFrame in pandas. There are two ways to use this
function (the row-wise form is sketched after the output below). You can form a DataFrame
column-wise by passing a dictionary into the pandas.DataFrame() function. Here, each key is a
column, while the values are the rows:
import pandas
DataFrame = pandas.DataFrame({"A" : [1, 3, 4], "B": [5, 9, 12]})
print(DataFrame)
A B
0 1 5
1 3 9
2 4 12
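The second way is row-wise: pass a list of rows and name the columns separately (the column names here are an assumption):
import pandas
DataFrame = pandas.DataFrame([[1, 5], [3, 9], [4, 12]], columns=["A", "B"])
print(DataFrame)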
import pandas as pd
df = pd.read_csv("iris_csv.csv")
print(df)
We can also compute the central tendencies of each column in a DataFrame using pandas:
df.mean()
df.median()
df.mode()
4. DataFrame.transform
The transform() function applies a given function to a DataFrame or column and returns a
result with the same index, e.g. df['sepallength'].transform(lambda x: x * 10).
5. DataFrame.isnull
This function returns a Boolean DataFrame, flagging every cell that contains a null value as True:
df.isnull().sum()
sepallength 0
sepalwidth 0
petallength 0
petalwidth 0
class 0
dtype: int64
6. DataFrame.info
It returns a summary of the non-missing values for each column:
df.info()
df.describe()
7. DataFrame.loc
loc is used to find the elements at a particular index. To view all items in the third row, for
instance:
data=df.loc[2]
print(data)
8. DataFrame.max, min
The max() and min() functions return the largest and smallest values in each column:
df.min()
df.max()
9. DataFrame.astype
The astype() function changes the data type of a particular column or DataFrame.
DataFrame.astype(str)
10. DataFrame.insert
The insert() function is used to add a new column to a DataFrame. It accepts three keywords:
the column name, a list of its data, and its location, which is a column index.
DataFrame.insert(loc = 2, column = 'C', value = [7, 8, 9])  # example values (assumption)
print(DataFrame)
11. DataFrame.sum
The sum() function in pandas returns the sum of the values in each column, while cumsum()
returns their cumulative (running) sum:
DataFrame.sum()
DataFrame.cumsum()
12. Correlation:
Want to find the correlation between integer or float columns? pandas can help you
achieve that using the corr() function.
DataFrame.corr()
13. DataFrame.add
The add() function is used to add a specific number to each value in a DataFrame. It works
by iterating through the DataFrame and operating on each item.
DataFrame['A'].add(20)
14. DataFrame.sub
Like the addition function, you can also subtract a number from each value in a
DataFrame or specific column:
DataFrame['A'].sub(10)
15. DataFrame.mul
Similarly, mul() multiplies each value in a DataFrame or column by a specific number:
DataFrame['A'].mul(10)
16. DataFrame.div
And div() divides each value by a given number:
DataFrame['A'].div(2)
17. DataFrame.std
Using the std() function, pandas also lets you compute the standard deviation for each
column in a DataFrame. It works by iterating through each column in a dataset and calculating
the standard deviation for each:
DataFrame.std()
18. DataFrame.melt
The melt() function in pandas flips the columns in a DataFrame to individual rows. It's
like exposing the anatomy of a DataFrame. So it lets you view the value assigned to each
column explicitly.
newDataFrame = DataFrame.melt()
print(newDataFrame)
19. DataFrame.pop
This function lets you remove a specified column from a pandas DataFrame. It accepts
an item keyword, returns the popped column, and separates it from the rest of the DataFrame:
DataFrame.pop(item= 'B')
print(DataFrame)
20. DataFrame.dropna
The dropna() function removes all rows containing null values (inplace=True modifies the
DataFrame directly):
DataFrame.dropna(inplace = True)
print(DataFrame)
RESULT:
Thus data was read from files and the descriptive-analytics commands were explored on the
Iris data set successfully.
Ex.No:5 Univariate, Bivariate and Multiple Regression analysis on the diabetes data sets
AIM :
To find Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and
Kurtosis using diabetes dataset.
Read Diabetes data set:
import pandas as pd
df = pd.read_csv('diabetes.csv')
df.head()
df.shape
Output:
(768, 9)
df.dtypes
Output:
Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object
df['Outcome'] = df['Outcome'].astype('bool')
df.dtypes['Outcome']
Output:
dtype('bool')
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies 768 non-null int64
Glucose 768 non-null int64
BloodPressure 768 non-null int64
SkinThickness 768 non-null int64
Insulin 768 non-null int64
BMI 768 non-null float64
DiabetesPedigreeFunction 768 non-null float64
Age 768 non-null int64
Outcome 768 non-null bool
dtypes: bool(1), float64(2), int64(6)
memory usage: 48.9 KB
df.describe().T
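The aim also asks for variance, skewness and kurtosis, which pandas computes directly; a minimal sketch for the Glucose column (any numeric column works the same way):
print("Mean     :", df['Glucose'].mean())
print("Median   :", df['Glucose'].median())
print("Mode     :", df['Glucose'].mode()[0])
print("Variance :", df['Glucose'].var())
print("Std. dev.:", df['Glucose'].std())
print("Skewness :", df['Glucose'].skew())
print("Kurtosis :", df['Glucose'].kurt())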
Pregnancy Proportion:
import numpy as np
preg_proportion = np.array(df['Pregnancies'].value_counts())
preg_month = np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc = np.array(np.round(preg_proportion/sum(preg_proportion),3)*100, dtype=int)
preg = pd.DataFrame({'month':preg_month,
                     'count_of_preg_prop':preg_proportion,
                     'percentage_proportion':preg_proportion_perc})
import seaborn as sns
import matplotlib.pyplot as plt

fig,axes = plt.subplots(nrows=3,ncols=2,dpi=120,figsize=(8,6))

plot00 = sns.countplot(x='Pregnancies',data=df,ax=axes[0][0],color='green')  # x= keyword needed on newer seaborn
axes[0][0].set_title('Count',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count',fontdict={'fontsize':7})
plt.tight_layout()

plot01 = sns.countplot(x='Pregnancies',data=df,hue='Outcome',ax=axes[0][1])
axes[0][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count',fontdict={'fontsize':7})
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot11 = df[df['Outcome']==False]['Pregnancies'].plot.hist(ax=axes[1][1],label='Non-Diab.')
plot11_2 = df[df['Outcome']==True]['Pregnancies'].plot.hist(ax=axes[1][1],label='Diab.')
axes[1][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[1][1].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][1].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plot11.axes.legend(loc=1)
plt.setp(axes[1][1].get_legend().get_texts(), fontsize='6')  # for legend text
plt.setp(axes[1][1].get_legend().get_title(), fontsize='6')  # for legend title
plt.tight_layout()

plot21 = sns.boxplot(x='Outcome',y='Pregnancies',data=df,ax=axes[2][1])
axes[2][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[2][1].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][1].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
plt.tight_layout()
plt.show()
Understanding Distribution
The distribution of Pregnancies in the data is unimodal and skewed to the right, centered at about 1
with most of the data between 0 and 15, a range of roughly 15; outliers are present on the
higher end.
Glucose Variable
df.Glucose.describe()
Output:
count 768.000000
mean 120.894531
std 31.972618
min 0.000000
25% 99.000000
50% 117.000000
75% 140.250000
max 199.000000
Name: Glucose, dtype: float64
#sns.set_style('darkgrid')
fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize=(8,6))

plot00 = sns.distplot(df['Glucose'],ax=axes[0][0],color='green')
#axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot01 = sns.distplot(df[df['Outcome']==False]['Glucose'],ax=axes[0][1],color='green',label='Non Diab.')
sns.distplot(df[df.Outcome==True]['Glucose'],ax=axes[0][1],color='red',label='Diab')
axes[0][1].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
#axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot10 = sns.boxplot(df['Glucose'],ax=axes[1][0],orient='v')
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()

plot11 = sns.boxplot(x='Outcome',y='Glucose',data=df,ax=axes[1][1])
axes[1][1].set_title(r'Numerical Summary (Outcome)',fontdict={'fontsize':8})
axes[1][1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()
plt.show()
Understanding Distribution
The distribution of Glucose level among patients is unimodal and roughly bell shaped, centered
at about 115 with most of the data between 90 and 140, a range of roughly 150; outliers are
present on the lower end (Glucose == 0).
# Re-plotting after excluding the zero readings (a 1x2 grid of axes is assumed here)
fig,axes = plt.subplots(nrows=1,ncols=2,dpi=120,figsize=(8,4))

plot0 = sns.distplot(df[df['Glucose']!=0]['Glucose'],ax=axes[0],color='green')
#axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot1 = sns.boxplot(df[df['Glucose']!=0]['Glucose'],ax=axes[1],orient='v')
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()
Blood Pressure variable
df.BloodPressure.describe()
count 768.000000
mean 69.105469
std 19.355807
min 0.000000
25% 62.000000
50% 72.000000
75% 80.000000
max 122.000000
Name: BloodPressure, dtype: float64
from matplotlib.ticker import FormatStrFormatter   # needed for the axis formatter below
fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize=(8,6))

plot00 = sns.distplot(df['BloodPressure'],ax=axes[0][0],color='green')
axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot01 = sns.distplot(df[df['Outcome']==False]['BloodPressure'],ax=axes[0][1],color='green',label='Non Diab.')
sns.distplot(df[df.Outcome==True]['BloodPressure'],ax=axes[0][1],color='red',label='Diab')
axes[0][1].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][1].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot10 = sns.boxplot(df['BloodPressure'],ax=axes[1][0],orient='v')   # restored; implied by the labels below
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('BP',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()

plot11 = sns.boxplot(x='Outcome',y='BloodPressure',data=df,ax=axes[1][1])
axes[1][1].set_title(r'Numerical Summary (Outcome)',fontdict={'fontsize':8})
axes[1][1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()
plt.show()
Understanding Distribution
The distribution of BloodPressure among patients is unimodal (not bimodal, because
BP == 0 does not make sense and is an outlier) and bell shaped, centered at about 65 with
most of the data between 60 and 90, a range of roughly 100; outliers are present on the
lower end (BP == 0).
5B) Bivariate analysis
import os
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

os.chdir("C:/Users/Administrator/Desktop/DS")
df = pd.read_csv('diabetes.csv')
df.head()

sns.scatterplot(x=df.DiabetesPedigreeFunction, y=df.Glucose)
plt.ylim(0,20000)

sns.scatterplot(x=df.BMI, y=df.Age)
plt.ylim(0,20000)

sns.scatterplot(x=df.BloodPressure, y=df.Glucose)
plt.ylim(0,20000)
plt.figure(figsize=(12,8))
sns.kdeplot(data=df,x=df.Glucose,hue=df.Outcome,fill=True)
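The scatter and density plots above explore the relationships visually; for the linear and logistic regression models the exercise asks for, a minimal statsmodels sketch (the choice of BMI as the predictor is an assumption):
import statsmodels.api as sm

X = sm.add_constant(df[['BMI']])                       # predictor with an intercept term
linear = sm.OLS(df['Glucose'], X).fit()                # linear regression: Glucose ~ BMI
print(linear.summary())

logit = sm.Logit(df['Outcome'].astype(int), X).fit()   # logistic regression: Outcome ~ BMI
print(logit.summary())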
5C) Multiple Regression analysis
df.isnull().values.any()
False
(df.Pregnancies == 0).sum(), (df.Glucose == 0).sum(), (df.BloodPressure == 0).sum(),
(df.SkinThickness == 0).sum(), (df.Insulin == 0).sum(), (df.BMI == 0).sum(),
(df.DiabetesPedigreeFunction == 0).sum(), (df.Age == 0).sum()
## Counting cells with 0 Values for each variable and publishing the counts below
Output:
(111, 5, 35, 227, 374, 11, 0, 0)
drop_Glu=df.index[df.Glucose == 0].tolist()
drop_BP=df.index[df.BloodPressure == 0].tolist()
drop_Skin = df.index[df.SkinThickness==0].tolist()
drop_Ins = df.index[df.Insulin==0].tolist()
drop_BMI = df.index[df.BMI==0].tolist()
c=drop_Glu+drop_BP+drop_Skin+drop_Ins+drop_BMI
dia=df.drop(df.index[c])
dia.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 3 to 765
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
0 Pregnancies 392 non-null int64
1 Glucose 392 non-null int64
2 BloodPressure 392 non-null int64
3 SkinThickness 392 non-null int64
4 Insulin 392 non-null int64
5 BMI 392 non-null float64
6 DiabetesPedigreeFunction 392 non-null float64
7 Age 392 non-null int64
8 Outcome 392 non-null bool
dtypes: bool(1), float64(2), int64(6)
memory usage: 27.9 KB
cor = dia.corr()
sns.heatmap(cor)
OUTPUT:
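A multiple regression sketch on the cleaned data (the predictor set is an illustrative assumption):
import statsmodels.api as sm

X = sm.add_constant(dia[['BMI', 'Age', 'BloodPressure', 'Insulin']])
model = sm.OLS(dia['Glucose'], X).fit()   # Glucose modeled on several predictors at once
print(model.summary())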
RESULT:
Thus the univariate, bivariate and multivariate analyses were performed successfully.
Ex.No:6 Apply and explore various plotting functions on UCI
data sets
AIM:
Apply and explore various plotting functions on UCI data sets.
To load and quickly visualize the Multiple Features Dataset [1] from the UCI repository, which
is available in mvlearn. This dataset can be a good tool for analyzing the effectiveness of
multiview algorithms. It contains 6 views of handwritten digit images, thus allowing for
analysis of multiview algorithms in multiclass or unsupervised tasks.
A. NORMAL CURVES
A probability distribution is a statistical function that describes the likelihood of obtaining the
possible values that a random variable can take. By this, we mean the range of values that a
parameter can take when we randomly pick values from it. If we were asked to pick 1
adult randomly and predict his/her height (assuming gender does not affect height),
there would be no way to know what the height will be. But if we have the distribution of heights
of adults in the city, we can bet on the most probable outcome. A Normal Distribution is also
known as a Gaussian distribution or, famously, the Bell Curve. People use the names
interchangeably; they mean the same thing. It is a continuous probability distribution.
Code:
import numpy as np
import matplotlib.pyplot as plt

# Creating a series of data in the range of 1-50.
x = np.linspace(1, 50, 200)

# Creating a function for the normal (Gaussian) probability density.
def normal_dist(x, mean, sd):
    return (1/(np.sqrt(2*np.pi)*sd)) * np.exp(-0.5*((x-mean)/sd)**2)

# Calculating the mean and standard deviation of the data.
mean = np.mean(x)
sd = np.std(x)

# Applying the function to the data.
pdf = normal_dist(x, mean, sd)

plt.plot(x, pdf, color = 'red')
plt.xlabel('Data points')
plt.ylabel('Probability Density')
plt.show()
B. DENSITY AND CONTOUR PLOTS
Contour plots, also called level plots, are a tool for doing multivariate analysis and visualizing 3-
D plots in 2-D space. If we consider X and Y as the variables we want to plot, then the response
Z is plotted as slices on the X-Y plane, due to which contours are sometimes referred to as Z-
slices or iso-responses.
Contour plots are widely used to visualize density, altitudes, or heights of mountains, as well
as in the meteorological department. Due to such wide usage, matplotlib.pyplot provides the
method contour() to make it easy for us to draw contour plots.
CODE:
import matplotlib.pyplot as plt
import numpy as np

# Feature ranges (example values; an assumption, the originals were truncated)
feature_x = np.arange(0, 50, 2)
feature_y = np.arange(0, 50, 3)

# Creating a 2-D grid of features
[X, Y] = np.meshgrid(feature_x, feature_y)

fig, ax = plt.subplots(1, 1)
Z = np.cos(X / 2) + np.sin(Y / 4)

# plots contour lines
ax.contour(X, Y, Z)
ax.set_title('Contour Plot')
ax.set_xlabel('feature_x')
ax.set_ylabel('feature_y')
plt.show()
C. CORRELATION AND SCATTER PLOTS
Correlation means an association; it is a measure of the extent to which two variables are related.
1. Positive Correlation: When two variables increase together and decrease together, they are
positively correlated. '1' is a perfect positive correlation. For example, demand and
profit are positively correlated: the more the demand for the product, the more the profit.
2. Negative Correlation: When one variable increases as the other variable decreases, and
vice-versa, they are negatively correlated. For example, if the distance between magnets
increases, their attraction decreases, and vice-versa. '-1' is a perfect negative correlation.
3. Zero Correlation (No Correlation): When two variables don't seem to be linked at all. '0'
means no correlation. For example, the amount of tea you drink and your level of intelligence.
CODE:
import pandas as pd
con = pd.read_csv('concrete.csv')
con
list(con.columns)
con.head()
con['cement'] = con['cement'].astype('category')
con.describe(include='category')
import seaborn as sns
sns.scatterplot(x="water", y="coarseagg", data=con);
ax = sns.scatterplot(x="water", y="coarseagg", data=con)
D. HISTOGRAMS:
A histogram is basically used to represent data provided in the form of groups. It is an accurate
method for the graphical representation of numerical data distribution. It is a type of bar plot
where the X-axis represents the bin ranges while the Y-axis gives information about frequency.
CREATING A HISTOGRAM
To create a histogram the first step is to create bins of the ranges, then distribute the whole
range of the values into a series of intervals, and count the values which fall into each of the
intervals. Bins are clearly identified as consecutive, non-overlapping intervals of variables. The
matplotlib.pyplot.hist() function is used to compute and create a histogram of x.
CODE:
import matplotlib.pyplot as plt
import numpy as np

# Creating dataset (example values; an assumption, the original data was lost)
a = np.random.randint(low=0, high=100, size=50)

# Creating histogram
fig, ax = plt.subplots(figsize=(10, 7))
ax.hist(a, bins=10)

plt.show()
CODE:
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(23685752)
N_points = 10000
n_bins = 20

# Creating distribution
x = np.random.randn(N_points)
y = .8 ** x + np.random.randn(10000)

# Creating histograms (side-by-side axes are an assumption)
fig, axs = plt.subplots(1, 2, sharey=True, tight_layout=True)
axs[0].hist(x, bins=n_bins)
axs[1].hist(y, bins=n_bins)

# Show plot
plt.show()
E. THREE-DIMENSIONAL PLOTTING
Matplotlib was introduced with only two-dimensional plotting in mind. Around the time of the
1.0 release, the 3d utilities were developed on top of the 2d ones, and thus we have a
3d implementation of data available today. The 3d plots are enabled by importing the mplot3d
toolkit. Here, we will deal with 3d plots using matplotlib.
Code:
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()

# syntax for 3-D projection
ax = plt.axes(projection ='3d')

# defining axes
z = np.linspace(0, 1, 100)
x = z * np.sin(25 * z)
y = z * np.cos(25 * z)
c = x + y
ax.scatter(x, y, z, c = c)   # color the points by x + y (completing the truncated call)
plt.show()
RESULT:
Thus the various plots are executed and plotted successfully.
Ex.No:7 Visualizing Geographic Data with Basemap
AIM:
To visualize geographic data with Basemap.
One common type of visualization in data science is that of geographic data. Matplotlib's main
tool for this type of visualization is the Basemap toolkit, which is one of several Matplotlib
toolkits that live under the mpl_toolkits namespace. Admittedly, Basemap feels a bit clunky
to use, and often even simple visualizations take much longer to render than you might hope.
More modern solutions such as Leaflet or the Google Maps API may be a better choice for
more intensive map visualizations. Still, Basemap is a useful tool for Python users to have in
their virtual toolbelts. In this section, we'll show several examples of the type of map
visualization that is possible with this toolkit.
Installation of Basemap is straightforward; if you're using conda you can type this and the
package will be downloaded:
conda install basemap
Code:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
            width=8E6, height=8E6,
            lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines()
m.drawcoastlines(linewidth=1.0, linestyle='dashed', color='red')
plt.title("Coastlines", fontsize=20)
plt.show()
import numpy as np
import pandas as pd
import geopandas as gpd                # assumed; gpd is used below but its import was lost
from shapely.geometry import Point    # assumed; the original import line was truncated

fp = "map.shp"                         # placeholder path to a shapefile
map_df_copy = gpd.read_file(fp)
map_df_copy.plot(markersize=5)
RESULT :
Thus Basemap was installed and the geographic data visualization programs were executed
successfully.