Data Science Cs3362 Lab Record

The document is a laboratory record for the CS3362 Data Science Laboratory course at CSI College of Engineering for the academic year 2024-2025. It includes various experiments related to data science, such as installing Python packages, working with NumPy arrays, and performing descriptive analytics on the Iris dataset using Pandas. Each section outlines the aims, algorithms, and sample programs for practical exercises.

Uploaded by

kiruthikaa0612

CSI COLLEGE OF ENGINEERING

KETTI, THE NILGIRIS-643215

LABORATORY RECORD

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CS3362 DATA SCIENCE LABORATORY


YEAR 2024-2025
CSI COLLEGE OF ENGINEERING

KETTI, THE NILGIRIS- 643 215.

Register No……………………………………………………………
Department……………………………………………………………

This is to certify that Mr./Ms. ………………………………….. with the register
no. ...................................................... has successfully completed the practical course
of ………………………………………....... Laboratory prescribed by Anna
University, Chennai for ……. Semester B.E. Examination for the academic year
2024-2025.

Staff In-Charge Head of the Department

Submitted to the Anna University, Chennai B.E. Examination for ………….. Semester,
conducted on ………………..

Internal Examiner External Examiner


INDEX

Sl. No   NAME OF THE EXPERIMENT

1        To learn how to download and install the different packages of NumPy, SciPy, Jupyter, Statsmodels and Pandas
2        Working with NumPy arrays
3        Working with Pandas data frames
4        Reading data from the Iris data set and doing descriptive analytics on the Iris data set
5        Perform univariate analysis on the diabetes data set
6        Perform bivariate analysis on the diabetes data set
7        Perform multiple regression analysis on the diabetes data set
8        Apply and explore normal curves and histograms plotting functions on UCI-Iris data sets
9        Density and contour plotting functions on UCI-Iris data sets
10       Correlation and scatter plotting functions on UCI data sets
11       Visualizing geographic data with Basemap

DOWNLOAD AND INSTALL THE DIFFERENT PACKAGES LIKE
NUMPY, SCIPY, JUPYTER, STATSMODELS AND PANDAS

AIM:
To learn how to download and install the different packages of NumPy, SciPy,
Jupyter, Statsmodels and Pandas.

ALGORITHM:

1. Download Python and Jupyter.
2. Install Python and Jupyter.
3. Install packages such as NumPy, SciPy, Statsmodels and Pandas.
4. Verify the proper execution of Python and Jupyter.

Python Installation

● Open the official Python website (https://www.python.org/).
● Downloads ==> Windows ==> select a recent release (requires Windows 10 or above).
● Install "python-3.10.6-amd64.exe".

Jupyter Installation

● Open the command prompt and enter "python --version" to check whether Python
was installed properly.
● If the installation is correct, it returns the version of Python.
● Enter "pip --version" to check whether the Python package manager was
installed properly.
● If the installation is correct, it returns the version of the Python package manager.
● Enter the command "pip install jupyterlab".
● Enter the command "pip install notebook".
● If pip reports that a newer version of pip is available, copy the upgrade command
it prints, paste it into the command prompt and run it.
● Create a folder and name it appropriately.
● Open the command prompt, change into that folder, type "jupyter notebook"
and press Enter.
● A new Jupyter notebook session opens for use.

pip Installation

Installation of NumPy
● pip install numpy
Installation of SciPy
● pip install scipy
Installation of Statsmodels
● pip install statsmodels
Installation of Pandas
● pip install pandas
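
Step 4 of the algorithm (verifying the installation) can also be done from Python itself. The following is a minimal sketch and not part of the original record: the helper name `installed_version` is hypothetical, and it relies only on the standard-library `importlib.metadata` module to look up the installed version of each pip package.

```python
from importlib import metadata

# Packages installed in this experiment (names as passed to pip).
PACKAGES = ["numpy", "scipy", "jupyterlab", "notebook", "statsmodels", "pandas"]

def installed_version(name):
    """Return the installed version string of a pip package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

# Build a small report: package name -> version (or None).
report = {name: installed_version(name) for name in PACKAGES}
for name, version in report.items():
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be installed with the matching pip command listed above.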
Sample Output

RESULT:
NumPy, SciPy, Jupyter, Statsmodels and Pandas packages were installed
properly and their execution was verified.

WORKING WITH NUMPY ARRAYS

Aim:
To implement array objects using the NumPy module in Python programming.

Algorithm
Step 1: Start the program
Step 2: Import the required packages
Step 3: Read the elements through list/tuple/dictionary
Step 4: Convert List/tuple/dictionary into array using built-in methods.
Step 5: Check the number of dimensions in an array
Step 6: Compute the shape of an array or if it’s required reshape an array
Step 7: Do the required operations like slicing, iterating, searching, concatenating and
splitting an array element.
Step 8: Stop the program
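
Steps 3 to 6 can be sketched in a few lines before the individual programs. This is only an illustration under stated assumptions: the variable names and sample values are made up, and the dictionary is converted through its values because calling `np.array()` on a dict directly yields a 0-D object array.

```python
import numpy as np

# Illustrative inputs (not taken from the lab record).
marks_list = [10, 20, 30, 40]
marks_tuple = (1.5, 2.5, 3.5)
marks_dict = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6}

from_list = np.array(marks_list)
from_tuple = np.array(marks_tuple)
# Convert the dict's values explicitly; np.array(marks_dict) would be a 0-D object array.
from_dict = np.array(list(marks_dict.values()))

print(from_list.ndim, from_list.shape)   # number of dimensions and shape
print(from_tuple.dtype)                  # floats are stored as float64
reshaped = from_dict.reshape(2, 3)       # 6 elements arranged as 2 rows x 3 columns
print(reshaped)
```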

(i) Create a NumPy ndarray Object


Program
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
Output
[1 2 3 4 5]
<class 'numpy.ndarray'>

(ii) Dimensions in Arrays

0-D Arrays
Program
import numpy as np
arr = np.array(42)
print(arr)
Output
42

1-D Arrays
Program
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Output
[1 2 3 4 5]

2-D Arrays
Program
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
Output
[[1 2 3]
 [4 5 6]]

3-D Arrays
Program
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
Output
[[[1 2 3]
  [4 5 6]]

 [[1 2 3]
  [4 5 6]]]

(iii) Check Number of Dimensions

Program
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
Output
0
1
2
3

(iv) Access Array Elements


Program
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[0])
Output
1

Program
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[2] + arr[3])
Output
7

(v) Slicing Arrays

Program
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5])
Output
[2 3 4 5]

(vi) NumPy Array Shape

Program
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)
Output
(2, 4)

(vii) Reshaping Arrays

Program
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(4, 3)
print(newarr)
Output
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]

(viii) Iterating Arrays

Program
import numpy as np
arr = np.array([1, 2, 3])
for x in arr:
    print(x)
Output
1
2
3

(ix) Joining NumPy Arrays

Program
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
Output
[1 2 3 4 5 6]

(x) Splitting NumPy Arrays

Program
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)
Output
[array([1, 2]), array([3, 4]), array([5, 6])]

(xi) Searching Arrays

Program
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)
Output
(array([3, 5, 6]),)

(xii) Sorting Arrays

Program
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))
Output
[0 1 2 3]

RESULT:
Thus Array objects have been explored using the Numpy module in Python
programming successfully.

WORKING WITH PANDAS DATA FRAMES

Aim:
To work with Data Frame object using Pandas module in Python Programming

Algorithm:
Step 1: Start the program
Step 2: Import the required packages
Step 3: Create a Data Frame using built in methods.
Step 4: Load data into a DataFrame object, or load files (Excel/CSV) into a DataFrame
Step 5: Display the rows and describe the data set using the built-in methods
Step 6: Display the last 5 rows of the DataFrame
Step 7: Check the number of maximum returned rows
Step 8: Stop the program

(i) Create a simple Pandas DataFrame:


Program
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Output
   calories  duration
0       420        50
1       380        40
2       390        45
(ii) Locate Row
Program
print(df.loc[0])
Output
calories    420
duration     50
Name: 0, dtype: int64
Note: This example returns a Pandas Series.
(iv) Use a list of indexes:
Program
print(df.loc[[0, 1]])
Output
   calories  duration
0       420        50
1       380        40
Note: When using [], the result is a Pandas DataFrame.
(v) Named Indexes
Program
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
Output
      calories  duration
day1       420        50
day2       380        40
day3       390        45
(vi) Locate Named Indexes
print(df.loc["day2"])
Output
calories    380
duration     40
Name: day2, dtype: int64
(vii) Load Files Into a DataFrame
Program
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Output
     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.4
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4
[169 rows x 4 columns]
(viii) Check the number of maximum returned rows:
Program
import pandas as pd
print(pd.options.display.max_rows)
On this system the number is 60, which means that if the DataFrame
contains more than 60 rows, the print(df) statement shows only the
headers and the first and last 5 rows. The limit can be raised so that
the whole DataFrame is printed:
import pandas as pd
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)
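
Setting `pd.options.display.max_rows` changes the limit for every later print in the session. As a hedged alternative, the sketch below uses `pd.option_context`, which applies the setting only inside a `with` block and restores the previous value afterwards. The small DataFrame is made up for illustration, since `data.csv` is not assumed to be available here.

```python
import pandas as pd

# Illustrative frame with 100 rows (stand-in for data.csv).
df = pd.DataFrame({"value": range(100)})
before = pd.get_option("display.max_rows")

# With a limit of 10 rows the text form is truncated to the head and the tail.
with pd.option_context("display.max_rows", 10):
    short_text = repr(df)

# None removes the limit, so every row appears in the text form.
with pd.option_context("display.max_rows", None):
    full_text = repr(df)

after = pd.get_option("display.max_rows")
print(len(short_text.splitlines()), len(full_text.splitlines()), before == after)
```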
(ix) Viewing the Data
Program
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(5))
Output
   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
Print the last 5 rows of the DataFrame:
print(df.tail())
Print information about the DataFrame:
print(df.info())
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
 0   Duration  169 non-null    int64
 1   Pulse     169 non-null    int64
 2   Maxpulse  169 non-null    int64
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None

RESULT:
Thus Data Frame object using Pandas module in Python Programming has been
successfully explored.

READING DATA FROM IRIS DATA SET AND DOING
DESCRIPTIVE ANALYTICS ON THE IRIS DATA SET

Aim:
To perform descriptive analytics on Iris dataset using Python programming.

Algorithm:
Step 1: Start the program
Step 2: Import the required packages
Step 3: Load files (Excel/CSV/text) into a DataFrame from the Iris data set
Step 4: Display the rows and describe the data set using built-in methods
Step 5: Compare Petal Length and Petal Width
Step 6: Visualize the data set using histograms with distplot, heatmaps and box plot
methods
Step 7: Check missing values and duplicates, and remove outliers
Step 8: Stop the program
Program
import pandas as pd
# Reading the CSV file
df = pd.read_csv("Iris.csv")
# Printing top 5 rows
df.head()
Output:

Getting Information about the Dataset

df.shape
Output: (150, 6)
df.info()

Output

df.describe()

Checking Missing Values


df.isnull().sum()

Checking Duplicates
# Keeps the first row found for each species
data = df.drop_duplicates(subset="Species")
data
Output

df.value_counts("Species")
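
Note that `drop_duplicates(subset="Species")` keeps one row per species rather than counting repeated rows. A minimal sketch of counting and dropping fully repeated rows is given below; the four rows are made up and only reuse the Iris column names.

```python
import pandas as pd

# Made-up rows using the Iris column names; the second and third rows are identical.
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.9, 6.3],
    "SepalWidthCm": [3.5, 3.0, 3.0, 3.3],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

n_duplicates = int(df.duplicated().sum())   # rows that repeat an earlier row
deduplicated = df.drop_duplicates()         # keeps the first copy of each row
species_counts = df["Species"].value_counts()

print(n_duplicates)
print(deduplicated.shape)
print(species_counts)
```

If the file contains a unique Id column, as the (150, 6) shape suggests, every row differs in that column, so it would have to be dropped before checking for duplicates.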

Data Visualization
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='Species', data=df)
plt.show()

Comparing Sepal Length and Sepal Width
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm', hue='Species', data=df)
# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()

Output:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df.drop(['Id'], axis=1), hue='Species', height=2)
plt.show()

Histograms
Program
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
axes[0,0].set_title("Sepal Length")
axes[0,0].hist(df['SepalLengthCm'], bins=7)
axes[0,1].set_title("Sepal Width")
axes[0,1].hist(df['SepalWidthCm'], bins=5)
axes[1,0].set_title("Petal Length")
axes[1,0].hist(df['PetalLengthCm'], bins=6)
axes[1,1].set_title("Petal Width")
axes[1,1].hist(df['PetalWidthCm'], bins=6)

Output:

Histograms with Distplot
Program
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
# Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot and
# sns.kdeplot are its replacements.
plot = sns.FacetGrid(df, hue="Species")
plot.map(sns.distplot, "SepalLengthCm").add_legend()
plot = sns.FacetGrid(df, hue="Species")
plot.map(sns.distplot, "SepalWidthCm").add_legend()
plot = sns.FacetGrid(df, hue="Species")
plot.map(sns.distplot, "PetalLengthCm").add_legend()
plot = sns.FacetGrid(df, hue="Species")
plot.map(sns.distplot, "PetalWidthCm").add_legend()
plt.show()

Output:

Box Plot
Program
# importing packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv('Iris.csv')
sns.boxplot(x='SepalWidthCm', data=df)
plt.show()

Output:

RESULT:
Thus Iris dataset has been explored and descriptively analysed using Python
programming.

PERFORM UNIVARIATE ANALYSIS ON THE DIABETES DATA SET

AIM:
Use the diabetes data set from UCI and Pima Indians Diabetes data set for Univariate
analysis.
ALGORITHM:
1. Download diabetes data set from UCI and Pima Indians Diabetes data set.
2. Load the above data files into google colab.
3. Perform analysis like Frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis.
Univariate analysis
● The term univariate analysis refers to the analysis of one variable.
● Common ways to perform univariate analysis on one variable are summary
statistics, which measure the center and spread of the values, and frequency tables:
1. Central tendency: mean, median, mode
2. Dispersion: variance, standard deviation, range, interquartile
range (IQR)
3. Skewness: symmetry of the data about the mean
4. Kurtosis: peakedness and tail heaviness of the data relative to a normal distribution
5. Frequency table: describes how often different values occur.
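
The measures listed above can be computed together on a single column. The sketch below uses a short made-up Series instead of the diabetes files, so the values are illustrative only; all calls are standard pandas Series methods.

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # illustrative values

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().tolist(),
    "variance": values.var(),             # sample variance (ddof=1)
    "std": values.std(),                  # sample standard deviation
    "range": values.max() - values.min(),
    "iqr": values.quantile(0.75) - values.quantile(0.25),
    "skewness": values.skew(),
    "kurtosis": values.kurt(),            # excess kurtosis (0 for a normal distribution)
    "frequency": values.value_counts().sort_index().to_dict(),
}
for name, result in summary.items():
    print(name, result)
```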
File Importing:
# Reading the UCI file
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Printing top 5 rows
df.head()
# Reading the Pima file
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Printing top 5 rows
df.head()

1. Central Tendency
We can use the following syntax to calculate summary statistics such as the Mean,
Median and Mode.
Mean:
It is the average value of the given numeric values.
● Mean of UCI data
import pandas as pd
# Reading the UCI file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Mean of UCI data (numeric columns only)
df.mean(axis=0, numeric_only=True)
● Mean of Pima data
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Mean of Pima data
df.mean(axis=0, numeric_only=True)
Median:
It is the middle value of the given values when they are sorted.
● Median of UCI data
import pandas as pd
# Reading the UCI file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Median of UCI data (numeric columns only)
df.median(axis=0, numeric_only=True)
● Median of Pima data
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Median of Pima data
df.median(axis=0, numeric_only=True)
Mode:
It is the most frequently occurring value of the given variable.
● Mode of UCI data
import pandas as pd
# Reading the UCI file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Mode of UCI data
df.mode(axis=0)
● Mode of Pima data
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Mode of Pima data
df.mode(axis=0)

2. Dispersion

Variance
Variance measures how far the values are spread around their mean: it is the
average of the squared deviations from the mean. The pandas var() method returns
the sample variance.
Example
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Variance of the BMI column
df.loc[:,"BMI"].var()

Standard deviation
Standard deviation is a measure of how spread out the numbers are. A large
standard deviation indicates that the data is spread out; a small standard deviation
indicates that the data is clustered closely around the mean.

Example
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Standard deviation of the BMI column
df.loc[:,"BMI"].std()
Range
Range is the simplest of the measurements but is very limited in its use. We
calculate the range by taking the largest value of the dataset and subtracting the
smallest value from it; in other words, it is the difference between the maximum
and minimum values of a dataset.
Example
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
print("Range is:", df.BloodPressure.max() - df.BloodPressure.min())

Interquartile range
The interquartile range, often denoted "IQR", is a way to measure the spread of the
middle 50% of a dataset. It is calculated as the difference between the first quartile
(the 25th percentile) and the third quartile (the 75th percentile) of a dataset.
Example
# Importing important libraries
import numpy as np
import pandas as pd
data = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
# Removing the outliers
def removeOutliers(data, col):
    Q3 = np.quantile(data[col], 0.75)
    Q1 = np.quantile(data[col], 0.25)
    IQR = Q3 - Q1
    print("IQR value for column %s is: %s" % (col, IQR))
    global outlier_free_list
    global filtered_data
    # Values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR are treated as outliers
    lower_range = Q1 - 1.5 * IQR
    upper_range = Q3 + 1.5 * IQR
    outlier_free_list = [x for x in data[col] if (x > lower_range) and (x < upper_range)]
    filtered_data = data.loc[data[col].isin(outlier_free_list)]
# Filter the first column on the full data, then every other column on the result so far
for i in data.columns:
    if i == data.columns[0]:
        removeOutliers(data, i)
    else:
        removeOutliers(filtered_data, i)
# Assigning filtered data back to our original variable
data = filtered_data
print("Shape of data after outlier removal is: ", data.shape)
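
The same 1.5 × IQR rule can be written for a single column without global variables. The following is a sketch only: the function name `iqr_filter` and the sample values are hypothetical, and it uses a boolean mask instead of the list-and-isin approach of the program above.

```python
import pandas as pd

def iqr_filter(frame, col):
    """Keep rows whose value in `col` lies strictly inside the 1.5 * IQR fences."""
    q1 = frame[col].quantile(0.25)
    q3 = frame[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return frame[(frame[col] > lower) & (frame[col] < upper)]

# Illustrative values; 102 lies far away from the rest.
sample = pd.DataFrame({"value": [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]})
cleaned = iqr_filter(sample, "value")
print(len(sample), len(cleaned))
```

Because the function returns a new DataFrame, it can be applied column by column, for example `data = iqr_filter(data, "BMI")`.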

3. Skewness

● Skewness essentially measures the symmetry of the distribution.

Example
# importing pandas as pd
import pandas as pd
# Creating the dataframe
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# skip the NA values
# find the skewness of each column
df.skew(axis=0, skipna=True)

4. Kurtosis
Kurtosis determines the heaviness of the distribution tails.
Example
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
df['BloodPressure'].kurtosis()

5. Frequency
Frequency is a count of the number of times a particular value occurs in
our data. A frequency table displays a set of values along with the
frequency with which they appear. It allows us to better understand which data
values are common and which are uncommon.
Example
# import packages
import pandas as pd
import numpy as np
# reading csv file
data = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
# one-way frequency table for the Age column
freq_table = pd.crosstab(data['Age'], 'count')
# frequency table as a proportion of all rows
freq_table = freq_table / len(data)
freq_table

Sample Output

RESULT
Thus the univariate analysis on the diabetes data of UCI and Pima was performed
successfully.

PERFORM BIVARIATE ANALYSIS ON THE DIABETES DATA SET

AIM:
To use the UCI and Pima Indians Diabetes data set for Bivariate analysis.

ALGORITHM:

1. Download diabetes data set from UCI and Pima Indians Diabetes data set.
2. Load the above data files into Google Colab.
3. Perform the various methods of bivariate analysis.

Bivariate analysis
The term bivariate analysis refers to the analysis of two variables. The purpose of
bivariate analysis is to understand the relationship between two variables.

There are three common ways to perform bivariate analysis:


1. Scatterplots
2. Correlation Coefficients
3. Simple Linear Regression
1. Scatterplots
A scatterplot is a type of data display that shows the relationship between two
numerical variables

Example
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Rows with a positive diabetes outcome
g1 = data.loc[data.Outcome==1,:]
# Pregnancies, Glucose and Diabetes relation
g1.plot.scatter('Pregnancies', 'Glucose');

2. Correlation Coefficients
The correlation coefficient is a statistical measure of the strength of the relationship
between the relative movements of two variables. The values range between -1.0 and
1.0. Correlation of -1.0 shows a perfect negative correlation, while a correlation of 1.0
shows a perfect positive correlation. A correlation of 0.0 shows no linear relationship
between the movement of the two variables.

Example
# Import the libraries
import pandas as pd
from scipy.stats import pearsonr
# Import your data into Python
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Select the two columns as series
list1 = df['BloodPressure']
list2 = df['SkinThickness']
# Apply pearsonr()
corr, _ = pearsonr(list1, list2)
print('Pearsons correlation: %.3f' % corr)
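
To make the three reference values (-1.0, 0.0 and 1.0) concrete, the sketch below computes the correlation coefficient for made-up data with NumPy's `np.corrcoef`, used here instead of `pearsonr` only so that the example has no SciPy dependency. The last case illustrates that a coefficient of 0 rules out a linear relationship, not every relationship.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_positive = np.corrcoef(x, 2 * x + 1)[0, 1]     # y rises exactly in step with x
r_negative = np.corrcoef(x, -3 * x + 10)[0, 1]   # y falls exactly in step with x
# A symmetric U-shape: y depends on x, but not linearly.
r_curved = np.corrcoef(x - 3, (x - 3) ** 2)[0, 1]

print(r_positive, r_negative, r_curved)
```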

3. Simple Linear Regression


Simple linear regression is a statistical method that we can use to find a relationship
between two variables and make predictions. The two variables used are typically
denoted as y and x. The independent variable, or the variable used to predict the
dependent variable is denoted as x. The dependent variable, or the outcome/output,
is denoted as y. A simple linear regression model will produce a line of best fit, or the
regression line. You may have heard about drawing the line of best fit through a
scatter plot of data.

Example
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
# Use a single predictor (BMI) so that the model is a simple linear regression
X = dataset[['BMI']].values
# Target: the column at index 1 of the dataset
y = dataset.iloc[:, 1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3,
random_state=0)
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
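
The line of best fit can also be illustrated on a tiny data set where the answer is known in advance. This sketch uses NumPy's `np.polyfit` rather than scikit-learn, and the five points are made up so that they lie exactly on y = 2x + 1.

```python
import numpy as np

# Illustrative points lying exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Degree-1 least-squares fit; coefficients come back highest power first.
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * 6.0 + intercept   # prediction for a new x value

print(slope, intercept, predicted)
```

With real data the points do not fall exactly on a line, and the fitted slope and intercept are the values that minimise the sum of squared vertical distances to the line.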

Sample Output

RESULT:
Thus the bivariate analysis on the diabetes data set was executed successfully.

PERFORM MULTIPLE REGRESSION ANALYSIS ON THE DIABETES DATA SET

AIM:
To use UCI and Pima Indians Diabetes data set for Multiple Regression Analysis.

ALGORITHM:
1. Download diabetes data set from UCI and Pima Indians Diabetes data set.
2. Load the above data files into google colab.
3. Perform multiple regression analysis on data sets.

Multiple Regression Analysis


Multiple regression is like linear regression, but with more than one independent
variable, meaning that we try to predict a value based on two or more variables.

Example
# Pima_diabetes
import pandas
from sklearn import linear_model
df = pandas.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
X = df[['Pregnancies', 'Glucose']]
y = df['BloodPressure']
regr = linear_model.LinearRegression()
regr.fit(X, y)
# predict the blood pressure based on pregnancies and glucose level:
predictedBP = regr.predict([[4, 120]])
print(predictedBP)
# UCI-Diabetes
import pandas
from sklearn import linear_model
df = pandas.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Time is text, so express it as minutes since midnight (assumes HH:MM format)
hours_minutes = df['Time'].str.split(':', expand=True).astype(int)
df['Minutes'] = hours_minutes[0] * 60 + hours_minutes[1]
X = df[['Minutes', 'Code']]
y = df['Value']
regr = linear_model.LinearRegression()
regr.fit(X, y)
# predict the value based on time (13:23) and code:
predictedValue = regr.predict([[13 * 60 + 23, 46]])
print(predictedValue)
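
What `LinearRegression` estimates can be shown on made-up data with known coefficients. The sketch below is an assumption-laden illustration, not the lab program: it solves the least-squares problem directly with `np.linalg.lstsq`, and the data are generated from y = 1 + 2*x1 + 3*x2 without noise.

```python
import numpy as np

# Illustrative data generated from y = 1 + 2*x1 + 3*x2 (no noise).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 1 + 2 * x1 + 3 * x2

# Design matrix: a column of ones for the intercept, then one column per predictor.
A = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, b1, b2 = coef

# Prediction for a new observation with x1 = 6 and x2 = 5.
prediction = intercept + b1 * 6.0 + b2 * 5.0
print(coef, prediction)
```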

Sample Output

RESULT:
Thus the multiple regression analysis on the diabetes data of UCI and Pima was
performed successfully.

APPLY AND EXPLORE NORMAL CURVES & HISTOGRAMS PLOTTING
FUNCTIONS ON UCI-IRIS DATA SETS

AIM:
To apply and explore Normal curves & Histograms plotting functions on UCI-Iris
data sets.

ALGORITHM:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the normal curve and Histograms for Iris data set.

Normal Curves
The normal curve is the probability density function of the normal (Gaussian)
distribution. It describes how data values are distributed around the mean and is the
most widely used probability distribution in statistics because many real-world
measurements follow it approximately.
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
# import dataset
df = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
# Plot between -20 and 20 with 0.01 steps.
x_axis = np.arange(-20, 20, 0.01)
# Calculating mean and standard deviation of the same column
mean = df["sepal.length"].mean()
sd = df["sepal.length"].std()
plt.plot(x_axis, norm.pdf(x_axis, mean, sd))
plt.show()
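
The value that `norm.pdf` returns at each point is given by the formula f(x) = exp(-(x - mean)^2 / (2 sd^2)) / (sd sqrt(2 pi)). The sketch below writes this formula out with the standard library only and compares it with `statistics.NormalDist`; the function name `normal_pdf` and the parameter values are illustrative and are not computed from the Iris file.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mean, sd):
    """Density of the normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

# Illustrative parameters (assumed, not taken from the data set).
mean, sd = 5.8, 0.8
for x in (5.0, 5.8, 6.6):
    # Both columns should agree; the curve is highest at the mean and symmetric about it.
    print(x, normal_pdf(x, mean, sd), NormalDist(mean, sd).pdf(x))
```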

Histograms plotting functions


A histogram represents numerical data that has been grouped into ranges (bins). It is
a type of bar plot in which the X-axis shows the bin ranges and the Y-axis shows the
frequency (or, with density = True, the density) of the values in each bin.
Example
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.read_csv('/content/drive/MyDrive/Data_Science/iris.csv')
data = df['sepal.length']
# Bin edges one unit wide, from the smallest value to past the largest
bins = np.arange(min(data), max(data) + 1, 1)
plt.hist(data, bins = bins, density = True)
plt.ylabel('density')
plt.xlabel('sepal.length')
plt.show()
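
The numbers behind the bars can be inspected without plotting. This is a minimal sketch with six made-up measurements; `np.histogram` takes the same `bins` and `density` arguments that `plt.hist` accepts.

```python
import numpy as np

# Illustrative measurements in cm (not read from the Iris file).
data = np.array([4.5, 5.1, 5.9, 6.3, 6.7, 7.2])
bins = np.arange(4, 9, 1)   # bin edges 4, 5, 6, 7, 8

counts, edges = np.histogram(data, bins=bins)
density, _ = np.histogram(data, bins=bins, density=True)

print(counts)    # number of values falling in each bin
print(density)   # counts rescaled so that the bar areas sum to 1
```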
Sample Output

RESULT:
Thus the normal curve and histograms were plotted for the UCI Iris data set
successfully.

DENSITY AND CONTOUR PLOTTING FUNCTIONS ON UCI-IRIS DATA SETS

AIM:
To apply and explore Density & Contour plotting functions on UCI-Iris data sets.

ALGORITHM:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the density and contour plotting for Iris data sets.

Density Plotting
A density plot is a type of data visualization tool. It is a variation of the histogram that
uses kernel smoothing while plotting the values, giving a continuous and smooth
version of a histogram inferred from the data. Density plots use Kernel Density
Estimation (so they are also known as kernel density estimation plots or KDE plots),
which estimates the probability density function of the data. The region of the plot
with a higher peak is the region where most data points lie.

Example - Density plot of several variables

# libraries & dataset
import seaborn as sns
import matplotlib.pyplot as plt
# set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above)
sns.set(style="darkgrid")
df = sns.load_dataset('iris')
# plotting both distributions on the same figure
# (fill=True replaces the older shade=True argument in seaborn 0.11 and above)
fig = sns.kdeplot(df['sepal_width'], fill=True, color="r")
fig = sns.kdeplot(df['sepal_length'], fill=True, color="b")
plt.show()

Contour plotting
Contour plots, also called level plots, are a tool for multivariate analysis and for
visualizing 3-D surfaces in 2-D space. If X and Y are the variables we want to plot,
the response Z is drawn as slices on the X-Y plane, which is why contours are
sometimes referred to as Z-slices or iso-response lines. Contour plots are widely
used to visualize density and altitude, such as the heights of mountains, and in
meteorology.

Example
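
No program is reproduced for this example in the record. The following is a minimal sketch under stated assumptions, not the original program: synthetic two-variable data stand in for two Iris measurements, the density Z is estimated on a grid with a hand-written Gaussian kernel (bandwidth 0.3, chosen arbitrarily), and the hypothetical helper `plot_contours` passes the grid to Matplotlib's `contourf`.

```python
import numpy as np

# Synthetic stand-in for two Iris measurements (assumption: not the real data).
rng = np.random.default_rng(0)
x = rng.normal(5.8, 0.8, size=150)
y = rng.normal(3.0, 0.4, size=150)

# Regular grid covering the data: 50 points along X, 40 along Y.
gx, gy = np.meshgrid(np.linspace(x.min(), x.max(), 50),
                     np.linspace(y.min(), y.max(), 40))

# Gaussian kernel density estimate evaluated at every grid point.
h = 0.3  # bandwidth, chosen arbitrarily for illustration
dx = gx[..., None] - x
dy = gy[..., None] - y
z = np.exp(-(dx ** 2 + dy ** 2) / (2 * h ** 2)).sum(axis=-1) / (len(x) * 2 * np.pi * h ** 2)

def plot_contours(gx, gy, z):
    """Draw filled contours of the density grid (requires Matplotlib)."""
    import matplotlib.pyplot as plt
    filled = plt.contourf(gx, gy, z, levels=10, cmap="viridis")
    plt.colorbar(filled, label="estimated density")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()
```

Calling `plot_contours(gx, gy, z)` draws the figure; each contour band joins grid points with a similar estimated density, so the innermost band marks where the points are most concentrated.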
Sample Output

RESULT
Thus density and contour plotting on the UCI Iris data set was executed
successfully.

CORRELATION AND SCATTER PLOTTING FUNCTIONS ON UCI DATA SETS

AIM:
To apply correlation & Scatter plotting functions on UCI-Iris data sets.

ALGORITHM:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the correlation and scatter plotting for Iris data sets.

Correlation Matrix Plotting


Correlation gives an indication of how related the changes are between two
variables. If two variables change in the same direction they are positively
correlated. If they change in opposite directions (one goes up, one goes
down), then they are negatively correlated. You can calculate the correlation
between each pair of attributes. This is called a correlation matrix. You can
then plot the correlation matrix and get an idea of which variables have a high
correlation with each other. This is useful to know, because some machine
learning algorithms like linear and logistic regression can have poor
performance if there are highly correlated input variables in your data.

Example
# Correlation Matrix Plot
import matplotlib.pyplot as plt
import pandas
import numpy
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
correlations = data.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()

Scatter Plotting
A scatterplot shows the relationship between two variables as dots in two
dimensions, one axis for each attribute. You can create a scatterplot for each pair of
attributes in your data. Drawing all these scatterplots together is called a scatterplot
matrix. Scatter plots are useful for spotting structured relationships between
variables, like whether you could summarize the relationship between two variables
with a line. Attributes with structured relationships may also be correlated and good
candidates for removal from your dataset.
Example
# Scatterplot Matrix
import matplotlib.pyplot as plt
import pandas
from pandas.plotting import scatter_matrix
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
scatter_matrix(data)
plt.show()

Sample Output

RESULT:
Thus correlation and scatter plotting on the UCI data set was executed
successfully.

VISUALIZING GEOGRAPHIC DATA WITH BASEMAP

Aim:
To visualize Geographic Data using BaseMap module in Python Programming.

Algorithm:
Step 1: Start the program
Step 2: Import the required packages
Step 3: Visualize Geographic Data with Basemap
Step 4: Display the base map using the built-in Basemap class
along with latitude and longitude parameters
Step 5: Display the coastlines and country boundaries
using built-in methods
Step 6: Fill the coastlines and country boundaries with
suitable colours
Step 7: Create a global map with a Cylindrical Equidistant
Projection, Orthographic Projection and Robinson Projection
Step 8: Stop the program

Create a global map with an Orthographic Projection

Program
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);

Output

Program
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
            width=8E6, height=8E6, lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)
# Map (long, lat) to (x, y) for plotting
# The point (-122.3, 47.6) is Seattle
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);

Output

32
Create a global map with Coastlines
Program
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='dashed', color='red')
plt.title("Coastlines", fontsize=20)
plt.show()

Create a global map with Country boundaries
Program
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries()
plt.title("Country boundaries", fontsize=20)
# The point (-122.3, 47.6) is Seattle
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);
plt.show()

Output

Create a global map with a Mercator Projection
Program
fig = plt.figure(figsize = (10,8))
m = Basemap(projection='merc', llcrnrlat=-80, urcrnrlat=80, llcrnrlon=-180, urcrnrlon=180)
m.drawcoastlines()
m.fillcontinents(color='tan', lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k')
m.drawmapboundary(fill_color='lightblue')
plt.title("Mercator Projection", fontsize=20)

Output

Create a global map with a Cylindrical Equidistant Projection
Program
fig = plt.figure(figsize = (10,8))
m = Basemap(projection='cyl', llcrnrlat=-80, urcrnrlat=80, llcrnrlon=-180, urcrnrlon=180)
m.drawcoastlines()
m.fillcontinents(color='tan', lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k')
m.drawmapboundary(fill_color='lightblue')
plt.title("Cylindrical Equidistant Projection", fontsize=20)

Output

Create a global map with an Orthographic Projection
Program
fig = plt.figure(figsize = (10,8))
m = Basemap(projection='ortho', lon_0 = 25, lat_0 = 10)
m.drawcoastlines()
m.fillcontinents(color='tan', lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k')
m.drawmapboundary(fill_color='lightblue')
plt.title("Orthographic Projection", fontsize=18)

Output

Create a global map with a Robinson Projection
Program
fig = plt.figure(figsize = (10,8))
m = Basemap(projection='robin', llcrnrlat=-80, urcrnrlat=80, llcrnrlon=-180, urcrnrlon=180,
            lon_0 = 0, lat_0 = 0)
m.drawcoastlines()
m.fillcontinents(color='tan', lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k')
m.drawmapboundary(fill_color='lightblue')
plt.title("Robinson Projection", fontsize=20)

Output

RESULT:
Thus geographic data has been visualized using the Basemap module in
Python programming successfully.