CS3361 - Data Science Lab Record
Name :
Register Number :
Subject Name :
Subject Code :
Batch :
JEPPIAAR NAGAR, RAJIVGANDHI SALAI
CHENNAI – 600119.
Date: Examiners
Internal:
External:
COLLEGE VISION & MISSION
Vision
To build Jeppiaar Engineering College as an institution of academic excellence in
technological and management education to become a world class university.
Mission
To excel in teaching and learning, research and innovation by promoting the
principles of scientific analysis and creative thinking.
To participate in the production, development and dissemination of knowledge and
interact with national and international communities.
To equip students with values, ethics and life skills needed to enrich their lives and
enable them to contribute for the progress of society.
To prepare students for higher studies and lifelong learning, enrich them with the
practical skills necessary to excel as future professionals and entrepreneurs for the
benefit of Nation’s economy.
Program Outcomes
PO1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO2 Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
PO3 Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
PO4 Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions.
PO5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with
an understanding of the limitations.
PO6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
professional engineering practice.
PO7 Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for
sustainable development.
PO8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms
of the engineering practice.
PO9 Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
PO10 Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.
PO11 Project management and finance: Demonstrate knowledge and understanding of the engineering
and management principles and apply these to one’s own work, as a member and leader in a team,
to manage projects and in multidisciplinary environments.
PO12 Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
DEPARTMENT OF INFORMATION TECHNOLOGY
Vision
To produce engineers with excellent knowledge in the field of Information Technology
through scientific and practical education to succeed in an increasingly complex world.
Mission
To demonstrate technical and operational excellence through creative and critical
thinking for the effective use of emerging technologies.
To engage in a constructive, team-oriented environment and transfer knowledge to
enable global interaction.
To enrich students with professional integrity and ethical standards that will help
them deal with social challenges successfully in their lives.
To prepare students for higher studies and perpetual learning, and to upgrade them into
competent engineers and entrepreneurs for the country’s development.
PEO 1 To support students with substantial knowledge for developing and resolving
mathematical, scientific and engineering problems
PEO 2 To provide students with adequate training and opportunities to work as a
collaborator with informative and administrative qualities
PEO 3 To shape students with principled values to follow the code of ethics in social
and professional life
PEO 4 To motivate students for extensive learning to prepare them for graduate
studies, R&D and competitive exams
PEO 5 To provide students with industrial exposure in an endeavor to succeed in
emerging cutting-edge technologies
EX NO.   NAME OF THE EXPERIMENT   DATE   PAGE NO.   SIGN
i. NUMPY PACKAGE
Description:
NumPy is the fundamental package needed for scientific computing with Python.
OUTPUT:
ii. SCIPY PACKAGE
Description
OUTPUT:
OUTPUT:
Description
Statsmodels is a Python module that allows users to explore data, estimate statistical models, and
perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting
functions, and result statistics are available for different types of data and each estimator.
Researchers across fields may find that statsmodels fully meets their needs for statistical
computing and data analysis in Python.
OUTPUT:
v. PANDAS PACKAGES
OUTPUT:
RESULT:
EX NO:
WORKING WITH NUMPY ARRAYS
AIM:
PROCEDURE:
PROGRAM
import numpy
numpy.__version__
OUTPUT:
'1.21.5'
Determining the size, shape, memory consumption, and data types of arrays.
We’ll start by defining three random arrays: a one-dimensional, two-dimensional, and three-
dimensional array.
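import numpy as np
# three-dimensional array of random integers; the shape matches the output below
x3 = np.random.randint(10, size=(3, 4, 5))
print("dtype:", x3.dtype)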
OUTPUT:
x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
dtype: int32
itemsize: 4 bytes
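The attribute checks described above can be sketched as follows. This is a minimal sketch assuming arrays of random integers below 10 and a fixed seed (both are assumptions, not taken from the record). The dtype and itemsize depend on the platform, so they may differ from the int32 and 4 bytes shown above.

```python
import numpy as np

rng = np.random.RandomState(0)            # assumed seed, for repeatability only
x1 = rng.randint(10, size=6)              # one-dimensional array
x2 = rng.randint(10, size=(3, 4))         # two-dimensional array
x3 = rng.randint(10, size=(3, 4, 5))      # three-dimensional array

print("x3 ndim:", x3.ndim)                # number of dimensions
print("x3 shape:", x3.shape)              # length of each dimension
print("x3 size:", x3.size)                # total number of elements
print("dtype:", x3.dtype)                 # element type (platform dependent)
print("itemsize:", x3.itemsize, "bytes")  # bytes per element
print("nbytes:", x3.nbytes, "bytes")      # total bytes = size * itemsize
```

The nbytes attribute is generally equal to size multiplied by itemsize, which gives the memory consumed by the array data.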
If you are familiar with Python’s standard list indexing, indexing in NumPy will feel
quite familiar. In a one-dimensional array, you can access the ith value (counting from zero) by
specifying the desired index in square brackets, just as with Python lists.
To index from the end of the array, you can use negative indices.
import numpy as np
arr=np.array([5, 0, 3, 3, 7, 9])
print(arr[2])
print(arr[-1])
print(arr[5])
print(arr[-2])
OUTPUT:
3
9
9
7
import numpy as np
print(arr[0,0])
print(arr[2,-2])
OUTPUT:
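In a multidimensional array, items are accessed with a comma-separated tuple of indices, one per axis. The sketch below assumes a 3×4 array with illustrative values; the array used by the program above is not shown in this record.

```python
import numpy as np

# Hypothetical 3x4 array (illustrative values, not taken from the record)
arr = np.array([[3, 5, 2, 4],
                [7, 6, 8, 8],
                [1, 6, 7, 7]])

print(arr[0, 0])    # row 0, column 0
print(arr[2, -2])   # last row, second column from the end
arr[0, 0] = 12      # elements can be modified with the same index notation
print(arr[0])       # first row after the assignment
```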
import numpy as np
arr=np.array([0,1,2,3,4,5,6,7,8,9])
arr1=np.arange(10)
print(arr[1:5])
print(arr1[2:7])
print(arr[5:])
print(arr[:5])
print(arr1[-3:-1])
print(arr1[::2])
print(arr[0::3])
print(arr[::-1])
OUTPUT:
[1 2 3 4]
[2 3 4 5 6]
[5 6 7 8 9]
[0 1 2 3 4]
[7 8]
[0 2 4 6 8]
[0 3 6 9]
[9 8 7 6 5 4 3 2 1 0]
import numpy as np
OUTPUT:
[[12 5 2]
[ 7 6 8]]
[[12 2]
[ 7 8]
[ 1 7]]
[[ 7 7 6 1]
[ 8 8 6 7]
[ 4 2 5 12]]
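The three outputs above are consistent with slicing a 3×4 array along both axes. A minimal sketch, assuming the array below (reconstructed from the printed values) and assuming the slice expressions shown in the comments:

```python
import numpy as np

# Array reconstructed from the printed output; treat it as an assumption
x2 = np.array([[12, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

print(x2[:2, :3])      # first two rows, first three columns
print(x2[:3, ::2])     # all three rows, every other column
print(x2[::-1, ::-1])  # rows and columns both reversed
```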
v. Reshaping of Arrays
Another useful type of operation is reshaping of arrays, that is, changing the shape of a given
array. The most flexible way of doing this is with the reshape() method. For example, if you want
to put the numbers 0 through 8 in a 3×3 grid, you can do the following:
import numpy as np
grid = np.arange(0, 9)
print(grid)
print(grid.reshape(3,3))
OUTPUT:
[0 1 2 3 4 5 6 7 8]
[[0 1 2]
[3 4 5]
[6 7 8]]
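reshape() requires the new shape to hold exactly as many elements as the original array. The sketch below is illustrative: it shows -1 used to let NumPy infer one dimension, and the ValueError raised for an incompatible shape.

```python
import numpy as np

grid = np.arange(9)            # nine elements: 0 through 8

square = grid.reshape(3, 3)    # 3 * 3 == 9, so this succeeds
column = grid.reshape(-1, 1)   # -1 lets NumPy infer the row count (9)
print(square.shape, column.shape)

try:
    grid.reshape(2, 5)         # 2 * 5 != 9, so NumPy raises ValueError
except ValueError as err:
    print("reshape failed:", err)
```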
RESULT:
EX NO:
WORKING WITH PANDAS DATA FRAMES
AIM:
PROCEDURE:
PROGRAM:
import pandas as pd
data={"calories":[420,380,390],"duration":[50,40,45]}
df=pd.DataFrame(data) #Load data into DataFrame object
print(df)
print(df.loc[0]) #Refer to the row index
print(df.loc[[0,1]]) #Use a list of indexes
OUTPUT:
   calories  duration
0       420        50
1       380        40
2       390        45
calories    420
duration     50
Name: 0, dtype: int64
   calories  duration
0       420        50
1       380        40
import pandas as pd
data={"calories":[420,380,390],"duration":[50,40,45]}
df=pd.DataFrame(data,index=["day1","day2","day3"])
print(df)
OUTPUT:
calories duration
day1 420 50
day2 380 40
day3 390 45
b) Locate Named Indexes
Use the named index in the loc attribute to return the specified row(s).
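A minimal sketch of locating rows by named index, reusing the calories/duration data and the day1 to day3 labels from the program above:

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

row = df.loc["day2"]              # a single label returns a Series
rows = df.loc[["day1", "day3"]]   # a list of labels returns a DataFrame
print(row)
print(rows)
```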
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
print(df)
# select two columns
print(df[['Name', 'Qualification']])
OUTPUT:
As is evident from the output, the keys of the dictionary are converted into columns of the
DataFrame, whereas the elements of the lists are converted into rows.
Name Qualification
0 Jai Msc
1 Princi MA
2 Gaurav MCA
3 Anuj Phd
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Adding index value explicitly
df = pd.DataFrame(data,index=['Rollno1','Rollno2','Rollno3','Rollno4'])
print(df)
OUTPUT:
Name Age Address Qualification
Rollno1 Jai 27 Delhi Msc
Rollno2 Princi 24 Kanpur MA
Rollno3 Gaurav 22 Allahabad MCA
Rollno4 Anuj 32 Kannauj Phd
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame.from_dict(data) #from_dict() function
print(df)
OUTPUT:
Name Age Address Qualification
0 Jai 27 Delhi Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannauj Phd
Column selection
To select a column in a Pandas DataFrame, we can access the columns by their column names.
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# select two columns
print(df[['Name', 'Qualification']])
OUTPUT:
Name Qualification
0 Jai Msc
1 Princi MA
2 Gaurav MCA
3 Anuj Phd
Column Addition
To add a column to a Pandas DataFrame, we can declare a new list and assign it as a column of
an existing DataFrame.
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# Declare a list that is to be converted into a column
address=['Delhi', 'Kanpur', 'Allahabad', 'Kannauj']
df['Address']=address
# display the DataFrame with the new column
print(df)
OUTPUT:
Name Age Qualification Address
0 Jai 27 Msc Delhi
1 Princi 24 MA Kanpur
2 Gaurav 22 MCA Allahabad
3 Anuj 32 Phd Kannauj
v. Indexing and selecting data in Pandas DataFrame using [ ], loc & iloc
Indexing in Pandas means selecting particular rows and columns of data from a DataFrame. A
selection can take all rows and some columns, some rows and all columns, or some rows and
some columns. Indexing is also known as subset selection.
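import pandas as pd
# List of Tuples (rows reconstructed from the outputs shown later in this record)
employees = [('Stuti', 28, 'Varanasi', 20000), ('Saumya', 32, 'Delhi', 25000),
             ('Aaditya', 25, 'Mumbai', 40000), ('Saumya', 32, 'Delhi', 35000),
             ('Saumya', 32, 'Delhi', 30000), ('Saumya', 32, 'Mumbai', 20000),
             ('Aaditya', 40, 'Dehradun', 24000), ('Seema', 32, 'Delhi', 70000)]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City', 'Salary'])
print(df)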
OUTPUT
# List of Tuples
# Create a DataFrame object from list
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City', 'Salary'])
# to select a column
result = df["City"]
print(result)
OUTPUT:
0 Varanasi
1 Delhi
2 Mumbai
3 Delhi
4 Delhi
5 Mumbai
6 Dehradun
7 Delhi
Example 2:
Select multiple columns.
import pandas as pd
# List of Tuples
('Saumya', 32, 'Delhi', 35000),
print(result)
OUTPUT:
0 Stuti 28 20000
1 Saumya 32 25000
2 Aaditya 25 40000
3 Saumya 32 35000
4 Saumya 32 30000
5 Saumya 32 20000
6 Aaditya 40 24000
7 Seema 32 70000
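The two column-selection examples above can be sketched as follows. The employees list is not printed in full in this record; the one below is reconstructed from the outputs shown and should be treated as an assumption.

```python
import pandas as pd

# Employee records reconstructed from the outputs in this record (assumption)
employees = [('Stuti', 28, 'Varanasi', 20000),
             ('Saumya', 32, 'Delhi', 25000),
             ('Aaditya', 25, 'Mumbai', 40000),
             ('Saumya', 32, 'Delhi', 35000),
             ('Saumya', 32, 'Delhi', 30000),
             ('Saumya', 32, 'Mumbai', 20000),
             ('Aaditya', 40, 'Dehradun', 24000),
             ('Seema', 32, 'Delhi', 70000)]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City', 'Salary'])

city = df["City"]                       # one label returns a Series
subset = df[["Name", "Age", "Salary"]]  # a list of labels returns a DataFrame
print(city)
print(subset)
```

Passing a single label returns a Series, while passing a list of labels (note the double brackets) returns a DataFrame.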
Select Rows by Name in Pandas DataFrame using loc
The .loc[] indexer selects data by the labels of rows or columns. It can select a subset of rows
and columns, and it accepts a single label, a list of labels, or a slice.
Example 1:
Select a single row.
import pandas as pd
# List of Tuples
# on a Dataframe
result = df.loc["Stuti"]
# Show the dataframe
print(result)
OUTPUT:
Age 28
City Varanasi
Salary 20000
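A sketch of label-based row selection. It assumes the Name column has been made the index with set_index (consistent with the output above, where Name no longer appears as a field) and uses a shortened employee list reconstructed from this record's outputs.

```python
import pandas as pd

# Shortened employee list; rows reconstructed from this record's outputs
employees = [('Stuti', 28, 'Varanasi', 20000),
             ('Saumya', 32, 'Delhi', 25000),
             ('Aaditya', 25, 'Mumbai', 40000),
             ('Seema', 32, 'Delhi', 70000)]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City', 'Salary'])
df = df.set_index('Name')            # row labels are now the names

one = df.loc["Stuti"]                # single label returns a Series
many = df.loc[["Stuti", "Seema"]]    # list of labels returns a DataFrame
print(one)
print(many)
```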
Example 2:
Select multiple rows.
# import pandas
import pandas as pd
# List of Tuples
# on a Dataframe
result = df.loc[["Stuti","Seema","Aaditya"]]
print(result)
OUTPUT:
Example 3:
Select multiple rows and particular columns.
Syntax: Dataframe.loc[["row1", "row2"...], ["column1", "column2", "column3"...]]
# import pandas
import pandas as pd
# List of Tuples
('Saumya', 32, 'Delhi', 35000),
# on a Dataframe
# multiple columns
print(result)
OUTPUT:
Example 4:
Select all the rows with some particular columns. We use a single colon (:) to select all rows,
followed by the list of columns that we want to select, as given below:
Syntax: Dataframe.loc[:, ["column1", "column2", "column3"]]
# import pandas
import pandas as pd
# List of Tuples
df = pd.DataFrame(employees,
# on a Dataframe
# select all the rows with
print(result)
OUTPUT:
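Both forms of the loc syntax, chosen rows with chosen columns and all rows with chosen columns, can be sketched under the same assumptions as before (Name as the index, a shortened employee list reconstructed from this record's outputs):

```python
import pandas as pd

# Shortened employee list; rows reconstructed from this record's outputs
employees = [('Stuti', 28, 'Varanasi', 20000),
             ('Saumya', 32, 'Delhi', 25000),
             ('Aaditya', 25, 'Mumbai', 40000),
             ('Seema', 32, 'Delhi', 70000)]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City', 'Salary'])
df = df.set_index('Name')

# Chosen rows and chosen columns
part = df.loc[["Stuti", "Seema"], ["City", "Salary"]]
# All rows (single colon) with chosen columns
cols = df.loc[:, ["City", "Salary"]]
print(part)
print(cols)
```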
Example 1:
# import pandas
import pandas as pd
# List of Tuples
result = df.iloc[2]
print(result)
OUTPUT:
Name Aaditya
Age 25
City Mumbai
Salary 40000
Example 2:
Select multiple rows.
# import pandas
import pandas as pd
# List of Tuples
df = pd.DataFrame(employees,
print(result)
OUTPUT:
Example 3:
Select multiple rows with some particular columns.
# import pandas
import pandas as pd
# List of Tuples
df = pd.DataFrame(employees,
columns =['Name', 'Age','City', 'Salary'])
print(result)
OUTPUT:
Name Age
2 Aaditya 25
3 Saumya 32
5 Saumya 32
Example 4:
Select all the rows with some particular columns.
# import pandas
import pandas as pd
# List of Tuples
('Saumya', 32, 'Delhi', 30000),
print(result)
OUTPUT:
Name Age
0 Stuti 28
1 Saumya 32
2 Aaditya 25
3 Saumya 32
4 Saumya 32
5 Saumya 32
6 Aaditya 40
7 Seema 32
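The four iloc examples above select by integer position rather than by label. A minimal sketch that reproduces them, with the employee list reconstructed from the outputs in this record (an assumption):

```python
import pandas as pd

# Employee records reconstructed from the outputs in this record (assumption)
employees = [('Stuti', 28, 'Varanasi', 20000),
             ('Saumya', 32, 'Delhi', 25000),
             ('Aaditya', 25, 'Mumbai', 40000),
             ('Saumya', 32, 'Delhi', 35000),
             ('Saumya', 32, 'Delhi', 30000),
             ('Saumya', 32, 'Mumbai', 20000),
             ('Aaditya', 40, 'Dehradun', 24000),
             ('Seema', 32, 'Delhi', 70000)]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City', 'Salary'])

row = df.iloc[2]                     # single position returns a Series
rows = df.iloc[[2, 3, 5]]            # list of positions returns a DataFrame
block = df.iloc[[2, 3, 5], [0, 1]]   # chosen rows, first two columns
cols = df.iloc[:, [0, 1]]            # all rows, first two columns
print(row)
print(rows)
print(block)
print(cols)
```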
RESULT:
EX NO: READING DATA FROM TEXT FILES, EXCEL, AND THE WEB AND
EXPLORING VARIOUS COMMANDS FOR DOING DESCRIPTIVE
ANALYTICS ON THE IRIS DATA SET
AIM:
PROCEDURE:
PROGRAM:
import pandas as pd
df = pd.read_csv(r"D:\iris_csv.csv")  # raw string keeps the backslash literal
print(df.head())
print(df.shape)
print(df.info())
print(df.describe())
print(df.isnull().sum())
print(df.sample(10))
print(df.columns)
print(df)
#data[start:end]
print(df[10:21])
sliced_data=df[10:21]
print(sliced_data)
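# data["column_name"].sum()
sum_data = df["sepallength"].sum()
mean_data = df["sepallength"].mean()
median_data = df["sepallength"].median()
min_data=df["sepallength"].min()
max_data=df["sepallength"].max()
print("Sum:", sum_data)
print("Mean:", mean_data)
print("Median:", median_data)
print("Minimum:", min_data)
print("Maximum:", max_data)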
print(df["class"].value_counts())
OUTPUT:
(150, 5)
<class 'pandas.core.frame.DataFrame'>
3 petalwidth 150 non-null float64
None
sepallength 0
sepalwidth 0
petallength 0
petalwidth 0
class 0
dtype: int64
93 5.0 2.3 3.3 1.0 Iris-versicolor
14 5.8 4.0 1.2 0.2 Iris-setosa
Sum: 876.5
Mean: 5.843333333333335
Median: 5.8
Minimum: 4.3
Maximum: 7.9
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
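read_csv also accepts a file-like object, which makes it possible to try the commands above without the CSV file on disk. The sketch below assumes a six-row sample in the same column layout; the sample text is illustrative and is not the full 150-row data set.

```python
import io
import pandas as pd

# Illustrative six-row sample in the same layout as the iris CSV (assumption)
csv_text = """sepallength,sepalwidth,petallength,petalwidth,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""
df = pd.read_csv(io.StringIO(csv_text))  # a file path or URL works the same way

print(df.shape)                          # (rows, columns)
print(df.describe())                     # count, mean, std, min, quartiles, max
print(df.isnull().sum())                 # missing values per column
counts = df["class"].value_counts()      # number of rows per class
print(counts)
```

For Excel workbooks, pandas provides read_excel, which returns a DataFrame in the same way; it typically needs an additional engine package such as openpyxl to be installed.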
RESULT:
EX NO: USE THE DIABETES DATA SET FROM UCI AND PIMA INDIANS DIABETES
DATA SET FOR PERFORMING UNIVARIATE ANALYSIS, BIVARIATE
ANALYSIS AND MULTIPLE REGRESSION ANALYSIS
AIM:
PROCEDURE:
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
# Reading the CSV file
df = pd.read_csv(r"D:\di.csv")  # raw string keeps the backslash literal
print(df)
#Mean
mean=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].mean()
print(mean)
#Median
median=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].median()
print(median)
#Mode
mode=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].mode()
print(mode)
#Variance
variance=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].var()
print(variance)
#StandardDeviation
sd=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].std()
print(sd)
b. Bivariate analysis: Linear and logistic regression modeling
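A sketch of simple linear regression between two variables. It swaps in NumPy's least-squares polyfit in place of statsmodels, and it uses only the eight rows visible in this record's printed output, so its coefficients will not match the record's 10-row regression summary. The choice of Age as predictor and Glucose as response is an assumption.

```python
import numpy as np

# Age and Glucose values from the eight rows visible in this record's output
age = np.array([50, 31, 32, 21, 30, 26, 29, 54], dtype=float)
glucose = np.array([148, 85, 183, 89, 116, 78, 115, 166], dtype=float)

# Fit glucose = slope * age + intercept by ordinary least squares
slope, intercept = np.polyfit(age, glucose, deg=1)
predicted = slope * age + intercept
residuals = glucose - predicted

# Pearson correlation coefficient between the two variables
r = np.corrcoef(age, glucose)[0, 1]
print("slope:", slope, "intercept:", intercept, "r:", r)
```

With an intercept in the model, the least-squares residuals sum to approximately zero, which is a quick sanity check on the fit.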
OUTPUT:
0 2 148 72 0 0.627 50
1 1 85 66 0 0.351 31
2 3 183 64 0 0.672 32
3 0 89 66 94 0.167 21
5 2 116 74 0 0.201 30
6 0 78 50 88 0.248 26
7 5 115 0 0 0.134 29
9 1 166 96 0 0.232 54
Pregnancies 2.1000
Glucose 131.4000
BP 59.8000
Insulin 89.3000
Diabetes 0.5078
Age 35.9000
dtype: float64
Pregnancies 2.00
Glucose 126.50
BP 66.00
Insulin 0.00
Diabetes 0.24
Age 31.50
dtype: float64
Pregnancies 2.766667
Glucose 1753.155556
BP 658.177778
Insulin 28878.677778
Diabetes 0.427865
Age 140.988889
dtype: float64
Pregnancies 1.663330
Glucose 41.870700
BP 25.654976
Insulin 169.937276
Diabetes 0.654114
Age 11.873874
dtype: float64
C:\Users\NIRMALKUMAR\anaconda3\lib\site-packages\scipy\stats\stats.py:1541:
UserWarning: kurtosistest only valid for n>=20 .... continuing anyway, n=10
==============================================================================
Df Residuals: 8 BIC: 98.35
Df Model: 1
==============================================================================
RESULT:
EX NO: APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON UCI DATA
SETS.
AIM:
PROCEDURE:
PROGRAM
a. Normal curves
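import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
x = np.linspace(1,100,50)
#Creating a Function.
def normal_dist(x, mean, sd):
    # normal probability density with the given mean and standard deviation
    prob_density = np.exp(-0.5*((x-mean)/sd)**2) / (sd*np.sqrt(2*np.pi))
    return prob_density
mean = np.mean(x)
sd = np.std(x)
pdf = normal_dist(x,mean,sd)
plt.plot(x, pdf, color='red')
plt.xlabel('Data points')
plt.ylabel('Probability Density')
plt.show()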
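A density computed by a function such as normal_dist should follow the normal probability density formula, so it peaks at 1/(sd·sqrt(2π)) and integrates to about 1. A small numerical check, assuming a standard normal (mean 0, sd 1) and a grid spanning six standard deviations on each side (both choices are illustrative):

```python
import numpy as np

def normal_dist(x, mean, sd):
    """Normal probability density evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

x = np.linspace(-6, 6, 1001)
pdf = normal_dist(x, 0.0, 1.0)

# Trapezoidal rule: average of adjacent heights times the grid spacing
area = np.sum((pdf[1:] + pdf[:-1]) / 2 * np.diff(x))
print("peak:", pdf.max(), "area:", area)
```

A grid that covers only part of the curve, such as one running from 1 to 100 around a mean of 50.5, gives an area noticeably below 1 because the tails are cut off.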
b. Density and contour plots
#%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
import numpy as np
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
# Visualizing three-dimensional data with contours
plt.contour(X, Y, Z, colors='black')
plt.show()
# Visualizing three-dimensional data with colored contours
plt.contour(X, Y, Z, 20, cmap='RdGy')
plt.show()
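c. Correlation and scatter plots
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
rand=np.random.RandomState(10)
x=rand.randint(100,size=20)
y = np.sin(x)
plt.scatter(x, y)  # scatter plot of y against x
plt.show()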
d. Histograms
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
rand=np.random.RandomState(0)
x=rand.randint(10,size=5)
plt.hist(x)
OUTPUT:
a) Normal Curve
c) Correlation and Scatter Plots
d) Histogram
RESULT:
EX NO:
VISUALIZING GEOGRAPHIC DATA WITH BASEMAP
AIM:
PROCEDURE:
Basemap() Package Installation
Installation of Basemap is straightforward; if you’re using conda you can type this and the
package will be downloaded:
Description
The Basemap toolkit is a library for plotting 2D data on maps in Python. It is similar in
functionality to the MATLAB mapping toolbox, the IDL mapping facilities, GrADS, or the Generic
Mapping Tools.
PROGRAM:
m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
m.drawmapboundary(fill_color='lightblue')
plt.title("Orthographic Projection", fontsize=18)
OUTPUT:
RESULT: