
JEPPIAAR NAGAR, RAJIVGANDHI SALAI

CHENNAI – 600119.

DEPARTMENT OF INFORMATION TECHNOLOGY

II YEAR B.TECH – III SEM

ACADEMIC YEAR 2023-24 (ODD SEM)

Name :

Register Number :

Subject Name :

Subject Code :

Batch :
JEPPIAAR NAGAR, RAJIVGANDHI SALAI

CHENNAI – 600119.

DEPARTMENT OF INFORMATION TECHNOLOGY

This is a Bonafide Record Work of

Register No. submitted for the Anna University Practical

Examination held on in CS3361 – DATA SCIENCE

LABORATORY during the year .

Signature of the Lab-In-Charge Signature of the HOD

Date: Examiners

Internal:

External:
COLLEGE VISION & MISSION
Vision
To build Jeppiaar Engineering College as an institution of academic excellence in
technological and management education to become a world class university.
Mission
• To excel in teaching and learning, research and innovation by promoting the principles of scientific analysis and creative thinking.
• To participate in the production, development and dissemination of knowledge and interact with national and international communities.
• To equip students with values, ethics and life skills needed to enrich their lives and enable them to contribute to the progress of society.
• To prepare students for higher studies and lifelong learning, enrich them with the practical skills necessary to excel as future professionals and entrepreneurs for the benefit of the Nation’s economy.
Program Outcomes
PO1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO2 Problem analysis: Identify, formulate, review research literature, and analyze complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
PO3 Design/development of solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for the public health and safety, and the cultural, societal, and environmental considerations.
PO4 Conduct investigations of complex problems: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.
PO5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including prediction and modeling to complex engineering activities with an understanding of the limitations.
PO6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.
PO7 Environment and sustainability: Understand the impact of the professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable development.
PO8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practice.
PO9 Individual and team work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
PO10 Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as, being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
PO11 Project management and finance: Demonstrate knowledge and understanding of the engineering and management principles and apply these to one’s own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.
PO12 Life-long learning: Recognize the need for, and have the preparation and ability to engage in independent and life-long learning in the broadest context of technological change.
DEPARTMENT OF INFORMATION TECHNOLOGY
Vision
To produce engineers with excellent knowledge in the field of Information Technology
through scientific and practical education to succeed in an increasingly complex world.

Mission
• To demonstrate technical and operational excellence through creative and critical thinking for the effective use of emerging technologies.
• To involve in a constructive, team-oriented environment and transfer knowledge to enable global interaction.
• To enrich students with professional integrity and ethical standards that will make them deal with social challenges successfully in their life.
• To prepare students for higher studies and perpetual learning, and upgrade them as competent engineers and entrepreneurs for the country’s development.

Program Educational Objectives (PEOs)

PEO 1 To support students with substantial knowledge for developing and resolving mathematical, scientific and engineering problems
PEO 2 To provide students with adequate training and opportunities to work as a collaborator with informative and administrative qualities
PEO 3 To shape students with principled values to follow the code of ethics in social and professional life
PEO 4 To motivate students for extensive learning to prepare them for graduate studies, R&D and competitive exams
PEO 5 To cater the students with industrial exposure in an endeavor to succeed in the emerging cutting edge technologies

Program Specific Outcomes


PSO1 Students are able to analyze, design, implement and test any software with the programming and testing skills they have acquired.
PSO2 Students are able to design algorithms and data management to meet desired needs for real-time problems through analytical, logical and problem-solving skills.
PSO3 Students are able to provide security solutions for network components and data storage & management which will enable them to work in the industry ethically.

Course Outcomes (COs)


C407.1 Implement the cipher techniques
C407.2 Develop the various private key based security algorithms
C407.3 Develop the various Public key based algorithms
C407.4 Implement the Digital Signature Scheme.
C407.5 Analyze different open source tools for network security and analysis
INDEX

EX NO.  NAME OF THE EXPERIMENT                                                        DATE    PAGE NO.    SIGN

1  DOWNLOAD, INSTALL AND EXPLORE THE FEATURES OF NUMPY, SCIPY, JUPYTER, STATSMODELS AND PANDAS PACKAGES.
2  WORKING WITH NUMPY ARRAYS
3  WORKING WITH PANDAS DATA FRAMES
4  READING DATA FROM TEXT FILES, EXCEL AND THE WEB AND EXPLORING VARIOUS COMMANDS FOR DOING DESCRIPTIVE ANALYTICS ON THE IRIS DATA SET.
5  USE THE DIABETES DATA SET FROM UCI AND PIMA INDIANS DIABETES DATA SET FOR PERFORMING UNIVARIATE ANALYSIS, BIVARIATE ANALYSIS AND MULTIPLE REGRESSION ANALYSIS
6  APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON UCI DATA SETS.
7  VISUALIZING GEOGRAPHIC DATA WITH BASEMAP
EX NO: DOWNLOAD, INSTALL AND EXPLORE THE FEATURES OF NUMPY, SCIPY,
JUPYTER, STATSMODELS AND PANDAS PACKAGES

i. NUMPY PACKAGE

Array processing for numbers, strings, records, and objects.

To install this package with conda run:


conda install -c anaconda numpy

Description:

NumPy is the fundamental package needed for scientific computing with Python.

How to check the Numpy version

1. Use the pip list or pip3 list command.


2. From command line type: pip3 show numpy or pip show numpy.
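
The version can also be checked from inside Python; a minimal sketch (the version string shown is only an example and will differ per installation):

import numpy
print(numpy.__version__)   # e.g. '1.21.5'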

OUTPUT:

ii. SCIPY PACKAGE

Scientific Library for Python


To install this package with conda run:
conda install -c anaconda scipy

Description

SciPy is a Python-based ecosystem of open-source software for mathematics, science, and


engineering.
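
A quick way to confirm that SciPy imports correctly is to run one of its statistical routines; a minimal sketch using scipy.stats.describe on a few illustrative values (the numbers are made up for this check):

import numpy as np
from scipy import stats
data = np.array([1.0, 2.0, 2.5, 3.5, 5.0])   # illustrative values
print(stats.describe(data))   # count, min/max, mean, variance, skewness, kurtosis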

OUTPUT:

iii. JUPYTER PACKAGE


To install Jupyter using Anaconda, go through the following steps:

1. Launch Anaconda Navigator.
2. Click on the Install button under Jupyter Notebook.
3. Beginning the installation.
4. Loading packages.
5. Finished installation.

Alternatively, install JupyterLab with pip: pip install jupyterlab
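
Once installed, the notebook server can also be started from the command line; the commands below are the standard Jupyter entry points:

jupyter notebook
jupyter lab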

OUTPUT:

iv. STATSMODELS PACKAGE

Statistical computations and models for use with SciPy

To install this package with conda run:


conda install -c anaconda statsmodels

Description
Statsmodels is a Python module that allows users to explore data, estimate statistical models, and
perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting
functions, and result statistics are available for different types of data and each estimator.
Researchers across fields may find that statsmodels fully meets their needs for statistical
computing and data analysis in Python.
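
As a quick check that statsmodels imports and fits correctly, a minimal sketch with made-up data (the values and variable names are illustrative only):

import numpy as np
import statsmodels.api as sm
x = sm.add_constant(np.arange(10))             # predictor plus an intercept column
y = 2 * np.arange(10) + np.random.randn(10)    # illustrative response values
model = sm.OLS(y, x).fit()                     # ordinary least squares fit
print(model.params)                            # estimated intercept and slope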

OUTPUT:

v. PANDAS PACKAGE

pandas provides high-performance, easy-to-use data structures and data analysis tools for Python.

To install this package with conda run:

conda install -c anaconda pandas

To install this package with pip run:

pip install pandas
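
A quick check that the installation works; a minimal sketch creating a small DataFrame (the values are illustrative):

import pandas as pd
print(pd.__version__)
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})   # illustrative values
print(df)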

OUTPUT:

RESULT:

EX NO:
WORKING WITH NUMPY ARRAYS

AIM:

PROCEDURE:

PROGRAM

i. To check the Numpy version:

import numpy

numpy.__version__

OUTPUT:

'1.21.5'

ii. Attributes of arrays (NumPy array attributes)

Determining the size, shape, memory consumption, and data types of arrays.
We’ll start by defining three random arrays: a one-dimensional, a two-dimensional, and a three-dimensional array.

import numpy as np

np.random.seed(0) # seed for reproducibility

x1 = np.random.randint(10, size=6) # One-dimensional array

x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array

x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array

print("x3 ndim: ", x3.ndim)

print("x3 shape:", x3.shape)

print("x3 size: ", x3.size)

print("dtype:", x3.dtype)

print("itemsize:", x3.itemsize, "bytes")

print("nbytes:", x3.nbytes, "bytes")

OUTPUT:

x3 ndim: 3

x3 shape: (3, 4, 5)

x3 size: 60

dtype: int32

itemsize: 4 bytes

nbytes: 240 bytes

iii. Indexing of Arrays


Getting and setting the value of individual array elements.
Array Indexing: Accessing Single Elements

If you are familiar with Python’s standard list indexing, indexing in NumPy will feel
quite familiar. In a one-dimensional array, you can access the ith value (counting from zero) by
specifying the desired index in square brackets, just as with Python lists.
To index from the end of the array, you can use negative indices.

import numpy as np

arr=np.array([5, 0, 3, 3, 7, 9])

print(arr[2])

print(arr[-1])

print(arr[5])

print(arr[-2])

OUTPUT:
3
9
9
7

Array Indexing: Multidimensional Array

In a multidimensional array, you access items using a comma-separated tuple of indices:

import numpy as np

arr=np.array([[3, 5, 2, 4],[7, 6, 8, 8],[1, 6, 7, 7]])

print(arr[0,0])

print(arr[2,-2])

OUTPUT:

3
7

iv. Slicing of Arrays

Getting and setting smaller subarrays within a larger array

Array Slicing: Accessing Subarrays


Just as we can use square brackets to access individual array elements, we can also use them to
access subarrays with the slice notation, marked by the colon (:) character. The NumPy slicing
syntax follows that of the standard Python list; to access a slice of an array x, use this:
x[start:stop:step]
If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1.

import numpy as np

arr=np.array([0,1,2,3,4,5,6,7,8,9])

arr1=np.arange(10)

print(arr[1:5])

print(arr1[2:7])

print(arr[5:])

print(arr[:5])

print(arr1[-3:-1])

print(arr1[::2])

print(arr[0::3])

print(arr1[::-1]) # all elements, reversed

OUTPUT:

[1 2 3 4]

[2 3 4 5 6]

[5 6 7 8 9]

[0 1 2 3 4]

[7 8]

[0 2 4 6 8]

[0 3 6 9]

[9 8 7 6 5 4 3 2 1 0]

Array Slicing: Multidimensional subarrays

import numpy as np

arr=np.array([[12, 5, 2, 4],[ 7, 6, 8, 8],[ 1, 6, 7, 7]])

print(arr[:2, :3]) # two rows, three columns

print(arr[:3, ::2]) # all rows, every other column

print(arr[::-1, ::-1]) #Finally, subarray dimensions can even be reversed together:

OUTPUT:

[[12 5 2]

[ 7 6 8]]

[[12 2]

[ 7 8]

[ 1 7]]

[[ 7 7 6 1]

[ 8 8 6 7]

[ 4 2 5 12]]

v. Reshaping of Arrays
Changing the shape of a given array. Another useful type of operation is reshaping of arrays. The most flexible way of doing this is with the reshape() method. For example, to put the numbers 0 through 8 in a 3×3 grid, you can do the following:

import numpy as np

grid = np.arange(0, 9)

print(grid)

print(grid.reshape(3,3))

OUTPUT:

[0 1 2 3 4 5 6 7 8]

[[0 1 2]

[3 4 5]

[6 7 8]]

RESULT:

EX NO:
WORKING WITH PANDAS DATA FRAMES

AIM:

PROCEDURE:

PROGRAM:

A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array or a table with rows and columns.
i. Create a Simple Pandas DataFrame

import pandas as pd
data={"calories":[420,380,390],"duration":[50,40,45]}
df=pd.DataFrame(data) #Load data into DataFrame object
print(df)
print(df.loc[0]) #Refer to the row index
print(df.loc[[0,1]]) #Use a list of indexes

OUTPUT:
calories duration
0 420 50
1 380 40
2 390 45

calories 420
duration 50
Name: 0, dtype: int64

calories duration
0 420 50
1 380 40

ii. a) Named Indexes


With the index argument, you can name your own indexes.

import pandas as pd
data={"calories":[420,380,390],"duration":[50,40,45]}
df=pd.DataFrame(data,index=["day1","day2","day3"])
print(df)

OUTPUT:
calories duration
day1 420 50
day2 380 40
day3 390 45

b) Locate Named Indexes
Use the named index in the loc attribute to return the specified row(s).

Use the syntax:


print(df.loc["day2"])
OUTPUT:
calories 380
duration 40
Name: day2, dtype: int64

iii. Creating Pandas dataframe from lists using dictionary

Method #1: Creating DataFrame using dictionary of lists


a)
With this method in pandas, we can transform a dictionary of lists into a dataframe.

import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
print(df)
# select two columns
print(df[['Name', 'Qualification']])

OUTPUT:
As is evident from the output, the keys of the dictionary are converted into columns of the dataframe, whereas the elements of the lists are converted into rows.

Name Age Address Qualification


0 Jai 27 Delhi Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannauj Phd

Name Qualification
0 Jai Msc
1 Princi MA
2 Gaurav MCA
3 Anuj Phd

b) Adding index to a dataframe explicitly

import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Adding index value explicitly
df = pd.DataFrame(data,index=['Rollno1','Rollno2','Rollno3','Rollno4'])
print(df)

OUTPUT:
Name Age Address Qualification
Rollno1 Jai 27 Delhi Msc
Rollno2 Princi 24 Kanpur MA
Rollno3 Gaurav 22 Allahabad MCA
Rollno4 Anuj 32 Kannauj Phd

Method #2: Using from_dict() function

import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame.from_dict(data) #from_dict() function
print(df)

OUTPUT:
Name Age Address Qualification
0 Jai 27 Delhi Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannauj Phd

Method #3: Creating a dataframe by passing list variables to a dictionary


import pandas as pd
#Dictionary of lists
name=['Jai', 'Princi', 'Gaurav', 'Anuj']
age=[27, 24, 22, 32]
address=['Delhi', 'Kanpur', 'Allahabad', 'Kannauj']
qualification=['Msc', 'MA', 'MCA', 'Phd']
data={'Name':name,'Age':age,'Address':address,'Qualification':qualification}
df=pd.DataFrame(data)
print(df)
df1=pd.DataFrame(data,index=['No1','No2','No3','No4']) #Explicitly add index value
print(df1)

OUTPUT
Name Age Address Qualification
0 Jai 27 Delhi Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannauj Phd

Name Age Address Qualification


No1 Jai 27 Delhi Msc
No2 Princi 24 Kanpur MA
No3 Gaurav 22 Allahabad MCA
No4 Anuj 32 Kannauj Phd

iv. Dealing with Rows and Columns in Pandas dataFrame


A data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. We can perform basic operations on rows/columns like selecting, deleting, adding and renaming.

a) Dealing with columns

In order to deal with columns, we perform basic operations on columns like selecting, deleting, adding and renaming. Selection and addition are shown first; deleting and renaming are sketched after the column-addition example.

Column selection
In order to select a column in a Pandas DataFrame, we can access the columns by calling them by their column names.

import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# select two columns
print(df[['Name', 'Qualification']])

OUTPUT:
Name Qualification
0 Jai Msc
1 Princi MA
2 Gaurav MCA
3 Anuj Phd

Column Addition
In order to add a column to a Pandas DataFrame, we can declare a new list as a column and add it to an existing DataFrame.

import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# Declare a list that is to be converted into a column
address=['Delhi', 'Kanpur', 'Allahabad', 'Kannauj']
df['Address']=address

# select two columns
print(df)

OUTPUT:
Name Age Qualification Address
0 Jai 27 Msc Delhi
1 Princi 24 MA Kanpur
2 Gaurav 22 MCA Allahabad
3 Anuj 32 Phd Kannauj
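
Column Deletion and Renaming
As mentioned above, columns can also be deleted and renamed; a minimal sketch reusing the DataFrame from the column-addition example (the new column name 'Degree' is chosen only for illustration):

# Rename a column; rename() returns a new DataFrame
df = df.rename(columns={'Qualification': 'Degree'})
print(df)
# Delete a column by name
df = df.drop(columns=['Address'])
print(df)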

v. Indexing and selecting data in a Pandas DataFrame using [ ], loc & iloc
Indexing in Pandas means selecting rows and columns of data from a DataFrame. It can mean selecting all of the rows and a subset of the columns, a subset of the rows and all of the columns, or a subset of both rows and columns. Indexing is also known as subset selection.

Creating a DataFrame to Select Rows & Columns in Pandas

Create a DataFrame from a list of tuples, with the column names ‘Name’, ‘Age’, ‘City’, and ‘Salary’.

import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,columns =['Name', 'Age','City', 'Salary'])

# Show the dataframe

print(df)
OUTPUT

Name Age City Salary

0 Stuti 28 Varanasi 20000

1 Saumya 32 Delhi 25000

2 Aaditya 25 Mumbai 40000

3 Saumya 32 Delhi 35000

4 Saumya 32 Delhi 30000

5 Saumya 32 Mumbai 20000

6 Aaditya 40 Dehradun 24000

7 Seema 32 Delhi 70000

Select Columns by Name in Pandas DataFrame using [ ]


The [ ] is used to select a column by mentioning the respective column name.
Example 1:
Select a single column.
import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,

columns=['Name', 'Age','City', 'Salary'])

# Using the operator []

# to select a column

result = df["City"]

# Show the dataframe

print(result)

OUTPUT:

0 Varanasi

1 Delhi

2 Mumbai

3 Delhi

4 Delhi

5 Mumbai

6 Dehradun

7 Delhi

Name: City, dtype: object

Example 2:
Select multiple columns.
import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,columns =['Name', 'Age', 'City', 'Salary'])

# Using the operator [] to

# select multiple columns

result = df[["Name", "Age", "Salary"]]

# Show the dataframe

print(result)

OUTPUT:

Name Age Salary

0 Stuti 28 20000

1 Saumya 32 25000

2 Aaditya 25 40000

3 Saumya 32 35000

4 Saumya 32 30000

5 Saumya 32 20000

6 Aaditya 40 24000

7 Seema 32 70000

Select Rows by Name in Pandas DataFrame using loc
The .loc[] function selects the data by labels of rows or columns. It can select a subset of rows
and columns. There are many ways to use this function.
Example 1:
Select a single row.
import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,columns =['Name', 'Age','City', 'Salary'])

# Set 'Name' column as index

# on a Dataframe

df.set_index("Name", inplace = True)

# Using the operator .loc[]

# to select single row

result = df.loc["Stuti"]

# Show the dataframe

print(result)

OUTPUT:

Age 28

City Varanasi

Salary 20000

Name: Stuti, dtype: object

Example 2:
Select multiple rows.

# import pandas

import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,columns =['Name', 'Age','City', 'Salary'])

# Set 'Name' column as index

# on a Dataframe

df.set_index("Name", inplace = True)

# Using the operator .loc[]

# to select multiple rows

result = df.loc[["Stuti","Seema","Aaditya"]]

# Show the dataframe

print(result)

OUTPUT:

Name Age City Salary

Stuti 28 Varanasi 20000

Seema 32 Delhi 70000

Aaditya 25 Mumbai 40000

Aaditya 40 Dehradun 24000

Example 3:
Select multiple rows and particular columns.
Syntax: Dataframe.loc[["row1", "row2"...], ["column1", "column2", "column3"...]]

# import pandas

import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,columns =['Name', 'Age','City', 'Salary'])

# Set 'Name' column as index

# on a Dataframe

df.set_index("Name", inplace = True)

# Using the operator .loc[] to

# select multiple rows with some

# multiple columns

result = df.loc[["Stuti", "Seema"],["City", "Salary"]]

# Show the dataframe

print(result)

OUTPUT:

Name City Salary

Stuti Varanasi 20000

Seema Delhi 70000

Example 4:
Select all the rows with some particular columns. We use a single colon [ : ] to select all rows and a list of the columns that we want to select, as given below:
Syntax: Dataframe.loc[:, ["column1", "column2", "column3"]]

# import pandas

import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Creating a DataFrame object from list

df = pd.DataFrame(employees,

columns =['Name', 'Age','City', 'Salary'])

# Set 'Name' column as index

# on a Dataframe

df.set_index("Name", inplace = True)

# Using the operator .loc[] to

# select all the rows with

# some particular columns

result = df.loc[:, ["City", "Salary"]]

# Show the dataframe

print(result)

OUTPUT:

Name City Salary

Stuti Varanasi 20000

Saumya Delhi 25000

Aaditya Mumbai 40000

Saumya Delhi 35000

Saumya Delhi 30000

Saumya Mumbai 20000

Aaditya Dehradun 24000

Seema Delhi 70000

vi. Select Rows by Index in Pandas DataFrame using iloc


The iloc[ ] indexer is used for selection based on position. It is similar to the loc[] indexer, but it takes only integer values to make selections.

Example 1:

Select a single row.

# import pandas

import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,columns =['Name', 'Age','City', 'Salary'])

# Using the operator .iloc[]

# to select single row

result = df.iloc[2]

# Show the dataframe

print(result)

OUTPUT:

Name Aaditya

Age 25

City Mumbai

Salary 40000

Name: 2, dtype: object

Example 2:
Select multiple rows.

# import pandas

import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,

columns=['Name', 'Age','City', 'Salary'])

# Using the operator .iloc[]

# to select multiple rows

result = df.iloc[[2, 3, 5]]

# Show the dataframe

print(result)

OUTPUT:

Name Age City Salary

2 Aaditya 25 Mumbai 40000

3 Saumya 32 Delhi 35000

5 Saumya 32 Mumbai 20000

Example 3:
Select multiple rows with some particular columns.

# import pandas

import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Creating a DataFrame object from list

df = pd.DataFrame(employees,

columns =['Name', 'Age','City', 'Salary'])

# Using the operator .iloc[]

# to select multiple rows with

# some particular columns

result = df.iloc[[2, 3, 5],[0, 1]]

# Show the dataframe

print(result)

OUTPUT:

Name Age

2 Aaditya 25

3 Saumya 32

5 Saumya 32

Example 4:
Select all the rows with some particular columns.

# import pandas

import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,columns =['Name', 'Age','City', 'Salary'])

# Using the operator .iloc[]

# to select all the rows with

# some particular columns

result = df.iloc[:, [0, 1]]

# Show the dataframe

print(result)

OUTPUT:

Name Age

0 Stuti 28

1 Saumya 32

2 Aaditya 25

3 Saumya 32

4 Saumya 32

5 Saumya 32

6 Aaditya 40

7 Seema 32

RESULT:

EX NO: READING DATA FROM TEXT FILES, EXCEL, AND THE WEB AND
EXPLORING VARIOUS COMMANDS FOR DOING DESCRIPTIVE
ANALYTICS ON THE IRIS DATA SET

AIM:

PROCEDURE:

PROGRAM:

import pandas as pd
import matplotlib.pyplot as plt

# Reading the CSV file

df = pd.read_csv(r"D:\iris_csv.csv")   # raw string avoids backslash escape issues in the path

# Printing top 5 rows

print(df.head())

print(df.shape)

print(df.info())

print(df.describe())

print(df.isnull().sum())

print(df.sample(10))

print(df.columns)

print(df)

#data[start:end]

#start is inclusive whereas end is exclusive

print(df[10:21])

# it will print the rows from 10 to 20.

# you can also save it in a variable for further use in analysis

sliced_data=df[10:21]

print(sliced_data)

# data["column_name"].sum()

sum_data = df["sepallength"].sum()

mean_data = df["sepallength"].mean()

median_data = df["sepallength"].median()

print("Sum:",sum_data, "\nMean:", mean_data, "\nMedian:",median_data)

min_data=df["sepallength"].min()

max_data=df["sepallength"].max()

print("Minimum:",min_data, "\nMaximum:", max_data)

print(df["class"].value_counts())

# The pandas plot extension can be used to make a scatterplot

# Display your plot with plt.show()

df.plot(kind="scatter", x="sepallength", y="sepalwidth")

#To change color and size, add the following:

df.plot(kind="scatter", x="sepallength", y="sepalwidth",color="green",s=70)

plt.show()
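
The experiment title also covers reading data from Excel files and the web; the program above only reads a local CSV file. A minimal sketch, assuming an Excel copy of the data exists at the path shown and that the machine is online (the file path and URL are placeholders, and read_excel needs an Excel engine such as openpyxl installed):

# Reading data from an Excel workbook (placeholder path)
df_xl = pd.read_excel(r"D:\iris.xlsx", sheet_name=0)
print(df_xl.head())

# Reading a CSV file directly from a URL (placeholder URL)
df_web = pd.read_csv("https://example.com/iris.csv")
print(df_web.head())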

OUTPUT:

sepallength sepalwidth petallength petalwidth class

0 5.1 3.5 1.4 0.2 Iris-setosa

1 4.9 3.0 1.4 0.2 Iris-setosa

2 4.7 3.2 1.3 0.2 Iris-setosa

3 4.6 3.1 1.5 0.2 Iris-setosa

4 5.0 3.6 1.4 0.2 Iris-setosa

(150, 5)

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 150 entries, 0 to 149

Data columns (total 5 columns):

# Column Non-Null Count Dtype

0 sepallength 150 non-null float64

1 sepalwidth 150 non-null float64

2 petallength 150 non-null float64

3 petalwidth 150 non-null float64

4 class 150 non-null object

dtypes: float64(4), object(1)

memory usage: 6.0+ KB

None

sepallength sepalwidth petallength petalwidth

count 150.000000 150.000000 150.000000 150.000000

mean 5.843333 3.054000 3.758667 1.198667

std 0.828066 0.433594 1.764420 0.763161

min 4.300000 2.000000 1.000000 0.100000

25% 5.100000 2.800000 1.600000 0.300000

50% 5.800000 3.000000 4.350000 1.300000

75% 6.400000 3.300000 5.100000 1.800000

max 7.900000 4.400000 6.900000 2.500000

sepallength 0

sepalwidth 0

petallength 0

petalwidth 0

class 0

dtype: int64

sepallength sepalwidth petallength petalwidth class

113 5.7 2.5 5.0 2.0 Iris-virginica

120 6.9 3.2 5.7 2.3 Iris-virginica

116 6.5 3.0 5.5 1.8 Iris-virginica

105 7.6 3.0 6.6 2.1 Iris-virginica

93 5.0 2.3 3.3 1.0 Iris-versicolor

30 4.8 3.1 1.6 0.2 Iris-setosa

27 5.2 3.5 1.5 0.2 Iris-setosa

26 5.0 3.4 1.6 0.4 Iris-setosa

17 5.1 3.5 1.4 0.3 Iris-setosa

136 6.3 3.4 5.6 2.4 Iris-virginica

Index(['sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'class'], dtype='object')

sepallength sepalwidth petallength petalwidth class

0 5.1 3.5 1.4 0.2 Iris-setosa

1 4.9 3.0 1.4 0.2 Iris-setosa

2 4.7 3.2 1.3 0.2 Iris-setosa

3 4.6 3.1 1.5 0.2 Iris-setosa

4 5.0 3.6 1.4 0.2 Iris-setosa

.. ... ... ... ... ...

145 6.7 3.0 5.2 2.3 Iris-virginica

146 6.3 2.5 5.0 1.9 Iris-virginica

147 6.5 3.0 5.2 2.0 Iris-virginica

148 6.2 3.4 5.4 2.3 Iris-virginica

149 5.9 3.0 5.1 1.8 Iris-virginica

[150 rows x 5 columns]

sepallength sepalwidth petallength petalwidth class

10 5.4 3.7 1.5 0.2 Iris-setosa

11 4.8 3.4 1.6 0.2 Iris-setosa

12 4.8 3.0 1.4 0.1 Iris-setosa

13 4.3 3.0 1.1 0.1 Iris-setosa

14 5.8 4.0 1.2 0.2 Iris-setosa

15 5.7 4.4 1.5 0.4 Iris-setosa

16 5.4 3.9 1.3 0.4 Iris-setosa

17 5.1 3.5 1.4 0.3 Iris-setosa

18 5.7 3.8 1.7 0.3 Iris-setosa

19 5.1 3.8 1.5 0.3 Iris-setosa

20 5.4 3.4 1.7 0.2 Iris-setosa

sepallength sepalwidth petallength petalwidth class

10 5.4 3.7 1.5 0.2 Iris-setosa

11 4.8 3.4 1.6 0.2 Iris-setosa

12 4.8 3.0 1.4 0.1 Iris-setosa

13 4.3 3.0 1.1 0.1 Iris-setosa

14 5.8 4.0 1.2 0.2 Iris-setosa

15 5.7 4.4 1.5 0.4 Iris-setosa

16 5.4 3.9 1.3 0.4 Iris-setosa

17 5.1 3.5 1.4 0.3 Iris-setosa

18 5.7 3.8 1.7 0.3 Iris-setosa

19 5.1 3.8 1.5 0.3 Iris-setosa

20 5.4 3.4 1.7 0.2 Iris-setosa

Sum: 876.5

Mean: 5.843333333333335

Median: 5.8

Minimum: 4.3

Maximum: 7.9

Iris-setosa 50

Iris-versicolor 50

Iris-virginica 50

Name: class, dtype: int64

RESULT:

EX NO: USE THE DIABETES DATA SET FROM UCI AND PIMA INDIANS DIABETES
DATA SET FOR PERFORMING UNIVARIATE ANALYSIS, BIVARIATE
ANALYSIS AND MULTIPLE REGRESSION ANALYSIS

AIM:

PROCEDURE:

PROGRAM:

a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
# Reading the CSV file
df = pd.read_csv(r"D:\di.csv")   # raw string avoids backslash escape issues in the path
print(df)
#Mean
mean=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].mean()
print(mean)
#Median
median=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].median()
print(median)
#Mode
mode=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].mode()
print(mode)
#Variance
variance=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].var()
print(variance)
#StandardDeviation
sd=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].std()
print(sd)
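
The heading for this part also lists frequency, skewness and kurtosis, which the program above does not print; a minimal sketch continuing with the same DataFrame and columns:

#Frequency of values in a column
print(df["Age"].value_counts())
#Skewness
skewness=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].skew()
print(skewness)
#Kurtosis
kurtosis=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].kurt()
print(kurtosis)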

b. Bivariate analysis: Linear and logistic regression modeling

# Scatter plot of Age vs Glucose


plt.scatter(df.Age,df.Glucose)
plt.title('Age vs Glucose')
plt.xlabel('Age')
plt.ylabel('Glucose')
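
The heading for this part also names logistic regression, which needs a binary (0/1) response; the data used above has no such column, so the sketch below assumes a hypothetical binary column called 'Outcome' (present in the full Pima Indians data set but not in di.csv as shown):

# 'Outcome' is a hypothetical 0/1 column, not part of the di.csv preview shown in the output
y_bin=df['Outcome']
x_bin=sm.add_constant(df[['Glucose']])
logit_model=sm.Logit(y_bin,x_bin).fit()
print(logit_model.summary())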

c. Multiple Regression analysis


d. Also compare the results of the above analysis for the two data sets.

#print(df.corr()) #Correlation Coefficient


y=df['Glucose']
x=df[['Age']]
x=sm.add_constant(x)
model=sm.OLS(y,x).fit()
print(model.summary())
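
The OLS fit above uses a single predictor (Age), which is simple rather than multiple regression; a minimal sketch of the multiple regression named in part c, adding further predictors from the same DataFrame (the OUTPUT below still shows the single-predictor fit):

y=df['Glucose']
x=df[['Age','Pregnancies','BP','Insulin']]   # several predictors instead of one
x=sm.add_constant(x)
multi_model=sm.OLS(y,x).fit()
print(multi_model.summary())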

OUTPUT:

Pregnancies Glucose BP Insulin Diabetes Age

0 2 148 72 0 0.627 50

1 1 85 66 0 0.351 31

2 3 183 64 0 0.672 32

3 0 89 66 94 0.167 21

4 3 137 40 168 2.288 33

5 2 116 74 0 0.201 30

6 0 78 50 88 0.248 26

7 5 115 0 0 0.134 29

8 4 197 70 543 0.158 53

9 1 166 96 0 0.232 54

Pregnancies 2.1000

Glucose 131.4000

BP 59.8000

Insulin 89.3000

Diabetes 0.5078

Age 35.9000

dtype: float64

Pregnancies 2.00

Glucose 126.50

BP 66.00

Insulin 0.00

Diabetes 0.24

Age 31.50

dtype: float64

Pregnancies Glucose BP Insulin Diabetes Age

0 0.0 78 66.0 0.0 0.134 21

1 1.0 85 NaN NaN 0.158 26

2 2.0 89 NaN NaN 0.167 29

3 3.0 115 NaN NaN 0.201 30

4 NaN 116 NaN NaN 0.232 31

5 NaN 137 NaN NaN 0.248 32

6 NaN 148 NaN NaN 0.351 33

7 NaN 166 NaN NaN 0.627 50

8 NaN 183 NaN NaN 0.672 53

9 NaN 197 NaN NaN 2.288 54

Pregnancies 2.766667

Glucose 1753.155556

BP 658.177778

Insulin 28878.677778

Diabetes 0.427865

Age 140.988889

dtype: float64

Pregnancies 1.663330

Glucose 41.870700

BP 25.654976

Insulin 169.937276

Diabetes 0.654114

Age 11.873874

dtype: float64

C:\Users\NIRMALKUMAR\anaconda3\lib\site-packages\scipy\stats\stats.py:1541:
UserWarning: kurtosistest only valid for n>=20 .... continuing anyway, n=10

warnings.warn("kurtosistest only valid for n>=20 .... continuing "

OLS Regression Results

==============================================================================

Dep. Variable: Glucose R-squared: 0.563

Model: OLS Adj. R-squared: 0.508

Method: Least Squares F-statistic: 10.29

Date: Sat, 12 Nov 2022 Prob (F-statistic): 0.0125

Time: 19:10:53 Log-Likelihood: -46.873

No. Observations: 10 AIC: 97.75

Df Residuals: 8 BIC: 98.35

Df Model: 1

Covariance Type: nonrobust

==============================================================================

coef std err t P>|t| [0.025 0.975]

------------------------------------------------------------------------------

const 36.4400 31.021 1.175 0.274 -35.095 107.975

Age 2.6451 0.824 3.208 0.012 0.744 4.546

==============================================================================

Omnibus: 4.877 Durbin-Watson: 2.460

Prob(Omnibus): 0.087 Jarque-Bera (JB): 1.759

Skew: 0.990 Prob(JB): 0.415

Kurtosis: 3.552 Cond. No. 126.

==============================================================================

RESULT:

EX NO: APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON UCI DATA
SETS.

AIM:

PROCEDURE:

PROGRAM

a. Normal curves

# Importing required libraries

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

plt.style.use('seaborn-whitegrid')

# Creating a series of data in the range 1-100

x = np.linspace(1,100,50)

#Creating a Function.

def normal_dist(x, mean, sd):
    # Probability density function of the normal distribution
    prob_density = (1.0 / (sd * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / sd) ** 2)
    return prob_density

#Calculate mean and Standard deviation.

mean = np.mean(x)

sd = np.std(x)

#Apply function to the data.

pdf = normal_dist(x,mean,sd)

#Plotting the Results

plt.plot(x,pdf , color = 'red')

plt.xlabel('Data points')

plt.ylabel('Probability Density')

b. Density and contour plots

#%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
import numpy as np
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.contour(X, Y, Z, colors='black') #Visualizing three-dimensional data with contours
plt.contour(X, Y, Z, 20, cmap='RdGy') # Visualizing three-dimensional data with colored contours

c. Correlation and scatter plots

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

rand=np.random.RandomState(10)

x=rand.randint(100,size=20)

y = np.sin(x)

plt.plot(x, y, 'o', color='black')

d. Histograms

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
rand=np.random.RandomState(0)
x=rand.randint(10,size=5)
plt.hist(x)

e. Three dimensional plotting

from mpl_toolkits import mplot3d


#%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
ax = plt.axes(projection='3d')
# Data for a three-dimensional line
zline = np.linspace(0, 15, 100)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
# Data for three-dimensional scattered points
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens')

OUTPUT:

a) Normal Curve

b) Density and Contour Plots

c) Correlation and Scatter Plots

d) Histogram

e) Three dimensional plotting

RESULT:

EX NO:
VISUALIZING GEOGRAPHIC DATA WITH BASEMAP

AIM:

PROCEDURE:

Basemap() Package Installation
Installation of Basemap is straightforward; if you’re using conda you can type this and the
package will be downloaded:

conda install -c anaconda basemap

Description

The Basemap toolkit is a library for plotting 2D data on maps in Python. It is similar in functionality to the MATLAB mapping toolbox, the IDL mapping facilities, GrADS, or the Generic Mapping Tools.

PROGRAM:

from mpl_toolkits.basemap import Basemap


import matplotlib.pyplot as plt
fig = plt.figure(figsize = (12,12))
m = Basemap()
#Draw coastlines
m.drawcoastlines()
plt.title("Coastlines", fontsize=20)
plt.show()
#Draw Country boundaries
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries()
plt.title("Country boundaries", fontsize=20)
plt.show()
#Draw major rivers
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.drawrivers(linewidth=0.5, linestyle='solid', color='#0000ff')
plt.title("Major rivers", fontsize=20)
plt.show()
#Filled map boundary
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.fillcontinents(color='coral',lake_color='aqua')
m.drawmapboundary(color='b', linewidth=2.0, fill_color='aqua')
plt.title("Filled map boundary", fontsize=20)
plt.show()
#Orthographic Projection
fig = plt.figure(figsize = (10,8))
m = Basemap(projection='ortho', lon_0 = 25, lat_0 = 10)

m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
m.drawmapboundary(fill_color='lightblue')
plt.title("Orthographic Projection", fontsize=18)

OUTPUT:

RESULT:

