Data Science Papers

The document outlines a test for the Data Science course at SRM Institute of Science and Technology for the academic year 2024-25, detailing the structure, including course outcomes, question types, and marks distribution. It includes questions on data manipulation using Python libraries such as NumPy and Pandas, as well as concepts like web scraping and data acquisition methods. The test is divided into two parts, with Part A focusing on specific coding tasks and Part B on broader data science concepts and processes.


Register Number: __________                                        Set - A

SRM Institute of Science and Technology
College of Engineering and Technology, School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (Even)

Test: FT1                                          Date: 25-02-2025
Course Code & Title: 21CSS303T - Data Science      Duration: 50 Minutes
Year & Sem: III Year / VI Sem                      Max. Marks: 25

Course Articulation Matrix:
Course Outcome  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1              -   -   -   -   1   -   -   -   -   -    -    -
CO2              -   -   -   -   1   -   -   -   -   -    -    -
Note: CO1 - To understand the relationship between data
      CO2 - Identify the different data structures to represent data
Part – A (5 x 2 = 10 Marks)
Answer ALL the questions.
(Each question carries: Marks, BL, CO, PO, PI Code.)
Q1. How do you concatenate two NumPy arrays along a specified axis? [2 marks, BL 2, CO 1, PO 5, PI 5.4.1]

Use numpy.concatenate() to concatenate two NumPy arrays along a specified axis.
Example code:

import numpy as np
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6]])
result = np.concatenate((arr1, arr2), axis=0)  # concatenates along rows
print(result)

Q2. How can you filter rows of a Pandas DataFrame based on a condition? [2 marks, BL 3, CO 1, PO 5, PI 5.4.1]

Use boolean indexing or the .loc[] accessor in Pandas.
Example code:

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
filtered_df = df[df['Age'] > 28]  # selects rows where Age > 28
print(filtered_df)
Q3. Write a Python program to get the positions of items of ser2 in ser1 as a list. [2 marks, BL 3, CO 2, PO 5, PI 5.4.2]

Input:
ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

Code:
import pandas as pd
ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])
positions = [ser1[ser1 == val].index[0] for val in ser2]
print(positions)  # Output: [5, 4, 0, 8]
Q4. What is the difference between a Pandas Series and a DataFrame? [2 marks, BL 1, CO 2, PO 5, PI 5.4.1]

- Pandas Series: a one-dimensional labelled array that can hold any data type (like a single column of a table).
- Pandas DataFrame: a two-dimensional table-like structure with labelled rows and columns (like a spreadsheet).
Q5. What is Web Scraping? Explain the steps involved with an example. [2 marks, BL 1, CO 1, PO 5, PI 5.6.1]

Web scraping is the process of extracting data from websites using automated scripts.
Steps:
1. Send an HTTP request to the website.
2. Parse the HTML content.
3. Extract the required information.
4. Store the data in a structured format (CSV, database, etc.).

Example code:
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
print(soup.title.text)  # extracts the page title
Part – B (3 x 5 = 15 Marks)
(Each question carries: Marks, BL, CO, PO, PI Code.)


Q1. Imagine you are working as a Data Scientist for an e-commerce company that wants to improve customer satisfaction by analyzing user behavior on its platform. Your task is to collect and analyze data to identify patterns that impact customer experience and purchase decisions. Brief the different phases involved in your assignment. [5 marks, BL 3, CO 1, PO 5, PI 5.4.1]

Phases involved in analyzing customer behavior on an e-commerce platform:

1. Data Collection:
   - Gather data from various sources, such as user clicks, product views, purchase history, and customer reviews.
   - Data can be obtained from databases, web logs, APIs, or third-party sources.

2. Data Cleaning and Preprocessing:
   - Handle missing values, duplicate records, and incorrect data.
   - Standardize formats (e.g., date formats, categorical values).
   - Remove irrelevant or noisy data (e.g., bot-generated interactions).

3. Exploratory Data Analysis (EDA):
   - Use statistical methods and visualizations to identify trends and patterns.
   - Example: finding which products are frequently bought together.
   - Tools: Pandas, Matplotlib, and Seaborn for data exploration.

4. Feature Engineering and Data Transformation:
   - Extract meaningful features from raw data.
   - Example: creating a "customer lifetime value" feature based on past purchases.
   - Convert categorical data into numerical format for machine learning models.

5. Model Building and Analysis:
   - Apply machine learning algorithms (e.g., clustering for customer segmentation, recommendation systems for personalized shopping).
   - Example: predicting which users are likely to abandon their cart.
   - Use Scikit-Learn, TensorFlow, or PyTorch for modeling.

6. Visualization and Reporting:
   - Present insights using dashboards, reports, and visualizations.
   - Example: using Tableau or Power BI to display sales trends.
   - Helps stakeholders make data-driven decisions.

By following these phases, an e-commerce company can improve customer satisfaction and increase sales through a better user experience.

Q2. Explain the following NumPy operations with an example. [5 marks, BL 2, CO 2, PO 5, PI 5.4.2]
- Indexing of an array
- Slicing of an array
- Reshaping of an array
- Joining and splitting of arrays
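The key does not include worked code for this question; the following is a minimal illustrative sketch (arbitrary sample arrays, not taken from the key) covering the four operations listed above.

import numpy as np

a = np.arange(12)                # [0, 1, ..., 11]

# Indexing: pick a single element by position
print(a[3])                      # 3

# Slicing: pick a sub-range of elements
print(a[2:7])                    # [2 3 4 5 6]

# Reshaping: view the same data as a 3 x 4 matrix
m = a.reshape(3, 4)
print(m)

# Joining: stack two arrays along an axis
joined = np.concatenate((m, m), axis=0)   # 6 x 4 array

# Splitting: cut an array back into equal pieces
left, right = np.split(a, 2)              # two arrays of length 6
print(left, right)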

Q3. Describe various ways of data acquisition. Discuss the significance of Web APIs, Open Data Sources, and Web Scraping with practical examples. [5 marks, BL 2, CO 2, PO 5, PI 5.4.1]

1. Manual Entry:
   - Data is manually collected from surveys, reports, or research papers.
   - Suitable for small datasets but time-consuming for large-scale analysis.

2. Database Queries:
   - Extracting data from relational databases such as MySQL or PostgreSQL.
   - Example SQL query:
   - Used for structured and historical data analysis.

3. Web APIs (Application Programming Interfaces):
   - APIs provide programmatic access to data from various platforms.
   - Example: fetching weather data using an API (a short illustrative sketch follows this list).
   - Used in automation, machine learning applications, and real-time data analysis.

4. Open Data Sources:
   - Government and research institutions provide free datasets.
   - Example sources: Kaggle (https://www.kaggle.com/datasets), UCI Machine Learning Repository, Google Dataset Search.
   - Used in academic research, public policy analysis, and training machine learning models.

5. Web Scraping:
   - Extracts data from websites automatically.
   - Steps involved:
     1. Send an HTTP request to the website.
     2. Parse the HTML content using BeautifulSoup.
     3. Extract relevant information.
     4. Store the data in CSV, a database, etc.
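A hedged sketch of item 3, acquiring data through a Web API and storing it for analysis. The endpoint URL, query parameter, and field names are placeholders, not taken from the key; a real weather or open-data API would differ.

import requests
import pandas as pd

url = "https://api.example.com/weather"                 # hypothetical endpoint
response = requests.get(url, params={"city": "Chennai"}, timeout=10)
response.raise_for_status()                             # stop early on HTTP errors

records = response.json()                               # most APIs return JSON
df = pd.DataFrame(records)                              # tabulate the records
df.to_csv("weather_data.csv", index=False)              # store for later analysis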

Course Outcome (CO) and Bloom's Level (BL) Coverage in Questions
[Charts: CO coverage for CO1 and CO2; BL coverage – BL1: 16%, BL2: 48%, BL3: 36%.]
Register Number: __________                                        Set - B

SRM Institute of Science and Technology
College of Engineering and Technology, School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (Even)

Test: FT1                                          Date: 25-02-2025
Course Code & Title: 21CSS303T - Data Science      Duration: 50 Minutes
Year & Sem: III Year / VI Sem                      Max. Marks: 25

Course Articulation Matrix:
Course Outcome  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1              -   -   -   -   1   -   -   -   -   -    -    -
CO2              -   -   -   -   1   -   -   -   -   -    -    -
Note: CO1 - To understand the relationship between data
      CO2 - Identify the different data structures to represent data
Part – A (5 x 2 = 10 Marks)
Answer ALL the questions.
(Each question carries: Marks, BL, CO, PO, PI Code.)
Q1. Given the NumPy array arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), write the code to extract the second column as a 1D array. [2 marks, BL 3, CO 2, PO 5, PI 5.4.2]

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Extract the second column (index 1) as a 1D array
second_column = arr[:, 1]
print(second_column)

Output:
[2 5 8]
Q2. How do you select a column from a Pandas DataFrame? Write the code. [2 marks, BL 1, CO 2, PO 5, PI 5.4.1]

A column can be selected from a Pandas DataFrame using its column name.

import pandas as pd

# Create a DataFrame
data = {'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]}
df = pd.DataFrame(data)

# Select column 'B' as a Series
column_b = df['B']
print(column_b)

Output:
0    2
1    5
2    8
Name: B, dtype: int64
Q3. Mention two sources from which data can be acquired for analysis. [2 marks, BL 1, CO 1, PO 5, PI 5.5.1]

Two common sources from which data can be acquired for analysis:
1. Web APIs
   - Many online services provide APIs to fetch structured data in formats like JSON or XML.
   - Example: the Twitter API for social media analysis, the OpenWeather API for weather data, and financial APIs for stock market data.
2. Public Datasets and Open Data Portals
   - Governments, research organizations, and companies provide free datasets for public use.
   - Example: Kaggle (https://www.kaggle.com/datasets), Google Dataset Search, and the UCI Machine Learning Repository.
Q4. Write a Python program to add, subtract, multiply and divide two Pandas Series. [2 marks, BL 2, CO 1, PO 5, PI 5.4.2]
Sample Series: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]

import pandas as pd

# Create two Pandas Series
series1 = pd.Series([2, 4, 6, 8, 10])
series2 = pd.Series([1, 3, 5, 7, 9])

# Perform element-wise arithmetic operations
addition = series1 + series2
subtraction = series1 - series2
multiplication = series1 * series2
division = series1 / series2

# Display results
print("Addition:\n", addition)
print("\nSubtraction:\n", subtraction)
print("\nMultiplication:\n", multiplication)
print("\nDivision:\n", division)

Output
Addition:
0 3
1 7
2 11
3 15
4 19
dtype: int64

Subtraction:
0 1
1 1
2 1
3 1
4 1
dtype: int64

Multiplication:
0 2
1 12
2 30
3 56
4 90
dtype: int64

Division:
0 2.000000
1 1.333333
2 1.200000
3 1.142857
4 1.111111
dtype: float64
Q5. What are Web APIs and how are they used in Data Acquisition? [2 marks, BL 2, CO 1, PO 5, PI 5.4.1]

Web APIs (Application Programming Interfaces) are a set of rules and protocols that allow different software applications to communicate with each other over the internet. They enable applications to request and exchange data, typically in a structured format such as JSON or XML.

In the context of data acquisition, Web APIs are used to retrieve or send data from one system to another, allowing data collection from remote sources such as databases, external systems, or online services to be automated.
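Illustrative only: a hedged sketch of a single Web API call during data acquisition; the endpoint and response structure are placeholders, not a specific service.

import requests

resp = requests.get("https://api.example.com/products", timeout=10)
if resp.status_code == 200:
    data = resp.json()              # structured data, typically JSON
    print(type(data), len(data))    # inspect what the API returned
else:
    print("Request failed with status:", resp.status_code)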

Part – B (3 x 5 = 15 Marks)
(Each question carries: Marks, BL, CO, PO, PI Code.)


Q1. Explain the complete Data Science Process in detail with suitable real-world examples. [5 marks, BL 2, CO 1, PO 5, PI 5.4.1]

Marking scheme: diagram – 1 mark; explanation of each stage – 4 marks.

Q2. You're tasked with exploring a large dataset using Pandas. You suspect there might be a relationship between two columns: 'age' (numerical) and 'purchase_category' (categorical). Describe how you would use Pandas to investigate this potential relationship. Mention TWO specific Pandas functions you would use and explain their purpose in this context. [5 marks, BL 2, CO 2, PO 5, PI 5.5.1]

To explore the relationship between 'age' (numerical) and 'purchase_category' (categorical), the following two Pandas functions can be used:

1. groupby() (2.5 marks)
   - Groups the data by the categorical column ('purchase_category') and then computes summary statistics for the numerical column ('age').
   - Purpose: helps in understanding the distribution of ages across different purchase categories.
   - Example usage:

import pandas as pd

# Sample DataFrame
data = {'age': [25, 34, 45, 23, 41, 36, 29, 50],
        'purchase_category': ['Electronics', 'Clothing', 'Electronics',
                              'Books', 'Books', 'Clothing', 'Electronics', 'Books']}
df = pd.DataFrame(data)

# Summary statistics of age per purchase category (completing the example above)
print(df.groupby('purchase_category')['age'].describe())

2. value_counts() (on grouped data) (2.5 marks)
   - Purpose: counts the occurrences of different purchase categories within specific age groups to identify buying patterns.
   - Example usage:

# Create age bins
df['age_group'] = pd.cut(df['age'], bins=[20, 30, 40, 50, 60],
                         labels=['20-30', '30-40', '40-50', '50-60'])

# Count how many purchases fall in each category within each age group
purchase_counts = df.groupby('age_group')['purchase_category'].value_counts()
print(purchase_counts)

Q3. You are developing a price comparison tool to track the price of a specific product (e.g., "iPhone 15" or "Samsung Galaxy S23") from multiple e-commerce websites such as Amazon, eBay, and Walmart. Explain the key steps involved in performing web scraping for this task, covering aspects such as identifying the target websites, extracting the relevant data, handling dynamic content, and storing the collected information for further analysis. [5 marks, BL 3, CO 2, PO 5, PI 5.5.1]

Step 1: Identifying Target Websites (1 mark)
- Choose the e-commerce platforms to track prices from, such as Amazon, eBay, and Walmart.
- Analyze the website structure by inspecting product pages to find the relevant elements (e.g., price, product name, availability).
- Ensure that scraping these sites complies with their Terms of Service to avoid legal issues.

Step 2: Extracting Relevant Data (1 mark)
To extract product information, we need:
- Product name
- Price
- Availability
- Seller information
- Product URL

Step 3: Handling Dynamic Content (JavaScript-Rendered Websites) (1 mark)
- Some websites load prices dynamically using JavaScript, making BeautifulSoup alone insufficient.
- Solution: use Selenium or Scrapy to simulate user interaction and fetch the rendered content.

Step 4: Storing Collected Data (1 mark)
The extracted data should be stored (e.g., in a CSV file or database) for further analysis.

Step 5: Automating Price Tracking (1 mark)
- Use scheduled tasks (cron jobs on Linux, Task Scheduler on Windows) to run the scraper at intervals (e.g., daily).
- Send email alerts when the price drops below a threshold.

A short illustrative sketch of Steps 2-4 follows.
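A hedged sketch of Steps 2-4: fetching one product page, parsing a price, and appending it to a CSV. The URL and the CSS class used for the price are placeholders, not real site markup; JavaScript-heavy pages would need Selenium instead.

import csv
from datetime import datetime

import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/product/iphone-15"      # placeholder product URL
resp = requests.get(url, headers={"User-Agent": "price-tracker"}, timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

price_tag = soup.find("span", {"class": "price"})      # hypothetical class name
price = price_tag.text.strip() if price_tag else "N/A"

# Append one timestamped observation for later analysis
with open("prices.csv", "a", newline="") as f:
    csv.writer(f).writerow([datetime.now().isoformat(), url, price])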

Course Outcome (CO) and Bloom's Level (BL) Coverage in Questions
[Charts: CO coverage for CO1 and CO2; BL coverage – BL1: 16%, BL2: 48%, BL3: 36%.]
Register Number: __________                                        Set - C

SRM Institute of Science and Technology
College of Engineering and Technology, School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (Even)

Test: FT1                                          Date: 25-02-2025
Course Code & Title: 21CSS303T - Data Science      Duration: 50 Minutes
Year & Sem: III Year / VI Sem                      Max. Marks: 25

Course Articulation Matrix:
Course Outcome  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1              -   -   -   -   1   -   -   -   -   -    -    -
CO2              -   -   -   -   1   -   -   -   -   -    -    -
Note: CO1 - To understand the relationship between data
      CO2 - Identify the different data structures to represent data
Part – A (5 x 2 = 10 Marks)
Answer ALL the questions.
(Each question carries: Marks, BL, CO, PO, PI Code.)
Q1. What is the goal of the "exploratory data analysis" phase? [2 marks, BL 1, CO 1, PO 5, PI 5.6.1]
Q2. Write the syntax to create a 1D NumPy array from a Python list. [2 marks, BL 1, CO 1, PO 5, PI 5.6.1]
Q3. Why are NumPy arrays more efficient than Python lists for numerical operations? [2 marks, BL 2, CO 1, PO 5, PI 5.4.1]
Q4. Compare a Python list and a Pandas Series. [2 marks, BL 2, CO 2, PO 5, PI 5.4.1]
Q5. How would you display the first five rows of a DataFrame? [2 marks, BL 2, CO 2, PO 5, PI 5.4.1]

Part – B (3 x 5 = 15 Marks)
(Each question carries: Marks, BL, CO, PO, PI Code.)

Q1. Explain the different facets of data in Data Science with suitable examples. [5 marks, BL 2, CO 1, PO 5, PI 5.4.2]

Q2. Given the following dataset stored in sales_data.csv: [5 marks, BL 3, CO 2, PO 5, PI 5.5.1]

    Product  Category     Sales
    A        Electronics  1000
    B        Clothing     500
    C        Electronics  1200
    D        Clothing     700
    E        Grocery      300

Write a Python program to:
- Read the CSV file into a DataFrame
- Find the total sales per category
- Find the average sales per category

Q3. Explain different types of data acquisition techniques used in Data Science. [5 marks, BL 2, CO 2, PO 5, PI 5.4.2]

Course Outcome (CO) and Bloom's Level (BL) Coverage in Questions
[Charts: CO coverage for CO1 and CO2; BL coverage – BL1: 16%, BL2: 48%, BL3: 36%.]

Key:

1. What is the goal of the "exploratory data analysis" phase?

Exploratory Data Analysis (EDA) is an important first step in data science. Its goal is to gain insights by examining and visualizing the data: understanding its main features, finding patterns, spotting anomalies, validating assumptions, and discovering how different parts of the data are connected, before applying any machine learning models or statistical techniques. (2 marks)

2. Write the syntax to create a 1D NumPy array from a Python list.

import numpy as np
# Creating a 1D NumPy array from a Python list
my_list = [1, 2, 3, 4, 5]          # (1 mark)
np_array = np.array(my_list)       # (1 mark)
print(np_array)

3. Why are NumPy arrays more efficient than Python lists for numerical operations?

NumPy is faster and more memory-efficient than Python lists because of contiguous memory storage, vectorized operations (operations are applied to all elements of an array without explicit Python loops), broadcasting, and an optimized C-based backend (it uses BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage), which are highly optimized C libraries). Any two explanations, 1 mark each. A small timing sketch is shown below.
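A small, hedged benchmark sketch (not part of the original key) supporting the claim above: squaring a million numbers with a Python list comprehension versus a single vectorized NumPy operation.

import time
import numpy as np

n = 1_000_000
lst = list(range(n))
arr = np.arange(n)

t0 = time.perf_counter()
lst_squared = [x * x for x in lst]   # element-by-element Python loop
t1 = time.perf_counter()
arr_squared = arr * arr              # one vectorized operation executed in C
t2 = time.perf_counter()

print(f"list comprehension: {t1 - t0:.4f} s")
print(f"NumPy vectorized:   {t2 - t1:.4f} s")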

4. Compare a Python list and a Pandas Series.

    Feature          Python list                Pandas Series
    Missing values   Must be handled manually   Built-in support for NaN
    Performance      Slower                     Faster
    Memory usage     Higher                     Lower
    Indexing         Integer-based indexing     Supports custom indexing

Any two differences, 1 mark each.
5. How would you display the first five rows of a DataFrame?

The first five rows of a Pandas DataFrame can be displayed using the .head() method, e.g. df.head(). (2 marks)

Part B

1. Explain the different facets of data in Data Science with suitable examples.

Big data and data science generate very large amounts of data of various types. The main categories of data are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Each carries 1 mark with an appropriate explanation (any five).

2. Given the following dataset stored in sales_data.csv:

    Product  Category     Sales
    A        Electronics  1000
    B        Clothing     500
    C        Electronics  1200
    D        Clothing     700
    E        Grocery      300

Write a Python program to:
- Read the CSV file into a DataFrame
- Find the total sales per category
- Find the average sales per category

Answer:
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("sales_data.csv")

# Find the total sales per category
total_sales = df.groupby("Category")["Sales"].sum()

# Find the average sales per category
average_sales = df.groupby("Category")["Sales"].mean()

# Display results
print("Total Sales per Category:")
print(total_sales)
print("\nAverage Sales per Category:")
print(average_sales)

Marking: 1 mark – reading the CSV file; 2 marks – total sales per category; 2 marks – average sales per category.

3. Explain different types of data acquisition techniques used in Data Science.

Answer: Data acquisition in Data Science primarily involves methods to collect raw data from various sources, including sensors, databases, APIs, and manual inputs.

Data collection methods are grouped into primary data and secondary data.

Primary data:
- Direct Personal Investigation
- Indirect Oral Investigation
- Information from Local Sources or Correspondents
- Information through Questionnaires and Schedules
- Mailing Method
- Enumerator's Method
Any 3 methods with explanation: 3 x 1 = 3 marks.

Secondary data:
- Published Sources (Government Publications, Semi-Government Publications, Publications of Trade Associations, Journals and Papers, International Publications, Publications of Research Institutions)
- Unpublished Sources (organizations usually collect such data for their own use and do not publish it anywhere)
- Web Scraping
Any two with explanation: 2 x 1 = 2 marks.
Register Number: __________                                        Set - D

SRM Institute of Science and Technology
College of Engineering and Technology, School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (Even)

Test: FT1                                          Date: 25-02-2025
Course Code & Title: 21CSS303T - Data Science      Duration: 50 Minutes
Year & Sem: III Year / VI Sem                      Max. Marks: 25

Course Articulation Matrix:
Course Outcome  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1              -   -   -   -   1   -   -   -   -   -    -    -
CO2              -   -   -   -   1   -   -   -   -   -    -    -
Note: CO1 - To understand the relationship between data
      CO2 - Identify the different data structures to represent data

Part – A (5 x 2 = 10 Marks)
Answer ALL the questions.
(Each question carries: Marks, BL, CO, PO, PI Code.)
Q1. What are the uses of NumPy? [2 marks, BL 1, CO 1, PO 5, PI 5.6.1]

- NumPy is used for numerical computations in Python.
- It provides support for large, multi-dimensional arrays and matrices.
- It offers mathematical functions for linear algebra, statistical operations, and Fourier transforms.
- It enhances performance due to its efficient memory usage and vectorized operations.

Q2. How do you search for a specific value in a NumPy array? [2 marks, BL 3, CO 1, PO 5, PI 5.4.1]

import numpy as np
arr = np.array([10, 20, 30, 40, 50])
index = np.where(arr == 30)
print(index)   # Output: (array([2]),)

result = arr[arr == 30]
print(result)  # Output: [30]

Q3. Which function is used to join arrays along a specific axis? [2 marks, BL 2, CO 2, PO 5, PI 5.4.1]

np.concatenate() joins arrays along a specified axis; hstack() and vstack() join along the horizontal and vertical axes respectively.

import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
result = np.concatenate((a, b), axis=0)
print(result)
Q4. List out the advantages of web scraping. [2 marks, BL 2, CO 2, PO 5, PI 5.6.1]

- Automates data collection from websites.
- Helps in price comparison and market analysis.
- Enables real-time data updates for applications.
- Assists in sentiment analysis and business intelligence.
- Extracts structured data for research purposes.

Q5. How do you sort a Pandas DataFrame based on multiple columns? Explain with an example. [2 marks, BL 3, CO 2, PO 5, PI 5.4.1]

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)
sorted_df = df.sort_values(by=['Age', 'Salary'], ascending=[True, False])
print(sorted_df)

This sorts first by Age in ascending order and then by Salary in descending order.

Part – B (3 x 5 = 15 Marks)
(Each question carries: Marks, BL, CO, PO, PI Code.)


Q6. Explain the different phases in the Data Science Process. Discuss how each phase contributes to solving a real-world problem. [5 marks, BL 2, CO 1, PO 5, PI 5.4.2]

- Problem Definition: identify the objective (e.g., predicting sales).
- Data Collection: gather relevant data (e.g., customer transactions).
- Data Cleaning: remove inconsistencies and handle missing values.
- Exploratory Data Analysis (EDA): identify trends and patterns.
- Model Building: train machine learning models.
- Model Evaluation: validate accuracy using metrics such as RMSE or accuracy score.
- Deployment & Monitoring: implement the model and refine it based on real-world feedback.

A description of each phase is to be included.


Q7. You are developing a price comparison tool to track the price of a specific product (e.g., "iPhone 15" or "Samsung Galaxy S23") from multiple e-commerce websites such as Amazon, eBay, and Walmart. Explain the key steps involved in performing web scraping for this task, covering aspects such as identifying the target websites, extracting the relevant data, handling dynamic content, and storing the collected information for further analysis. [5 marks, BL 3, CO 2, PO 5, PI 5.5.1]

- Identify Target Websites: select Amazon, eBay, Walmart, etc.
- Inspect Website Structure: use browser developer tools to locate the price-related HTML elements.
- Extract Data: use Python libraries such as requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/product"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
price = soup.find('span', {'class': 'price'}).text
print(price)

- Handle Dynamic Content: use Selenium if the data is loaded via JavaScript.
- Store Data: save the results in a CSV file, database, or cloud storage for analysis.

Q9. Given the DataFrame below: [5 marks, BL 4, CO 2, PO 5, PI 5.5.1]

df = pd.DataFrame({'ID': [101, 102, 103, 104],
                   'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40],
                   'Salary': [50000, 60000, 70000, 80000]})

- Select the rows where the 'Age' is greater than 30.
- Select the 'Name' and 'Salary' columns for the first two rows.
- Select all rows except for the last one.

import pandas as pd
df = pd.DataFrame({'ID': [101, 102, 103, 104],
                   'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40],
                   'Salary': [50000, 60000, 70000, 80000]})

# Select rows where 'Age' > 30
result1 = df[df['Age'] > 30]
print(result1)

# Select 'Name' and 'Salary' for the first two rows
result2 = df.loc[:1, ['Name', 'Salary']]
print(result2)

# Select all rows except the last one
result3 = df.iloc[:-1]
print(result3)
Register
Number

SRM Institute of Science and Technology


Set - A
College of Engineering and Technology
School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN)

Test: FT4 Date: 29-04-2025


Course Code & Title: 21CSS303T-Data Science Duration: Two periods
Year& Sem: III Year /VI Sem Max.Marks:50

Course Articulation Matrix:


Course Outcome  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO3              -   -   -   -   1   -   -   -   -   -    -    -
CO4              -   -   -   -   1   -   -   -   -   -    -    -
CO5              -   -   -   -   1   -   -   -   -   -    -    -
Note: CO3 – To identify data manipulation and cleaning techniques using pandas
      CO4 – To construct the graphs and plots to represent the data using Python packages
      CO5 – To apply the principles of data science techniques to predict and forecast the outcome of real-world problems
Part – A (10 x 1 = 10 Marks)
Instructions:
1) Answer ALL questions.
2) The duration for answering Part A is 15 minutes (this sheet will be collected after 15 minutes).
3) Encircle the correct answer.
S.No | Question | Marks | BL | CO | PO | PI Code
1. The ----------- method is used to replace, predict, or create the missing values. [1 mark, BL 1, CO 3, PO 5, PI 1.4.1]
   A. Permutation
   B. Deletion
   C. Imputation
   D. Updation

2. Consider the code below. What is the purpose of the reset_index() function in it? [1 mark, BL 1, CO 3, PO 5, PI 1.4.1]

   melted_df = pivot_df.reset_index().melt(id_vars='Date',
                                           var_name='City', value_name='Sales')
   print(melted_df)

   A. To rename the index
   B. To drop the index entirely
   C. To convert the index into a column
   D. To sort the DataFrame

3 What change should be made to the following code to perform column-wise 1 1 3 5 1.4.1
concatenation? concat_df = pd.concat([df1, df2], -----------)
A.concat_df = pd.concat([df1, df2], axis=2)
B.concat_df = pd.concat([df1, df2], axis=1)
C.concat_df = pd.concat([df1, df2], axis=0)
D.concat_df = pd.concat([df1, df2], axis=’TRUE’)

4 Which of the following libraries is not primarily involved in handling large 1 2 3 5 1.4.1
volumes of data?
A. Cython
B. Numexpr
C. Numba
D.Seaborn
5 Which of the following statements is true regarding data structures? 1 2 3 5 1.4.1
A) Data structures have the same storage requirements for all types.
B) Data structures influence the performance of CRUD operations (create,
read, update, and delete).
C) Data structures only affect the storage and not the performance of
operations.
D) Data structures do not affect the performance of CRUD operations.

6 Which of the following is the correct syntax for creating a subplot with 2 1 1 4 5 1.4.1
rows and 3 columns in the first position?
A) plt.subplot(2, 3, 0)
B) plt.subplot(3, 2, 1)
C) plt.subplot(2, 3, 1)
D) plt.subplot(1, 2, 3)

7 Which parameter is used to create 100 evenly spaced values between 0 and 1 1 4 5 1.4.1
10?
A) np.linspace(0, 10, 100)
B) np.linspace(0, 100, 10)
C) np.linspace(0, 10, 100)
D)np.linspace(10, 100, 0)

8 What does the following matplotlib code do? 1 1 4 5 1.4.1


plt.annotate('Peak Point',
xy=(6, 15),
xytext=(4, 17),
fontsize=12, color='blue')
A) Adds the text Peak Point at point (6, 15) in blue font.
B) Displays Peak Point directly at point (6, 15) without any arrow.
C) Adds a blue annotation with the text Peak Point, pointing from (4, 17) to
(6, 15).
D) Draws a blue line between (4, 17) and (6, 15) and places Peak Point on it.

9 Which of the following best describes the purpose of GridSpec in data 1 2 5 5 1.4.1
visualization?
A) Group data by a categorical variable and create subplots for each
category.
B) Visualize the relationship between two variables along with their
distributions.
C) Create custom grid layouts for organizing multiple subplots.
D) Plot the relationships between all numeric column pairs in a DataFrame.

10 What does the height of a bar in a histogram represent? 1 2 5 5 1.4.1


A) The total number of data points
B) The frequency of values within the interval
C) The range of values in the data
D) The cumulative frequency of the data
Register Number: __________

SRM Institute of Science and Technology
College of Engineering and Technology, School of Computing        Set -
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN SEM)

Test: FT4                                          Date: 29-04-2025
Course Code & Title: 21CSS303T - Data Science      Duration: Two periods
Year & Sem: III Year / VI Sem                      Max. Marks: 50

Part – B (4 x 5 = 20 Marks)
Instructions: Answer ANY FOUR questions.
(Each question carries: Marks, BL, CO, PO, PI Code.)

Q11. Write a Python program to do the following: [5 marks, BL 2, CO 3, PO 5, PI 1.4.1]
a. Replace all missing (NaN) values in the Name column with the string 'Unknown'.
b. Replace all missing (NaN) values in the Age column with the mean of the available age values.
c. Add a new column named City and fill it with any default or custom city names for each student.
d. Print the final cleaned DataFrame.

    Name    Age
    Bob     24
    NaN     25
    Sweety  NaN
    Rita    26

import pandas as pd
import numpy as np

# Create the initial DataFrame


data = {
'Name': ['Bob', np.nan, 'Sweety', 'Rita'],
'Age': [24, 25, np.nan, 26]
}

df = pd.DataFrame(data)

# Step 1: Replace NaN in 'Name' with 'Unknown'


df['Name'].fillna('Unknown', inplace=True)

# Step 2: Replace NaN in 'Age' with the mean age


mean_age = df['Age'].mean()
df['Age'].fillna(mean_age, inplace=True)

# Step 3: Add a new column 'City' with default value (e.g., 'Delhi')
df['City'] = ['Delhi', 'Mumbai', 'Pune', 'Chennai'] # You can customize this

# Display the cleaned DataFrame


print(df)
Q12. What is the purpose of the pandas.merge() function in Python? (2.5 marks) Explain its use with an example by merging two DataFrames on a common column using Python (2.5 marks). [5 marks, BL 3, CO 3, PO 5, PI 1.4.1]

pandas.merge() connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.

Sample example:

import pandas as pd

# Create the first DataFrame: student details
df_students = pd.DataFrame({
    'Student_ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})

# Create the second DataFrame: student scores


df_scores = pd.DataFrame({
'Student_ID': [1, 2, 3],
'Marks': [85, 90, 88]
})

# Merge the two DataFrames on the 'Student_ID' column


merged_df = pd.merge(df_students, df_scores, on='Student_ID')
# Print the merged DataFrame
print(merged_df)

Q13. What is reshaping in pandas, and what are the main methods used for reshaping a DataFrame? [5 marks, BL 2, CO 3, PO 5, PI 1.4.1]

Reshaping in pandas refers to changing the structure or layout of a DataFrame. The main methods are:
- Pivoting data (pivot() and pivot_table()): pivoting rearranges data by turning unique values into columns.
- Melting data (melt()): the opposite of pivoting, it converts wide data into long format.
- Stacking (stack()): converts columns into a hierarchical index (multi-index rows).

A minimal illustrative sketch follows.
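A hedged sketch (not part of the original key) of the three reshaping methods above, using a small made-up sales table.

import pandas as pd

long_df = pd.DataFrame({
    "Date": ["2024-01", "2024-01", "2024-02", "2024-02"],
    "City": ["Chennai", "Mumbai", "Chennai", "Mumbai"],
    "Sales": [100, 150, 120, 130],
})

# pivot(): long -> wide (one column per City)
wide_df = long_df.pivot(index="Date", columns="City", values="Sales")
print(wide_df)

# melt(): wide -> long again
melted = wide_df.reset_index().melt(id_vars="Date",
                                    var_name="City", value_name="Sales")
print(melted)

# stack(): move the columns into a hierarchical row index
stacked = wide_df.stack()
print(stacked)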

Q14. Write a Matplotlib program to rotate tick labels on the x and y axes. [5 marks, BL 3, CO 4, PO 5, PI 1.4.1]


import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 40]

# Create the plot


plt.plot(x, y)

# Set labels
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.title('Rotating Tick Labels Example')
# Rotate x-axis tick labels by 45 degrees
plt.xticks(rotation=45)

# Rotate y-axis tick labels by 90 degrees


plt.yticks(rotation=90)

# Show the plot


plt.tight_layout() # Adjust layout to prevent label cutoff
plt.show()

Q15. What are the various types of annotations in Matplotlib? Give the syntax of annotation. [5 marks, BL 3, CO 5, PO 5, PI 1.4.1]

- Text annotations are used to add explanatory or descriptive text to specific points, regions, or features within a plot.
- Marker annotations involve placing markers or symbols on specific points of interest within a plot to highlight or provide additional information about those points.
- Callouts are a specific type of annotation that uses visual elements such as arrows, lines, or text to draw attention to a particular area or feature within a plot.

A sketch of the syntax follows.
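The key lists the annotation types but not the syntax asked for; the following is a hedged sketch of the general form, plt.annotate(text, xy=(x, y), xytext=(x_text, y_text), arrowprops=dict(...)), together with a text and a marker annotation on arbitrary sample data.

import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4, 5, 6], [3, 7, 9, 12, 14, 15])

# Text annotation at a fixed position
plt.text(1.5, 13, "Sales trend", fontsize=10)

# Callout: arrow pointing from the label position to the data point
plt.annotate("Peak Point",
             xy=(6, 15),                 # point being annotated
             xytext=(4, 16),             # where the label text sits
             arrowprops=dict(facecolor="blue", arrowstyle="->"),
             fontsize=12, color="blue")

# Marker annotation: highlight the point itself
plt.plot(6, 15, marker="o", color="red")
plt.show()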

Part – C (2 x 10 = 20 Marks)
Instructions: Answer ALL questions.
(Each question carries: Marks, BL, CO, PO, PI Code.)
Q16 a. [10 marks, CO 3, PO 5, PI 1.4.1]
(i) Explain outliers and their types. (5 marks)

Outliers (noise) are data points that deviate significantly from the norm. Outliers can be single data points or a subset of observations called a collective outlier. Outlier data points can greatly impact the accuracy and reliability of statistical analyses and machine learning models. Outliers are also called abnormalities, discordant observations, deviants, or anomalies.

Types of outliers:
- Global outliers
  - Global outliers are isolated data points that are far away from the main body of the data.
  - They are often easy to identify and remove.
- Contextual outliers
  - Contextual outliers are data points that are unusual in a specific context but may not be outliers in a different context.
  - They are often more difficult to identify and may require additional information or domain knowledge to determine their significance.

(ii) We create a pandas DataFrame from a dictionary that holds the student data: student's ID, first name, last name, and grade. (5 marks)
a. Combine First Name and Last Name into a new column called Full Name.
b. Display only the First Name and Grade columns.
c. Identify and display students who received a grade 'A'.
d. Create a new column Updated Grade, where every 'B' grade is replaced with 'A'.

import pandas as pd

# Create a dictionary of student data


data = {
'ID': [101, 102, 103, 104, 105],
'First Name': ['John', 'Jane', 'Jim', 'Jill', 'Jack'],
'Last Name': ['Doe', 'Smith', 'Beam', 'Hill', 'Black'],
'Grade': ['A', 'B', 'A', 'C', 'B']
}

# Create a DataFrame from the dictionary


students = pd.DataFrame(data)

# a. Combine First Name and Last Name into a new column called Full Name
students['Full Name'] = students['First Name'] + ' ' + students['Last Name']
print("DataFrame with Full Name:\n", students, "\n")

# b. Display only the First Name and Grade columns
print("First Name and Grade columns:\n", students[['First Name', 'Grade']], "\n")

# c. Identify and display students who received a grade 'A'
grade_A_students = students[students['Grade'] == 'A']
print("Students with Grade 'A':\n", grade_A_students, "\n")

# d. Create a new column Updated Grade, where every 'B' grade is replaced with 'A'
students['Updated Grade'] = students['Grade'].replace('B', 'A')
print("DataFrame with Updated Grade:\n", students)
(OR)

Q16 b. [10 marks, BL 3, CO 3, PO 5, PI 1.4.1]
(i) Explain standardization and its types. (5 marks)

- Standardization is a common preprocessing technique in data science that transforms numerical data to have a mean of 0 and a standard deviation of 1.
- It is particularly useful when dealing with features that have different scales or units, as it ensures that all features contribute equally to the model.
- Equalizes feature importance: standardization prevents features with larger magnitudes from dominating the model, ensuring that all features are treated fairly.
- Improves model performance: many machine learning algorithms, especially those based on distance or gradient calculations, benefit from standardized data.
- Compatibility with certain algorithms: some algorithms, such as K-Nearest Neighbors and Support Vector Machines, assume standardized data.

The two types of standardization are Z-score and Min-max scaling:

• Z-score normalization is a data preprocessing technique


that transforms numerical data to have a mean of 0 and a
standard deviation of 1. This is particularly useful when
dealing with features that have different scales or units, as
it ensures that all features contribute equally to the model.
• The formula used is:
z = (x - mean) / standard_deviation
• where:
• z is the normalized value.
• x is the original value.
• mean is the mean of the dataset.
• standard_deviation is the standard deviation of the
dataset.
• Min-max normalization is a data preprocessing technique
that scales numerical data to a specific range, typically
between 0 and 1. It's useful when you want to preserve the
relative distances between data points while ensuring that
all features have a similar scale.
• The formula used is:
x_scaled = (x - min(x)) / (max(x) - min(x))
where:
• x_scaled is the normalized value.
• x is the original value.
• min(x) is the minimum value in the dataset.
• max(x) is the maximum value in the dataset.
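A hedged numerical sketch (not from the key) of the two formulas above, applied to a small sample with plain NumPy; scikit-learn's StandardScaler and MinMaxScaler give equivalent results.

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Z-score standardization: mean 0, standard deviation 1
z = (x - x.mean()) / x.std()
print(z)

# Min-max normalization: rescale to the [0, 1] range
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)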

(ii) You are provided with a dataset of student records in a Python program using the pandas library. The dataset includes the following fields: ID, First Name, Last Name, and Grade. Write a Python program to perform the following tasks. (5 marks)

a. Convert the First Name column to uppercase and store it in a new column Full Name Upper.
b. Create a new column named Formatted Info which contains the information in the format "Full Name, Grade: <Grade>".
c. Count and display the number of students who originally received a grade 'B'.
d. Calculate and display the length (number of characters) of each student's full name in a new column called Full Name Length.

import pandas as pd

# Sample dataset
data = {
'ID': [1, 2, 3, 4],
'First Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Last Name': ['Smith', 'Jones', 'Brown', 'Taylor'],
'Grade': ['A', 'B', 'B', 'C']
}

# Create DataFrame
df = pd.DataFrame(data)

# a. Convert First Name to uppercase and store in new column 'Full Name Upper'
df['Full Name Upper'] = df['First Name'].str.upper() + " " + df['Last Name'].str.upper()

# b. Create 'Formatted Info' column
df['Formatted Info'] = df['First Name'] + " " + df['Last Name'] + ", Grade: " + df['Grade']

# c. Count and display number of students with grade 'B'
count_b = (df['Grade'] == 'B').sum()
print(f"Number of students who received grade 'B': {count_b}")

# d. Calculate and store the length of full names
df['Full Name Length'] = (df['First Name'] + " " + df['Last Name']).str.len()

# Display the updated DataFrame


print("\nUpdated DataFrame:")
print(df)

Q17 a. [10 marks, BL 2, CO 4, PO 5, PI 1.4.1]
(i) Explain the features of Seaborn. (5 marks)
• Statistical Graphics: Seaborn is specifically designed for
creating statistical graphics, providing built-in functions for
common visualizations like scatter plots, line plots, histograms,
and more. This makes it easier to create visually appealing and
informative plots for data analysis.
• Data Visualization Themes: Seaborn offers pre-defined styles
and themes that can quickly change the overall appearance of
your plots. This helps create consistent and aesthetically
pleasing visualizations without requiring extensive
customization.
• Integration with Pandas and NumPy: Seaborn seamlessly
integrates with Pandas and NumPy, making it easy to work
with dataframes and arrays directly. This simplifies the
workflow and reduces the amount of code needed for data
analysis and visualization.
• FacetGrid and Pair Plots: Seaborn provides FacetGrid for
grouping data and creating subplots based on categorical
variables. This is useful for comparing distributions or
relationships across different groups. Pair plots allow you to
visualize the relationships between all pairs of numeric
columns in a DataFrame, helping you identify correlations and
patterns.
• Customization and Flexibility: While Seaborn provides a
high-level interface, it's built on top of Matplotlib, giving you
access to its extensive customization options. This allows you
to fine-tune your plots to meet your specific needs.
• Ease of Use: Seaborn's API is designed to be user-friendly and
intuitive, making it easier to learn and use compared to
Matplotlib. Its documentation is also well-written and provides
clear examples.

(ii) Illustrate a Python program that uses Matplotlib to compare sales of two products over six months using a line chart. Each product is represented with a unique marker and labelled for clarity. A legend is added to distinguish between the products visually. Ticks are used to specify the range of values. (5 marks)

import matplotlib.pyplot as plt

# Sample data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
product_a_sales = [20, 35, 30, 35, 27, 40]
product_b_sales = [25, 32, 34, 20, 25, 30]

# Create the plot
plt.plot(months, product_a_sales, marker='o', label='Product A')  # circle markers
plt.plot(months, product_b_sales, marker='s', label='Product B')  # square markers

# Title and labels


plt.title('Sales Comparison Over 6 Months')
plt.xlabel('Month')
plt.ylabel('Sales')

# Set y-axis ticks at every 5 units


plt.yticks(range(0, 51, 5))

# Add a legend
plt.legend()

# Show the plot


plt.grid(True)
plt.show()

(OR)
Q17 b. Give your own Seaborn library example for a 3D line plot, 3D scatter plot, and 3D surface plot. Draw the output for each example. [10 marks, BL 3, CO 5, PO 5, PI 1.4.1]

import seaborn as sns                    # not directly used for surface plots
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

# Sample data (ensure x and y are 2D for the surface plot)
x = np.linspace(0, 5, 10)                # equally spaced points from 0 to 5
y = np.linspace(0, 5, 10)
X, Y = np.meshgrid(x, y)                 # 2D grid from x and y for surface evaluation

def f(x, y):
    return x**2 + y**2                   # replace with your desired function

# Calculate z values based on the function
z = f(X, Y)

# Create a 3D figure and axes
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(X, Y, z, cmap='viridis', linewidth=0, antialiased=True)
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
plt.title('3D Surface Plot')

# Customize viewing angle (optional)
ax.view_init(elev=20, azim=45)

# Show the plot
plt.show()

The key shows only the surface plot; a sketch of the remaining two plots follows.
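A hedged sketch (not part of the original key) of the two remaining requested plots, a 3D line plot and a 3D scatter plot. Seaborn styling is applied, but the 3D axes themselves come from Matplotlib's mplot3d toolkit.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme()                          # seaborn look and feel

t = np.linspace(0, 4 * np.pi, 200)
fig = plt.figure(figsize=(10, 4))

# 3D line plot: a helix
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot(np.cos(t), np.sin(t), t)
ax1.set_title('3D Line Plot')

# 3D scatter plot: random points coloured by height
rng = np.random.default_rng(0)
xs, ys, zs = rng.random((3, 50))
ax2 = fig.add_subplot(122, projection='3d')
ax2.scatter(xs, ys, zs, c=zs, cmap='viridis')
ax2.set_title('3D Scatter Plot')

plt.tight_layout()
plt.show()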
Course Outcome (CO) and Bloom's Level (BL) Coverage in Questions
[Chart: CO coverage – 55% and 45%.]
Register Number: __________

SRM Institute of Science and Technology
College of Engineering and Technology, School of Computing        Set -
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN)

Test: FT4                                          Date: 29-04-2025
Course Code & Title: 21CSS303T - Data Science      Duration: Two periods
Year & Sem: III Year / VI Sem                      Max. Marks: 50

Course Articulation Matrix:
Course Outcome  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO3              -   -   -   -   1   -   -   -   -   -    -    -
CO4              -   -   -   -   1   -   -   -   -   -    -    -
CO5              -   -   -   -   1   -   -   -   -   -    -    -
Note: CO3 – To identify data manipulation and cleaning techniques using pandas
      CO4 – To construct the graphs and plots to represent the data using Python packages
      CO5 – To apply the principles of data science techniques to predict and forecast the outcome of real-world problems

Part – A (10 x 1 = 10 Marks)
Instructions:
1) Answer ALL questions.
2) The duration for answering Part A is 15 minutes (this sheet will be collected after 15 minutes).
3) Encircle the correct answer.
(Each question carries: Marks, BL, CO, PO, PI Code.)
1. Which of the following is NOT a commonly used tool for data wrangling? [1 mark, BL 1, CO 3, PO 5, PI 2.1.3]
   a) Pandas
   b) NumPy
   c) Matplotlib
   d) OpenRefine
2 Techniques used to handle the missing values? 1 1 3 5 2.1.3
a) permutation and imputation
b) insertion and deletion
c) imputation and deletion
d) insertion and deletion
3 A cricket analyst is looking at a player's scores from 10 matches: 1 2 3 5 2.1.3
45, 50, 60, 55, 48, 52, 49, 51, 47, 200. When should the score of 200
be considered an outlier and possibly removed from analysis?

a) When it is much higher than all the other scores


b) When it is the highest score ever in cricket
c) When it helped win the match
d) When it is lower than all the other scores
4 What is the primary condition that requires reshaping or pivoting a 1 2 3 5 2.1.3
dataset?

a) When the data is already in a well-organized format with one row


for each observation.
b) When the data is in a wide format with many columns and needs to
be transformed into a fewer columns.
c) When the data contains only numerical values and no categorical
columns.
d) When the dataset is too small to analyse effectively.
5 Which is not a Data transformation technique? 1 1 3 5 2.1.3
a) Attribute Construction
b) Smoothing
c) Data augmentation
d) data Discretization
6 Which of the Following statement true about Seaborn Library in 1 1 4 5 2.1.3
python?
a) Provide High level interface with less complex syntax and default
themes
b) Provides High level interface with high level complex syntax and
customizable themes.
c) Provides interactive visualization library with complex syntax
d) Provide Interactive and web-ready visualization with no themes.
7. A teacher collected data on the number of hours students studied for a math test and their corresponding test scores. She plotted this data on a scatter plot, where the x-axis represents hours studied and the y-axis represents test scores. The scatter plot showed a cluster of points that generally increased from left to right. Which conclusion follows? [1 mark, BL 2, CO 4, PO 5, PI 2.1.3]
   a) There is a negative correlation between hours studied and test scores.
   b) The scatter plot shows no relationship between hours studied and test scores.
   c) There is a positive correlation between hours studied and test scores.
   d) Students who studied less always scored higher than those who studied more.
8 Which situation appropriate to use 3D plot? 1 2 4 5 2.1.3

a) It allows the analyst to ignore one of the variables and focus only
on two.
b) A 3D plot helps display the relationship between all three variables
simultaneously.
c) 3D plots are only used for representing time series data.
d) It makes the data look more attractive, even if it doesn’t add any
analytical value.
9 A school wants to compare the math test scores of students from three 1 2 5 5 2.1.3
different classes (Class A, Class B, and Class C). The data science
teacher uses Matplotlib to create a box plot for each class.What is the
main reason for using a box plot in this situation?

a) To show the relationship between two continuous variables.


b) To compare the spread, central tendency, and outliers of scores
across the three classes.
c) To identify the exact score of each student in each class.
d) To visualize the trend of test scores over time.
Register Number: __________

SRM Institute of Science and Technology
College of Engineering and Technology, School of Computing        Set -
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN SEM)

Test: FT4                                          Date: 29-04-2025
Course Code & Title: 21CSS303T - Data Science      Duration: Two periods
Year & Sem: III Year / VI Sem                      Max. Marks: 50

Part – B (4 x 5 = 20 Marks)
Instructions: Answer ANY FOUR questions.
(Each question carries: Marks, BL, CO, PO, PI Code.)

Q11. List out different approaches used to combine different datasets, with examples. [5 marks, BL 2, CO 3, PO 5, PI 2.1.2]

Approaches to combine different datasets:
1. Concatenation (vertical/horizontal)
   - Example: pd.concat([df1, df2], axis=0) (vertical) or pd.concat([df1, df2], axis=1) (horizontal)
2. Merging (SQL-style joins)
   - Example: pd.merge(df1, df2, on='common_column', how='inner')
3. Joining
   - Example: df1.join(df2, on='common_column', how='left')
4. Appending
   - Example: df1.append(df2, ignore_index=True)
5. Union
   - Example: combining rows from two datasets with the same columns: pd.concat([df1, df2], axis=0, ignore_index=True)
6. Cross join
   - Example: using a Cartesian product to combine datasets: df1.merge(df2, how='cross')
7. Concatenation by index
   - Example: df1.append(df2, ignore_index=False)

Q12. What are the conditions used to choose a data binning technique? Give an example. [5 marks, BL 3, CO 3, PO 5, PI 2.1.2]

Conditions for choosing a data binning technique:
1. Nature of the data
   - Uniform data: equal-width binning
   - Skewed data: equal-frequency binning
2. Number of bins
   - Fixed number of bins: equal-width or equal-frequency binning
   - Adaptive binning: custom binning or clustering-based binning
3. Distribution of the data
   - Normal distribution: equal-width binning
   - Non-normal distribution: equal-frequency binning
4. Handling outliers
   - Outlier-prone data: adaptive binning or clustering-based binning
5. Interpretability of bins
   - Interpretable bins: custom binning based on domain knowledge

Example (equal-width binning):
import numpy as np
data = np.random.normal(0, 1, 1000)
bins = np.linspace(min(data), max(data), 6)  # 5 equal-width bins

Q13. What are the methods used to categorize noise and outliers in a dataset? [5 marks, BL 2, CO 3, PO 5, PI 2.2.3]

Methods to categorize noise and outliers in a dataset:
1. Statistical methods:
   - Z-score (standard deviation method)
   - IQR (interquartile range) method
   - Modified Z-score
2. Visual methods:
   - Box plot
   - Scatter plot
   - Histogram
3. Machine learning methods:
   - DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
   - Isolation Forest
   - One-Class SVM
4. Domain knowledge:
   - Expert-defined thresholds or rules for outlier detection
5. Proximity-based methods:
   - k-Nearest Neighbors (k-NN)
   - Local Outlier Factor (LOF)

These methods help identify and categorize noise and outliers based on statistical properties, clustering, or domain-specific rules. A short illustrative sketch follows.
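A hedged sketch (not part of the original key) of two of the statistical methods listed above, Z-score and IQR, applied to a small sample with one obvious outlier.

import numpy as np
import pandas as pd

scores = pd.Series([45, 50, 60, 55, 48, 52, 49, 51, 47, 200])

# Z-score method: flag points more than 2 standard deviations from the mean
z = (scores - scores.mean()) / scores.std()
print("Z-score outliers:\n", scores[z.abs() > 2])

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
mask = (scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)
print("IQR outliers:\n", scores[mask])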

Q14. Write the Python code to plot a 3D plot and a scatter plot using Matplotlib. [5 marks, BL 3, CO 4, PO 5, PI 2.2.3]
3D plot code:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

# Create data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
x, y = np.meshgrid(x, y)
z = np.sin(np.sqrt(x**2 + y**2))

# Create 3D plot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')

# Plotting the surface


ax.plot_surface(x, y, z, cmap='viridis')
# Labels and title
ax.set_xlabel('X Axis')
ax.set_ylabel('Y Axis')
ax.set_zlabel('Z Axis')
ax.set_title('3D Surface Plot')

# Show plot
plt.show()

Scatter plot code:
# Create data for scatter plot
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]

# Create scatter plot


plt.scatter(x, y, color='blue', label='Data points')

# Adding labels and title


plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.title('2D Scatter Plot')
plt.legend()

# Show plot
plt.show()
Q15. What are the different annotations used when plotting subplots in Matplotlib? Give examples. [5 marks, BL 3, CO 4, PO 5, PI 2.2.3]

Annotations in Matplotlib for subplots

In Matplotlib, annotations are used to add text, arrows, and other labels to a plot. The following types of annotations are commonly used when plotting subplots:

plt.text() – adds text at a specific (x, y) position on the plot.
Example: plt.text(2, 3, 'This is a point', fontsize=12, color='red')

plt.annotate()

 Adds an annotation with optional arrows, highlighting


specific data points.

Example: plt.annotate('Peak Point', xy=(3, 5), xytext=(4, 6),


arrowprops=dict(facecolor='blue', arrowstyle='->'))

ax.text()

 Similar to plt.text(), but used within a specific Axes


object for subplots.

Example: fig, ax = plt.subplots()

ax.text(2, 3, 'Subplot Text', fontsize=12, color='green')


ax.annotate()

 Used for subplots to annotate with arrows or text at a given


data point.

Example: fig, ax = plt.subplots()

ax.plot([1, 2, 3], [4, 5, 6])

ax.annotate('Max Point', xy=(3, 6), xytext=(2, 5),


arrowprops=dict(facecolor='red', arrowstyle='->'))

Explanation includes:

 plt.text() and ax.text() are used to add simple text


annotations at specific locations.
 plt.annotate() and ax.annotate() are used for adding
more complex annotations with arrows and labels pointing
to specific data points.

Part – C (2 x 10 = 20 Marks)
Instructions: Answer ALL questions.
(Each question carries: Marks, BL, CO, PO, PI Code.)
Q16 a. Discuss various data transformation techniques in detail with examples. [10 marks, BL 2, CO 3, PO 5, PI 2.2.3]

Answer key: discussion of transformation techniques with examples (an illustrative sketch follows the list below).

List of Data Transformation Techniques Used in Data Science:

 Normalization
 Standardization
 Log Transformation
 Power Transformation
 Binning (Discretization)
 Encoding Categorical Variables
 One-Hot Encoding
 Label Encoding
 Feature Scaling
 Quantile Transformation
 PCA (Principal Component Analysis)
 Polynomial Transformation
 Handling Skewed Data
 Text Vectorization (TF-IDF, Count Vectorizer)
 Date-Time Feature Extraction
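A hedged sketch (not part of the original key) showing a few of the listed techniques — min-max normalization, z-score standardization, log transformation, binning, label encoding, and one-hot encoding — on a tiny made-up table.

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [20000, 35000, 50000, 120000],
                   "city": ["Chennai", "Mumbai", "Chennai", "Delhi"]})

# Normalization (min-max) and standardization (z-score)
df["income_minmax"] = (df["income"] - df["income"].min()) / \
                      (df["income"].max() - df["income"].min())
df["income_zscore"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Log transformation to reduce skew
df["income_log"] = np.log(df["income"])

# Binning (discretization) into labelled ranges
df["income_band"] = pd.cut(df["income"], bins=[0, 30000, 60000, np.inf],
                           labels=["low", "mid", "high"])

# Label encoding and one-hot encoding of a categorical column
df["city_label"] = df["city"].astype("category").cat.codes
df = pd.get_dummies(df, columns=["city"])
print(df)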

(OR)

16 b Consider you are a data analyst for a smart city initiative that 10 3 5 5 3.3.1
monitors Electric Vehicle (EV) charging station usage across
different locations. Your goal is to clean, transform, and analyze
the data to optimize charging station efficiency, reduce waiting
times, and improve user experience. The dataset contains EV
charging session logs collected from multiple charging stations and
includes the following attributes: Session ID, User ID, Station ID,
Location, Charging Start Time, Charging End Time, Charging
Duration, Energy Consumed (kWh),Cost ($),Payment Method etc.,

Apply various type of Data wrangling techniques to clean and


pre-process the dataset for further analysis with example.

Answer Key:

Data Wrangling Techniques for EV Charging Station Dataset:

1. Handling Missing Values


o Example: Fill missing Payment Method with mode or
mark as "Unknown".
2. Data Type Conversion
o Example: Convert Charging Start Time and Charging
End Time to datetime.
3. Feature Engineering
o Example: Calculate Charging Duration using start and
end times if missing.
4. Removing Duplicates
o Example: Drop duplicate Session ID entries.
5. Normalization/Standardization
o Example: Scale Energy Consumed and Cost for
machine learning models.
6. Filtering Invalid Data
o Example: Remove records with negative Charging
Duration or Energy Consumed.
7. String Cleaning
o Example: Strip whitespace from Location and Payment
Method.
8. Data Aggregation
o Example: Group by Station ID to calculate total energy
used or peak hours.
9. Date-Time Feature Extraction
o Example: Extract hour/day/week from Charging Start
Time to find usage patterns.
10. Outlier Detection and Treatment
    - Example: identify outlier sessions with extremely high Cost or duration using the IQR method.

A short illustrative wrangling sketch follows.
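A hedged sketch (not part of the original key) of a few of the wrangling steps above on an invented EV charging log; the column names follow the attribute list in the question.

import pandas as pd

df = pd.DataFrame({
    "Session ID": [1, 2, 2, 3],
    "Station ID": ["S1", "S1", "S1", "S2"],
    "Charging Start Time": ["2025-04-01 08:00", "2025-04-01 09:30",
                            "2025-04-01 09:30", "2025-04-01 10:15"],
    "Charging End Time": ["2025-04-01 08:45", "2025-04-01 10:10",
                          "2025-04-01 10:10", "2025-04-01 10:05"],
    "Energy Consumed (kWh)": [12.5, 18.0, 18.0, -3.0],
    "Payment Method": ["Card", None, None, "UPI "],
})

df = df.drop_duplicates(subset="Session ID")                       # remove duplicates
df["Payment Method"] = df["Payment Method"].fillna("Unknown").str.strip()
for col in ["Charging Start Time", "Charging End Time"]:           # type conversion
    df[col] = pd.to_datetime(df[col])
df["Charging Duration"] = df["Charging End Time"] - df["Charging Start Time"]
df = df[(df["Charging Duration"] > pd.Timedelta(0)) &              # drop invalid rows
        (df["Energy Consumed (kWh)"] > 0)]
print(df.groupby("Station ID")["Energy Consumed (kWh)"].sum())     # aggregation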

Q17 a. Discuss Matplotlib configuration using different plot styles, with example Python code and graphs. [10 marks, BL 2, CO 4, PO 5, PI 2.2.3]

Answer key: the answer may include the configuration of various plot elements under Matplotlib, with sample code (a short sketch follows).

Matplotlib configuration using different plot styles:
1. Matplotlib supports multiple built-in plot styles that change the appearance of graphs easily.
2. Common styles:
   - 'default': standard Matplotlib look
   - 'ggplot': inspired by R's ggplot2
   - 'seaborn': attractive statistical plots
(OR)
17 b Consider a healthcare data analyst at a research institute studying 10 3 5 5 3.3.1
the connection between dietary habits and common lifestyle-
related diseases. A survey was conducted across different age
groups, and the collected data includes:

 Participant_ID
 Age_Group (e.g., Teen, Adult, Senior)
 Diet_Type (e.g., Vegetarian, Non-Vegetarian, Vegan,
Junk Food)
 Common_Disease (e.g., Obesity, Diabetes, Hypertension,
Heart Disease, None)
 Exercise_Hours_per_Week

Write the sample python code for visualization and plot


the suitable graphs using Matplotlib for the following:

i) Show how many people follow each diet type?


ii) Visualize which diet types are more frequently associated with
specific diseases?
iii) Identify which age groups are more prone to specific
diseases?
iv) Plot the distribution of exercise hours for people with and
without diseases?

Answer Key:

1. Bar Plot for Diet Types: Displays the count of people


following each diet type.
2. Stacked Bar Plot for Diet and Disease: Visualizes the
relationship between diet types and common diseases using
a stacked bar chart.
3. Stacked Bar Plot for Age Group and Disease: Shows
which age groups are more prone to specific diseases.
4. Box Plot for Exercise Hours and Disease: Compares the
distribution of exercise hours between people with and
without diseases.
Example plots: bar plot, stacked bar plots, and box plot as described above (a code sketch follows).
These visualizations give a clear understanding of the dataset in terms of diet, age group, disease prevalence, and exercise habits.
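
A minimal sketch for parts (i), (ii) and (iv), assuming the survey responses are already in a DataFrame df with the column names listed in the question (the file name survey.csv is a placeholder):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('survey.csv')   # hypothetical file with the listed columns

# (i) Bar plot: number of people following each diet type
df['Diet_Type'].value_counts().plot(kind='bar', color='steelblue')
plt.title('Participants per Diet Type')
plt.xlabel('Diet Type')
plt.ylabel('Count')
plt.show()

# (ii) Stacked bar plot: diet type vs. common disease
pd.crosstab(df['Diet_Type'], df['Common_Disease']).plot(kind='bar', stacked=True)
plt.title('Diseases by Diet Type')
plt.xlabel('Diet Type')
plt.ylabel('Count')
plt.show()

# (iv) Box plot: exercise hours for people with and without diseases
df['Has_Disease'] = df['Common_Disease'].ne('None')   # derived helper column
df.boxplot(column='Exercise_Hours_per_Week', by='Has_Disease')
plt.show()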

Course Outcome (CO) and Bloom’s level (BL) Coverage in Questions:


CO Coverage (bar chart): CO 1 – 53%, CO 2 – 26%, CO 3 – 21%
Register
Number

SRM Institute of Science and Technology


Set -
College of Engineering and Technology
School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN)

Test: FT4 Date: 29-04-2025


Course Code & Title: 21CSS303T-Data Science Duration: Two periods
Year& Sem: III Year /VI Sem Max.Marks:50

Course Articulation Matrix:


Course PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
Outcome
CO3 - - - - 1 - - - - - - -
CO4 - - - - 1 - - - - - - -
CO5 - - - - 1 - - - - - - -
Note: CO3 – To identify data manipulation and cleaning techniques using pandas
CO4 – To constructs the Graphs and plots to represent the data using python packages
CO5 – To apply the principles of the data science techniques to predict and forecast the outcome of real-
world problem
Part – A (10 x 1 = 10 Marks)
Instructions:
1) Answer ALL questions.
2) The duration for answering Part A is 15 minutes (this sheet will be collected after 15 minutes).
3) Encircle the correct answer.
S.No Question Marks BL CO PO PI Code
1 What is a recommended technique for handling datasets that do not fit 1 1 3 5
into memory?
A. Load the entire data into a list
B. Use streaming or chunking techniques
C. Increase screen resolution
D. Use nested loops
2 What parameter allows merge() to join datasets using an index instead 1 1 3 5
of a column?
A. on_index=True
B. use_index=True
C. left_index=True/right_index=True
D. by_index=True
3 What is the default method of dropna() in pandas? 1 1 3 5
A. Drops rows with missing values
B. Replaces missing values with 0
C. Drops columns with duplicates
D. Sorts data
4 What is binning in data preprocessing? 1 2 3 5
A. Filling missing values
B. Converting continuous variables into categorical bins
C. Merging two datasets
D. Sorting data by time
5 Which of the following techniques can be used to detect outliers or noise 1 2 3 5
in a dataset?
A. Pivoting
B. One-hot encoding
C. Z-score or IQR methods
D. Data splitting
6 Which command is used to create subplots in Matplotlib? 1 1 4 5
A. plt.subplots()
B. plt.sub()
C. plt.mplot()
D. plt.subplotview()
7 What is Seaborn primarily used for? 1 1 4 5
A. Connecting APIs
B. Creating responsive websites
C. Creating statistical graphics on top of Matplotlib
D. Managing databases
8 In Seaborn, which function is used to plot pairwise relationships in a 1 1 4 5
dataset?
A. sns.relations()
B. sns.matrixplot()
C. sns.pairplot()
D. sns.gridplot()
9 What function is used to create a scatter plot in Matplotlib? 1 2 5 5
A. plt.point()
B. plt.scatter()
C. plt.dot()
D. plt.circles()
10 What is the purpose of a histogram? 1 2 5 5
A. To show relationship between two variables
B. To display data distribution and frequency
C. To visualize classification performance
D. To plot trends over time
Register
Number

SRM Institute of Science and Technology


College of Engineering and Technology Set -
School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN SEM)

Test: FT4 Date:29-04-2025


Course Code & Title: 21CSS303T-Data Science Duration: Two periods
Year& Sem: III Year /VI Sem Max.Marks:50

Part – B (4 x 5 = 20 Marks)
Instructions: Answer ANY FOUR Questions
Q.No Question Marks BL CO PO PI Code

11 Explain the difference between reshaping, pivoting, and concatenating 5 2 3 5


datasets using pandas.
Ans:
 Reshaping: Changing the structure of data (e.g., melt() converts
wide to long format).
 Pivoting: Converting long data into a wide format (e.g., pivot()
makes a column's values into new columns).
 Concatenating: Combining multiple datasets along rows or
columns (e.g., concat()).
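
A small illustrative example of the three operations (data made up):

import pandas as pd

wide = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02'],
                     'Apple': [100, 200],
                     'Banana': [150, 50]})

# Reshaping: wide -> long with melt()
long = wide.melt(id_vars='Date', var_name='Product', value_name='Sales')

# Pivoting: long -> wide again with pivot()
back_to_wide = long.pivot(index='Date', columns='Product', values='Sales')

# Concatenating: stack two DataFrames row-wise with concat()
combined = pd.concat([wide, wide], ignore_index=True)

print(long, back_to_wide, combined, sep='\n\n')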

12 Apply binning and standardization to a numerical dataset. Why are these 5 3 3 5


processes important in data preparation?
Ans:
Binning and standardization are important data preprocessing
techniques to improve the performance of machine learning models.
1. Binning: Converts continuous variables into discrete categories
to reduce noise and make patterns clearer.
o Example:
import pandas as pd
data = pd.Series([1, 5, 7, 9, 10, 14, 20])
bins = [0, 5, 10, 20]
labels = ['Low', 'Medium', 'High']
binned_data = pd.cut(data, bins=bins, labels=labels)
2. Standardization: Scales data to have a mean of 0 and standard
deviation of 1, which helps models converge faster.
o Example:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data.values.reshape(-1, 1))
Why important?
 Binning: Simplifies complex data, making it easier for models to
detect patterns.
 Standardization: Ensures that all features are on the same scale,
preventing some features from dominating others in models.

13 Compare and contrast the methods of handling missing data. When 5 2 3 5


would you use each?
Ans:
Removing Missing Data:
 Method: Drop rows or columns with missing values (dropna()).
 Use: When missing data is small and won't significantly affect
the analysis or when data loss is acceptable.
Imputation:
 Method: Fill missing values with a constant (e.g., 0), mean,
median, mode, or predicted values.
 Use: When missing data is significant and removing it would
lead to loss of important information.
Forward/Backward Fill:
 Method: Fill missing values with the previous (or next)
available data (ffill(), bfill()).
 Use: When data is time-series or ordered, and filling missing
values with neighboring data is logical.
Predictive Imputation (e.g., using ML):
 Method: Use machine learning algorithms to predict missing
values based on other features.
 Use: When missing data is substantial and imputation needs to
be more sophisticated.
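
A brief sketch of the first three approaches on a toy DataFrame (values made up):

import pandas as pd
import numpy as np

df = pd.DataFrame({'temp': [21.0, np.nan, 23.5, np.nan, 25.0]})

dropped = df.dropna()                         # removing missing data
mean_filled = df.fillna(df['temp'].mean())    # imputation with the mean
forward_filled = df.ffill()                   # forward fill (time-ordered data)

print(dropped, mean_filled, forward_filled, sep='\n\n')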

14 Demonstrate how to generate a 3D surface plot using Matplotlib. 5 3 4 5


Mention the required imports and customization options.
Ans:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Create data
X = np.linspace(-5, 5, 100)
Y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(X, Y)
Z = np.sin(np.sqrt(X**2 + Y**2))

# Create a figure and 3D axis


fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Plot the surface


ax.plot_surface(X, Y, Z, cmap='viridis')

# Customize labels
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')

# Show plot
plt.show()
Customization Options:
cmap: Color map for the surface (e.g., 'viridis', 'plasma').

ax.set_xlabel(), ax.set_ylabel(), ax.set_zlabel(): Customize axis labels.

ax.plot_surface(): You can add more options like edgecolor, alpha for
transparency, etc.
15 Use Seaborn to create a pairplot and customize its style using 5 3 5 5
sns.set_style() on iris dataset. What insights can a pairplot provide?
Ans:
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset


iris = sns.load_dataset('iris')

# Set the style for the plot


sns.set_style('whitegrid')

# Create a pairplot
sns.pairplot(iris, hue='species')

# Show the plot


plt.show()
Customization:
sns.set_style('whitegrid'): Sets the plot background to white with a grid,
which enhances readability.

hue='species': Colors the points according to the different species of the


Iris flower, which helps in visualizing the relationship between features
across categories.

Insights Provided by a Pairplot:


Relationships between Variables: Shows scatter plots between each pair
of features (e.g., Sepal Length vs. Sepal Width), allowing you to identify
correlations.

Distributions: The diagonal plots (histograms or KDEs) show the


distribution of each feature.

Cluster Patterns: Helps detect if species clusters are separable based on the
features (e.g., the species may be visually separable in certain feature
combinations).

Part – C (2 x 10 = 20 Marks)
Instructions: Answer ALL questions.

Q.No Question Marks BL CO PO PI Code
16 a Describe and compare various techniques used to clean and prepare raw 10 2 3 5
datasets for analysis. Include examples of handling missing data,
standardization, string cleaning, and binning. Give python code
examples of each.

Ans: 1. Handling Missing Data


 Method: Removing or imputing missing values.
 Example:
o Remove rows with missing data:

import pandas as pd
df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})
df_cleaned = df.dropna() # Remove rows with any missing values
o Impute missing data:

df_imputed = df.fillna(df.mean()) # Replace missing with column


mean
2. Standardization (Scaling)
 Method: Scale features to have a mean of 0 and a standard
deviation of 1.
 Example:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['A', 'B']])
3. String Cleaning
 Method: Remove or replace unwanted characters, whitespace,
or patterns from string columns.
 Example:
df['Name'] = df['Name'].str.strip().str.replace(r'\d+', '', regex=True)  # Remove digits and whitespace
4. Binning (Discretization)
 Method: Convert continuous variables into categorical bins.
 Example:
df['Age'] = pd.cut(df['Age'], bins=[0, 18, 35, 50, 100], labels=['Child',
'Young', 'Adult', 'Senior'])

(OR)

16 b Write and explain a complete data transformation workflow using a 10 3 3 5


sample dataset that includes missing values, text inconsistencies,
numeric scaling, and outliers. Give examples using python code.
Ans: Load the Dataset:

import pandas as pd
import numpy as np

# Sample data with missing values, text inconsistencies, and outliers


data = {
'Age': [25, np.nan, 22, 35, 110, 29, 200],
'Salary': [50000, 60000, np.nan, 45000, 120000, 70000, 400000],
'Name': ['John Doe', ' Jane smith ', 'alice johnson', 'BOB', 'alice', '
john', ' jane'],
'City': ['New York', 'Los Angeles', 'New York', np.nan, 'San
Francisco', 'New York', 'Miami']
}
df = pd.DataFrame(data)
1. Handle Missing Values:
 Impute missing values with appropriate methods (mean for
numeric, mode for categorical).

# Impute missing numeric values


df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# Impute missing categorical values


df['City'] = df['City'].fillna(df['City'].mode()[0])
2. Text Cleaning:
 Standardize text data by removing extra spaces, converting to
lowercase, etc.

# Clean and standardize text data


df['Name'] = df['Name'].str.strip().str.title() # Capitalize names and
remove leading/trailing spaces
df['City'] = df['City'].str.strip().str.title() # Ensure consistent city names
3. Handle Outliers:
 Identify and remove outliers using the IQR method.

# Identifying outliers in 'Age' and 'Salary' using IQR


Q1_age = df['Age'].quantile(0.25)
Q3_age = df['Age'].quantile(0.75)
IQR_age = Q3_age - Q1_age
lower_bound_age = Q1_age - 1.5 * IQR_age
upper_bound_age = Q3_age + 1.5 * IQR_age

# Remove outliers
df = df[(df['Age'] >= lower_bound_age) & (df['Age'] <=
upper_bound_age)]
4. Numeric Scaling (Standardization):
 Standardize numeric columns like 'Age' and 'Salary' to have
zero mean and unit variance.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])
Final Dataframe:

print(df)

17 a Explain how Matplotlib helps in customizing plots. Describe how to 10 2 4 5


control axes, add labels, legends, annotations, and apply plot styles with
examples.
Explain the differences and use-cases of different plot types: Line plot,
Bar chart, Histogram, Box plot, Scatter plot, and Pair plot.
Ans:

Matplotlib provides powerful customization options for creating


and enhancing plots. You can control various elements like axes,
labels, legends, annotations, and styles. Here's how to customize
these features:
1. Controlling Axes:
 You can control the axis limits, ticks, and labels using
set_xlim(), set_ylim(), and set_xticks()/set_yticks().

import matplotlib.pyplot as plt


x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.plot(x, y)
plt.xlim(0, 5) # Set x-axis limit
plt.ylim(0, 20) # Set y-axis limit
plt.show()
2. Adding Labels and Title:
 xlabel(), ylabel(), and title() are used to add labels and titles.

plt.plot(x, y)
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Plot Title')
plt.show()
3. Legends:
 Use legend() to add a legend to the plot. You can label your
plots during plotting and then call legend().
plt.plot(x, y, label='y = x^2')
plt.legend()
plt.show()
4. Annotations:
 Use annotate() to add text or markers to specific points on
the plot.

plt.plot(x, y)
plt.annotate('Peak', xy=(2, 4), xytext=(3, 5),
arrowprops=dict(facecolor='red', arrowstyle="->"))
plt.show()
5. Applying Plot Styles:
 Use plt.style.use() to apply predefined styles such as ggplot,
seaborn, etc.

plt.style.use('ggplot')
plt.plot(x, y)
plt.show()

Different Plot Types and Their Use-Cases


1. Line Plot:
o Use-case: Ideal for showing trends over time or
continuous data.
o Example: Plotting stock prices or temperature
changes.

plt.plot([1, 2, 3, 4], [1, 4, 9, 16])


plt.show()
2. Bar Chart:
o Use-case: Useful for comparing quantities across
different categories (categorical data).
o Example: Comparing sales across different
products.

plt.bar(['A', 'B', 'C'], [3, 7, 2])


plt.show()
3. Histogram:
o Use-case: Shows the distribution of data, often for
continuous numerical data.
o Example: Displaying the distribution of ages in a
dataset.

plt.hist([1, 2, 2, 3, 3, 3, 4], bins=4)


plt.show()
4. Box Plot:
o Use-case: Useful for visualizing the distribution of
data, including outliers, median, and quartiles.
o Example: Analyzing the spread of test scores.

plt.boxplot([1, 2, 3, 4, 5, 6, 7])
plt.show()
5. Scatter Plot:
o Use-case: Displays relationships between two
variables, useful for correlation analysis.
o Example: Visualizing the relationship between
height and weight.

plt.scatter([1, 2, 3, 4], [1, 4, 9, 16])


plt.show()
6. Pair Plot:
o Use-case: Used for visualizing relationships
between multiple variables in a dataset.
o Example: Showing pairwise relationships in the Iris
dataset.

import seaborn as sns


iris = sns.load_dataset('iris')
sns.pairplot(iris, hue='species')
plt.show()

(OR)
17 b Apply advanced Seaborn visualizations to explore patterns in a real 10 3 5 5
dataset. Include pair plots, heatmaps, and style settings. Write a Python
program to visualize a 3D surface plot. Explain each component used in
the plot.
Ans:

import seaborn as sns


import matplotlib.pyplot as plt

# Load dataset
iris = sns.load_dataset('iris')

# Set style
sns.set_style("whitegrid")

# Pair plot
sns.pairplot(iris, hue='species')
plt.show()

# Heatmap (correlation matrix)


corr = iris.drop('species', axis=1).corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()

Explanation:
sns.set_style(): Sets plot background style.

pairplot(): Shows pairwise relationships and class separation.

heatmap(): Highlights correlations between numeric features.

3D Surface Plot with Matplotlib


import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Data for the surface


X = np.linspace(-5, 5, 100)
Y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(X, Y)
Z = np.sin(np.sqrt(X**2 + Y**2))

# Create 3D plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Plot surface
surf = ax.plot_surface(X, Y, Z, cmap='viridis')

# Add labels
ax.set_xlabel('X Axis')
ax.set_ylabel('Y Axis')
ax.set_zlabel('Z Axis')
plt.title('3D Surface Plot')
plt.show()

Explanation:
Axes3D: Enables 3D plotting.

meshgrid: Generates grid for surface.

plot_surface: Draws the 3D surface.

cmap: Applies color styling to surface.

Course Outcome (CO) and Bloom’s level (BL) Coverage in Questions:

CO Coverage (bar chart): CO 1 – 53%, CO 2 – 26%, CO 3 – 21%
Register
Number

SRM Institute of Science and Technology


Set -
College of Engineering and Technology
School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN)

Test: FT4 Date: 29-04-2025


Course Code & Title: 21CSS303T-Data Science Duration: Two periods
Year& Sem: III Year /VI Sem Max.Marks:50

Course Articulation Matrix:


Course PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
Outcome
CO3 - - - - 1 - - - - - - -
CO4 - - - - 1 - - - - - - -
CO5 - - - - 1 - - - - - - -
Note: CO3 – To identify data manipulation and cleaning techniques using pandas
CO4 – To constructs the Graphs and plots to represent the data using python packages
CO5 – To apply the principles of the data science techniques to predict and forecast the outcome of real-
world problem
Part – A (10 x 1 = 10 Marks)
Instructions:
1) Answer ALL questions.
2) The duration for answering Part A is 15 minutes (this sheet will be collected after 15 minutes).
3) Encircle the correct answer.

S.No Question Marks BL CO PO PI Code

1 State the data wrangling operation that handles errors, missing data and 1 1 3 5 5.4.1
inconsistencies
a. Validation
b. Data enrichment
c. Cleaning
d. Organization
2 Name the pandas method that can be used to combine DataFrames using one 1 1 3 5 5.4.1
or more keys, as in database join operations
a. pandas.concat
b. pandas.merge
c. DataFrame.combine_first
d. DataFrame.join
3 Define the objective of imputation process 1 1 3 5 5.4.1
a. Remove entire rows or columns containing missing values
b. Remove pairs of observations where at least one value is missing
c. Replacing missing data with estimated values
d. Remove noise from the dataset using some algorithms
4 Identify the reshape process among the following that turns unique values 1 2 3 5 5.4.1
from one column into new column headers, effectively transforming long-
form data to wide -form
a. Melting
b. Stacking
c. Pivoting
d. Unstacking

5 Which among the following is a common measure of dispersion of data 1 2 3 5 5.4.1


a. median
b. standard deviation
c. histogram
d. skewness
6 In Matplotlib, which of the following correctly creates a subplot at position 5 1 1 4 5 5.5.2
in a 4-row by 3-column grid?
a. plt.subplot(3, 4, 5)
b. plt.subplot(5, 3, 4)
c. plt.subplot(4, 3, 5)
d. plt.subplot(5, 4, 3)
7 From the below list, recall the construct used to add text or markers to 1 1 4 5 5.4.1
specific locations on a plot to highlight particular features
a. Legends
b. Labels
c. Annotations
d. Ticks
8 Among the following statements, recognize the correct statement about 1 1 4 5 5.5.1
Python’s matplotlib.pyplot package
a. pyplot is used only for 3D plotting in Python.
b. pyplot automatically displays plots without the need to call show().
c. pyplot provides a MATLAB-like interface for creating static,
interactive, and animated plots.
d. pyplot cannot save plots in pdf format.
9 Identify the Seaborn package feature that allows you to visualize relationship 1 2 5 5 5.4.1
between all pairs of numeric columns in DataFrames
a. FacetGrid
b. Pairplot
c. Scatterplot
d. subplot
10 Identify the incorrect statement regarding seaborn package 1 2 5 5 5.5.1
a. Seaborn is a data visualization library built on top of Matplotlib
b. Seaborn allow us to represent data points in three-dimensional space
c. Seaborn can be imported using import matplotlib.seaborn as sns
d. Seaborn can be used to visualize textual data by creating wordcloud
Register
Number

SRM Institute of Science and Technology


College of Engineering and Technology Set -
School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN SEM)

Test: FT4 Date:29-04-2025


Course Code & Title: 21CSS303T-Data Science Duration: Two periods
Year& Sem: III Year /VI Sem Max.Marks:50

Part – B (4 x 5 = 20 Marks)
Instructions: Answer ANY FOUR Questions

Q.No Question Marks BL CO PO PI Code

11 Discuss different data structures that help optimize memory and 5 2 3 5 5.6.1
computation while handling large data volumes. Briefly review their
strengths and weaknesses.

Ans:
Data structures have different storage requirements, but also
influence the performance of CRUD (create, read, update, and
delete) and other operations on the data set

• Tree is a hierarchical data structure where each node has a parent


and may have child nodes, used for searching and sorting. Trees are
a class of data structure that allows you to retrieve information much
faster than scanning through a table
• Hash is a key-value data structure that provides fast lookups using
a hash function. A key for every value in your data and put the keys
in a bucket. This way you can quickly retrieve the information by
looking in the right bucket when you encounter the data.
Dictionaries in Python are a hash table implementation, and they’re
a close relative of key-value stores
• Sparse data refers to datasets with mostly zero or missing values,
stored efficiently to save memory.
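
An illustrative sketch of a hash (dictionary) lookup and a sparse representation (the scipy.sparse import is an assumption; any sparse-matrix library could be used):

import numpy as np
from scipy import sparse

# Hash / key-value lookup: O(1) retrieval by key
salary_by_id = {'E01': 40000, 'E02': 35000, 'E03': 50000}
print(salary_by_id['E02'])

# Sparse data: store only the non-zero entries of a mostly-zero matrix
dense = np.zeros((1000, 1000))
dense[0, 1] = 7.0
dense[500, 250] = 3.5
compressed = sparse.csr_matrix(dense)
print(compressed.data, compressed.nnz)   # only 2 values actually stored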
12 Given the following scenario, perform appropriate data cleaning, 5 3 3 5 5.5.2
transformation, and merging steps:

Dataset A contains employee records with columns: EmpID, Name, Age,


and Department. Some age values are missing, and department names
have inconsistent casing (e.g., "HR", "hr", "Hr").

Dataset B contains salary details with columns: EmpID, MonthlySalary.

Write Python code (using pandas) to:


1. Clean the Age using suitable imputation
2. Clean the Name by removing unnecessary spaces
3. Apply standardize capitalization on the column Department.
4. Merge the two datasets on EmpID.
5. Display the total salary aggregated on the Department column

(You may assume dummy data for illustration.)

Ans:
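The steps below assume dummy dictionaries data_a and data_b; one possible sketch of such data:

import pandas as pd
import numpy as np

# Hypothetical data matching the columns described in the question
data_a = {'EmpID': [1, 2, 3, 4],
          'Name': ['  Asha ', 'Ravi', ' Meena', 'Kiran '],
          'Age': [25, np.nan, 30, np.nan],
          'Department': ['HR', 'hr', 'Sales', 'SALES']}
data_b = {'EmpID': [1, 2, 3, 4],
          'MonthlySalary': [40000, 35000, 50000, 45000]}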
1. Convert datasets to DataFrames
df_a = pd.DataFrame(data_a)
df_b = pd.DataFrame(data_b)

2. Clean the Age column using suitable imputation


df_a['Age'].fillna(df_a['Age'].mean(), inplace=True)

3. Clean the Name column by removing unnecessary spaces


df_a['Name'] = df_a['Name'].str.strip()

4. Standardize capitalization of the Department column


df_a['Department'] = df_a['Department'].str.capitalize()

5. Merge the two datasets on EmpID


merged_df = pd.merge(df_a, df_b, on='EmpID')

6. Display the total salary aggregated by the Department


total_salary_by_dept = merged_df.groupby('Department')['MonthlySalary'].sum().reset_index()

7. Display the result


print(total_salary_by_dept)

13 Distinguish between Z-score normalization and Min-max normalization. 5 2 3 5 5.6.1


Under what data conditions would each method be more appropriate?

Ans:
Z-score normalization is a data preprocessing technique that
transforms numerical data to have a mean of 0 and a standard
deviation of 1. This is particularly useful when dealing with features
that have different scales or units, as it ensures that all features
contribute equally to the model.

Advantages:
1. Handles different Scales
2. Improves Machine Learning Models
3. Reduce Bias
4. Helps with outliers
Min-max normalization is a data preprocessing technique that
scales numerical data to a specific range, typically between 0 and 1.
It's useful when you want to preserve the relative distances between
data points while ensuring that all features have a similar scale
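
A small illustrative computation of both scalings (values made up):

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

# Z-score normalization: mean 0, standard deviation 1
z = (x - x.mean()) / x.std()

# Min-max normalization: rescaled to the range [0, 1]
mm = (x - x.min()) / (x.max() - x.min())

print(z)
print(mm)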

14 Write the python code for creating s 2 X 2 grid of plots with the 5 3 4 5 5.5.2
following subplots using matplotlib.pyplot
1. Grid 1 – line plot
2. Grid 2 – Scatter plot
3. Grid 3 – Bar
4. Gid 4 – histogram

(You may assume dummy data (Qno:12) for illustration.)

Ans:
import matplotlib.pyplot as plt
import numpy as np

#Data
x = np.arange(1, 6)
y = x ** 2
categories = ['A', 'B', 'C', 'D', 'E']
values = [5, 7, 3, 8, 6]
hist_data = np.random.randn(1000)

#Plotting
plt.figure(figsize=(10, 8))

plt.subplot(2, 2, 1)
plt.plot(x, y, marker='o')
plt.title('Line Plot')

plt.subplot(2, 2, 2)
plt.scatter(x, y, color='green')
plt.title('Scatter Plot')

plt.subplot(2, 2, 3)
plt.bar(categories, values, color='orange')
plt.title('Bar Plot')

plt.subplot(2, 2, 4)
plt.hist(hist_data, bins=20, color='purple')
plt.title('Histogram')

plt.tight_layout()
plt.show()
15 You are given a dataset that contains the daily temperature (Temp), 5 3 5 5 5.5.2
humidity (Humidity), and air quality index (AQI) recorded over 5 days
.
Days = [1,2,3,4,5]
Temperature = [23,25,28,32,35]
AQI = [3,5,4,2,5]
Write Python code using Seaborn and Matplotlib to visualize the
relationship among these three variables using a 3D line plot, where:
• X-axis → Day (as a sequence)
• Y-axis → Temperature
• Z-axis → AQI

Ans:
import matplotlib.pyplot as plt
import seaborn as sns

# Data
Days = [1, 2, 3, 4, 5]
Temperature = [23, 25, 28, 32, 35]
AQI = [3, 5, 4, 2, 5]

# Create 3D plot
sns.set(style="whitegrid")
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')

# Plotting the 3D line


ax.plot(Days, Temperature, AQI, marker='o', color='blue',
label='Temp vs AQI')

# Label axes
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (°C)')
ax.set_zlabel('AQI')
ax.set_title('3D Line Plot of Day vs Temperature vs AQI')

# Show plot
plt.legend()
plt.show()

Part – C (2 x 10 = 20 Marks)
Instructions: Answer ALL questions.

Q.No Question Marks BL CO PO PI Code
16 a How missing values are represented in a dataset? With examples, 10 2 3 5 5.5.1
describe the various imputation techniques used for handling of missing
values so that there is minimum loss of information.

Ans:
Imputation is the process of replacing missing data with estimated
values to maintain dataset integrity.

Mean/Median/Mode Imputation: Replace missing values with


the mean, median, or mode of the respective column. This is a
simple approach but can introduce bias if the distribution is
skewed
When to Use:
•Mean: Best for normally distributed data.
•Median: Preferred when data is skewed or has outliers.
•Mode: Used for categorical data.
K-Nearest Neighbors (KNN) Imputation: Impute missing
values using the average values of the k nearest neighbors. This
method can be effective for numerical data.

Regression Imputation: Use regression models to predict


missing values based on other features. This is suitable for
numerical data with strong relationships between features.

Multiple Imputation: Create multiple imputed datasets by filling


in missing values with different plausible values. This method can
help to account for uncertainty in the imputation process

Choosing the right approach

The best approach for handling missing values depends on the


nature of your data, the amount of missing data, and the specific
requirements of your analysis. Consider the following factors:

• Amount of missing data: If there are many missing values,


imputation might be preferable to deletion.
• Distribution of missing data: If missingness is random,
imputation might be suitable. If missingness is related to
other variables, more sophisticated techniques might be necessary.
• Impact of missing data on the analysis: If missing values are
likely to bias your results, it's important to address
them.

Give a simple example.
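
A simple illustrative example (values made up):

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22, np.nan, 25, 29, np.nan],
                   'City': ['Chennai', 'Madurai', None, 'Chennai', 'Chennai']})

# Missing values appear as NaN (numeric columns) or None/NaN (object columns)
print(df.isnull().sum())

df['Age'] = df['Age'].fillna(df['Age'].median())       # median imputation
df['City'] = df['City'].fillna(df['City'].mode()[0])   # mode imputation
print(df)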

(OR)

16 b You are given a Pandas DataFrame containing a column Customer_Info 10 3 3 5 5.5.2


with inconsistent entries like:

" Mr. Ramesh K , Chennai - 600001 "


"Ms. PRIYA D,COIMBATORE-641002"
"Dr. Arjun,Madurai - 625001"
"Mrs. Leela S , Chennai - 6251 "

Perform the following tasks using Pandas string manipulation methods:


1. Strip leading and trailing whitespaces from the entire
Customer_Info column.
2. Replace all hyphens - with a single space and convert multiple
spaces to a single space.
3. Extract the following components into new columns:
o Title (Mr., Ms., Dr., etc.)
o Name (in uppercase)
o City (in title case)
4. Pad the PIN code column (if needed) so that all valid entries
have 6 digits (e.g., "6251" becomes "006251").

Ans:
import pandas as pd

# Create dataframe
data = {
'Customer_Info': [
" Mr. Ramesh K , Chennai - 600001 ",
"Ms. PRIYA D,COIMBATORE-641002",
"Dr. Arjun,Madurai - 625001",
"Mrs. Leela S , Chennai - 6251 "
]
}

df = pd.DataFrame(data)

1. Strip leading and trailing whitespaces

df['Customer_Info'] = df['Customer_Info'].str.strip()

2. Replace hyphens with space and normalize multiple


spaces
df['Customer_Info'] = df['Customer_Info'].str.replace('-', ' ',
regex=False)
df['Customer_Info'] = df['Customer_Info'].str.replace(r'\s+', ' ',
regex=True)

3. Extract Title, Name, City, and PIN using regex


df[['Title', 'Name', 'City', 'PIN']] = df['Customer_Info'].str.extract(
    r'(Mr\.|Mrs\.|Ms\.|Dr\.)\s+([A-Za-z\s]+),?\s*([A-Za-z]+)\s+(\d+)', expand=True
)
4. Format extracted fields
df['Name'] = df['Name'].str.upper().str.strip()
df['City'] = df['City'].str.title().str.strip()

5. pad PIN with zeros if less than 6 digits


df['PIN'] = df['PIN'].str.zfill(6)

print(df[['Title', 'Name', 'City', 'PIN']])

17 a Explain the features of Seaborn library. Also describe the importance 10 2 4 5 5.5.1
of Facet Grid, joint plot and pair plot with example implementation.

Ans:
• Seaborn is a library mostly used for statistical plotting in
Python.
• It is built on top of Matplotlib and provides beautiful default
styles and color palettes to make statistical plots more
attractive.

Features of Seaborn

Statistical Graphics: Seaborn is specifically designed for


creating statistical graphics, providing built-in functions for
common visualizations like scatter plots, line plots, histograms,
and more. This makes it easier to create visually appealing and
informative plots for data analysis.

Data Visualization Themes: Seaborn offers pre-defined styles


and themes that can quickly change the overall appearance of your
plots. This helps create consistent and aesthetically pleasing
visualizations without requiring extensive customization.

Integration with Pandas and NumPy: Seaborn seamlessly


integrates with Pandas and NumPy, making it easy to work with
dataframes and arrays directly. This simplifies the workflow and
reduces the amount of code needed for data analysis and
visualization.

FacetGrid and Pair Plots: Seaborn provides FacetGrid for


grouping data and creating subplots based on categorical
variables. This is useful for comparing distributions or
relationships across different groups. Pair plots allow you to
visualize the relationships between all pairs of numeric columns
in a DataFrame, helping you identify correlations and patterns.

Customization and Flexibility: While Seaborn provides a high-


level interface, it's built on top of Matplotlib, giving you access to
its extensive customization options. This allows you to fine-tune
your plots to meet your specific needs.

Ease of Use: Seaborn's API is designed to be user-friendly and


intuitive, making it easier to learn and use compared to
Matplotlib. Its documentation is also well-written and provides
clear examples.

3D Plots

FacetGrid: Group data by a categorical variable and plot


individual subplots for each category.

g = sns.FacetGrid(df, col="hue", height=4)

Jointplot: Visualize the relationship between two variables and


their distributions.

sns.jointplot(x='x', y='y', kind="scatter", data =data)

Pairplot: Visualize the relationships between all pairs of numeric


columns in a DataFrame.

sns.pairplot(df)
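
A runnable version of the three snippets above, using Seaborn's built-in tips dataset as stand-in data:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')

# FacetGrid: one histogram of total_bill per value of the 'time' column
g = sns.FacetGrid(tips, col='time', height=4)
g.map(sns.histplot, 'total_bill')

# Jointplot: scatter plot plus marginal distributions
sns.jointplot(x='total_bill', y='tip', kind='scatter', data=tips)

# Pairplot: pairwise relationships between numeric columns
sns.pairplot(tips)

plt.show()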

(OR)

17 b You are provided with a sample dataset of product sales in a CSV file 10 3 5 5 5.5.2
named product_sales.csv. The dataset contains the following columns:

Product_ID Category Region Units_Sold Sale_Price


P001 Electronics South 120 14500
P002 Furniture North 75 9800
P003 Electronics East 10 13200
P004 Clothing West 160 3200
P005 Furniture South 90 8900
P006 Electronics East 110 15000
P007 Clothing North 140 3000
Using Seaborn, generate:
• A histogram showing the distribution of Units_Sold for all products.
• A box plot comparing Sale_Price across different Category values.
Ans:
Import Libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the CSV file


df = pd.read_csv('product_sales.csv')

# Set Seaborn style


sns.set(style='whitegrid')

#1. Histogram of Units_Sold


plt.figure(figsize=(8, 5))
sns.histplot(df['Units_Sold'], bins=10, kde=True, color='skyblue')
plt.title('Distribution of Units Sold')
plt.xlabel('Units Sold')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

#2. Box plot of Sale_Price by Category


plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='Category', y='Sale_Price', palette='Set2')
plt.title('Sale Price by Product Category')
plt.xlabel('Product Category')
plt.ylabel('Sale Price')
plt.tight_layout()
plt.show()

Course Outcome (CO) and Bloom’s level (BL) Coverage in Questions

CO Coverage (bar chart showing values of 55% and 45%)
SRM Institute of Science and Technology
Set -
College of Engineering and Technology
School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN)

Test: FT4 Date: 29-04-2025


Course Code & Title: 21CSS303T-Data Science Duration: Two periods
Year& Sem: III Year /VI Sem Max.Marks:50
SET –
Answer Key

Part – A (10 x 1 = 10 Marks)

S.No Question Marks

1 a) Data Collection → Data Cleaning → Data Transformation → Data Analysis 1

2 b) Replaces all NaN values with 0 1


3 d) merge() 1

4 a) To combine datasets horizontally or vertically 1

5 b) Remove non-numeric values or replace them with NaN 1

6 b) To add text annotations to specific points on the plot 1

7 c) Pair plot 1

8 a) plt.style.use('seaborn-darkgrid') 1

9 b) sns.histplot() 1

10 d) Plots a scatter plot matrix grouped by species 1

Q. Part – B (4 x 5 = 20 Marks) Marks


No Instructions: Answer ANY FOUR

11 Discuss the general programming tips to deal with large data sets. 5
 Don’t reinvent the wheel. Use tools and libraries developed by others
 Get the most out of your hardware. Your machine is never used to it full potential;
with simple adaptions you can make it work harder.
 Reduce the computing need. Slim down your memory and processing needs as much
as possible.
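For example, a large CSV can be processed in chunks instead of being loaded at once (the file and column names here are hypothetical):

import pandas as pd

total = 0.0
# Process the file in 100,000-row chunks so it never has to fit in RAM
for chunk in pd.read_csv('big_sales.csv', chunksize=100_000):
    total += chunk['Sales'].sum()
print(total)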
12 When merging two DataFrames in pandas that have columns with the same name, 5
how can you ensure the column names are distinguishable?
Use the suffixes parameter in the merge() function to add distinguishing suffixes to
overlapping column names.
import pandas as pd
df1 = pd.DataFrame({'ID': [1, 2], 'Value': [10, 20]})
df2 = pd.DataFrame({'ID': [1, 2], 'Value': [30, 40]})
merged_df = pd.merge(df1, df2, on='ID', suffixes=('_left', '_right'))
print(merged_df)
13 Given the dataset data ={'Ages': [3, 18, 22, 10, 25, 29, 34, 14, 40, 45, 50, 55, 60, 12, 65, 5
70, 75, 80, 85]}, categorize the continuous Ages values into the groups of children,
young, middle, and elder. Define appropriate age ranges for each category and
implement the conversion.
import pandas as pd
data = {'Ages': [3, 18, 22, 10, 25, 29, 34, 14, 40, 45, 50, 55, 60, 12, 65, 70, 75, 80,
85]}
df = pd.DataFrame(data)
bins = [0, 12, 24, 59, 100]
labels = ['Child', 'Young', 'Middle', 'Elder']
df['Category'] = pd.cut(df['Ages'], bins=bins, labels=labels)
print(df)
14 Compare a box plot and a histogram, highlighting their use cases and strengths. 5
Box Plot:
 Displays the distribution of data and highlights outliers.
 Ideal for comparing multiple datasets.
Histogram:
 Shows the frequency distribution of data values.
 Useful for understanding the shape of the data (e.g., skewness).
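A minimal sketch drawing both plots from the same made-up data:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(200)

fig, axs = plt.subplots(1, 2, figsize=(8, 4))
axs[0].boxplot(data)        # spread, median, quartiles, outliers
axs[0].set_title('Box Plot')
axs[1].hist(data, bins=20)  # frequency distribution / shape
axs[1].set_title('Histogram')
plt.tight_layout()
plt.show()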

15 How can you control the line properties (e.g., color, style, and width) of a chart in 5
Matplotlib. Write the python code and explain.

import matplotlib.pyplot as plt


x = [1, 2, 3, 4, 5]
y = [10, 20, 15, 25, 30]
plt.plot(x, y, color='red', linestyle='--', linewidth=2)
plt.title("Line Properties Example")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()

 color: Sets the line color.


 linestyle: Controls the style of the line (e.g., dashed, solid).
 linewidth: Adjusts the thickness of the line.
Q. Part – C (2 x 10 = 20 Marks) Marks
No Instructions: Answer ALL questions.

16 a  Missing Data: 10
Fill missing sales values with the median (robust to outliers). Drop rows
if there are very few missing values.
Example Code:
df['Sales'] = df['Sales'].fillna(df['Sales'].median())
 Irregular Formats:
Convert all dates into a uniform format (YYYY-MM-DD) using
pd.to_datetime.
Example Code:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
 Duplicate Records:
Remove rows where Product, Region, and Date are duplicated, keeping the
first occurrence:
Example Code:
df = df.drop_duplicates(subset=['Product', 'Region', 'Date'], keep='first')

 Irrelevant Data:
Drop unnecessary or irrelevant columns like Transaction ID
Example Code:
df = df.drop(columns=['Transaction ID'])
 Outliers:
Identify outliers in Sales using the interquartile range (IQR)
Example Code:
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df['Sales'] >= lower_bound) & (df['Sales'] <= upper_bound)]

 Categorical Inconsistencies:
Standardize inconsistent product names using a mapping dictionary:
Example Code:
product_mapping = {'Appl': 'Apple', 'Bananaa': 'Banana'}
df['Product'] = df['Product'].replace(product_mapping)

 Merging:
Load the profit margins dataset and merge with the sales data on Product and
Region
Example Code:
profit_data = pd.read_csv('profit_margins.csv')
df = pd.merge(df, profit_data, on=['Product', 'Region'], how='inner')
.
 Final Quality Checks
 Ensure all columns have consistent data types:

df['Sales'] = df['Sales'].astype(float)
df['Date'] = pd.to_datetime(df['Date'])

 Verify no missing or inconsistent values remain:

print(df.isnull().sum())
16 b Output of pivot_df = df.pivot(index='Date', columns='Product', 10
values='Sales')
The pivot function reshapes the DataFrame by specifying:
 index: Rows of the resulting DataFrame (Date here).
 columns: Columns of the resulting DataFrame (Product here).
 values: Data to fill the cells (Sales here).
Output:

Product Apple Banana


Date
2023-01-01 100 150
2023-01-02 200 50
Discussion:
 The rows are indexed by Date.
 The columns are determined by unique values in Product.
 The values in the cells are taken from the Sales column.

2. Output of stacked_df = df.stack()


The stack function compresses columns into a hierarchical index at the row level.
Output:
0 Date 2023-01-01
Product Apple
Region North
Sales 100
1 Date 2023-01-01
Product Banana
Region North
Sales 150
2 Date 2023-01-02
Product Apple
Region South
Sales 200
3 Date 2023-01-02
Product Banana
Region South
Sales 50
dtype: object
Discussion:
 Each row is identified by a combination of the original row index (e.g., 0,
1, etc.) and the column name (e.g., Date, Product, Region, Sales).
 The DataFrame is reshaped into a Series with a multi-level index.

3. Output of stacked_pivot = pivot_df.stack()


The stack function on pivot_df moves the columns (Product) back into the row
index.
Output:
Date Product
2023-01-01 Apple 100
Banana 150
2023-01-02 Apple 200
Banana 50
dtype: int64
Discussion:
 The columns (Apple, Banana) are turned into a new hierarchical index level
under Date.
 The resulting structure is a Series, with the multi-level index representing
the combination of Date and Product.

17 a Functionalities of the Seaborn Library 10


Seaborn is a Python library built on top of Matplotlib that simplifies creating
informative and aesthetically pleasing statistical graphics. It provides high-level
interfaces for creating attractive and complex visualizations.
Key Features:
1. Theme Customization: Automatically styles plots for aesthetics.
2. Dataset-Oriented: Works efficiently with DataFrames and arrays.
3. Built-In Statistical Analysis: Includes options for regression, distribution
fitting, and more.
4. Integration with Pandas: Seamless handling of DataFrame columns.
5. Wide Range of Plot Types: Includes pair plots, box plots, violin plots,
heatmaps, and more.

Examples of Key Visualizations


1. Pair Plot
A pair plot is useful for visualizing pairwise relationships in a dataset, especially
numerical features. It provides scatterplots for relationships and histograms for
univariate distributions.
Code Example:
import seaborn as sns
import pandas as pd

# Example Dataset
data = sns.load_dataset('iris')

# Pair Plot
sns.pairplot(data, hue='species')
Functionality:
 Displays scatter plots between every pair of numerical columns.
 Includes diagonal histograms to visualize the distribution of each feature.
 Uses hue to color the data points based on a categorical column (species).

2. Box Plot
A box plot summarizes the distribution of a dataset through five-number summary
statistics: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
It also highlights potential outliers.
Code Example:
# Box Plot
sns.boxplot(x='species', y='sepal_width', data=data)
Functionality:
 Displays distributions and compares groups (e.g., species) for a numerical
column (sepal_width).
 Identifies outliers as points outside the whiskers.
 Can be enhanced with swarm plots to overlay individual data points.

3. Histogram
A histogram visualizes the distribution of a single numerical variable by grouping
data into bins.
Code Example:
# Histogram
sns.histplot(data['sepal_length'], kde=True, bins=20)
Functionality:
 Shows the frequency of data points within specified bins.
 Optionally overlays a kernel density estimate (KDE) curve for a smoothed
representation of the distribution.
 Parameters like bins control the granularity of the visualization.

17 b Creating and Interpreting 3D Surface Plots 10


Example Program (5 marks for the correct usage of the add_subplot, plot_surface, and set_xlabel methods)
3D surface plots are a type of visualization used to represent three-dimensional
data where the z-axis corresponds to the dependent variable, and the x and y axes
represent independent variables. These plots are useful for exploring relationships
between variables and identifying patterns or trends.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Generate data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))

# Create the 3D plot


fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
surface = ax.plot_surface(X, Y, Z, cmap='viridis', edgecolor='none')

# Add labels and a color bar


ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_zlabel('Z-axis')
plt.colorbar(surface, ax=ax, shrink=0.5, aspect=10)
plt.title('3D Surface Plot Example')
plt.show()

Scenarios Where 3D Surface Plots Are Beneficial ( 5 Mark)


1. Data Exploration in Engineering:
 Analyzing stress, temperature, or pressure distribution over a 2D
plane.
 Example: Surface temperature of a material under specific
conditions.
2. Optimization Problems:
 Visualizing cost functions in machine learning or operations
research.
 Example: Understanding the shape of loss functions during model
training.
3. Geographic and Environmental Data:
 Representing terrain elevation or pollution levels.
 Example: A surface plot of altitude over a geographical region.
4. Physics and Mathematics:
 Illustrating functions or mathematical surfaces.
 Example: Visualizing wave functions or potential fields.
5. Economics and Finance:
 Exploring relationships between variables like interest rates, risk,
and returns.
 Example: 3D visualization of portfolio optimization.
Register
Number

SRM Institute of Science and Technology


Set -
College of Engineering and Technology
School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN)

Test: FT4 Date: 29-04-2025


Course Code & Title: 21CSS303T-Data Science Duration: Two periods
Year& Sem: III Year /VI Sem Max.Marks:50

Course Articulation Matrix:


Course PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
Outcome
CO3 - - - - 1 - - - - - - -
CO4 - - - - 1 - - - - - - -
CO5 - - - - 1 - - - - - - -
Note: CO3 – To identify data manipulation and cleaning techniques using pandas
CO4 – To constructs the Graphs and plots to represent the data using python packages
CO5 – To apply the principles of the data science techniques to predict and forecast the outcome of real-
world problem
Part – A (10 x 1 = 10 Marks)
Instructions:
1) Answer ALL questions.
2) The duration for answering Part A is 15 minutes (this sheet will be collected after 15 minutes).
3) Encircle the correct answer.
S.No Question Marks BL CO PO PI Code
1 Which of the following tools is used for compactly storing large arrays and 1 1 3 5
supports memory-mapping?

A) NumPy
B) Matplotlib
C) Bcolz
D) Seaborn
2 What does the method combine_first () do in data wrangling? 1 1 3 5

A) Deletes duplicate data


B) Fills missing values with data from another DataFrame
C) Joins two DataFrames based on index
D) Sorts data in ascending order

3 Which of these is not a general technique to handle large datasets? 1 1 3 5

A) Data compression
B) Parallel processing
C) Data visualization
D) Batch learning
4 Which Python library enables parallel execution and optimization of computation 1 2 3 5
flow?

A) Pandas
B) Matplotlib
C) Dask
D) Numexpr

5 What is the main purpose of using the melt() function in pandas? 1 2 3 5

A) To remove duplicates
B) To convert wide data into long format
C) To merge datasets
D) To perform statistical analysis
6 Which function is used to create a histogram in Matplotlib? 1 1 4 5

A) plot()
B) hist()
C) bar()
D) scatter()
7 What does plt.legend() do in a Matplotlib plot? 1 1 4 5

A) Sets the plot title


B) Adds a legend to the plot
C) Changes the axis labels
D) Saves the figure
8 Which of the following Seaborn functions helps to visualize pairwise 1 1 4 5
relationships in a dataset?

A) jointplot()
B) pairplot()
C) distplot()
D) catplot()
9 Which parameter controls the resolution of the saved figure using savefig()? 1 2 5 5

A) dpi
B) bbox_inches
C) pad_inches
D) format
10 Which type of annotation includes text with arrows to highlight specific points? 1 2 5 5

A) Tick
B) Title
C) Callout
D) Label
Register
Number

SRM Institute of Science and Technology


College of Engineering and Technology Set -
School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN SEM)

Test: FT4 Date:29-04-2025


Course Code & Title: 21CSS303T-Data Science Duration: Two periods
Year& Sem: III Year /VI Sem Max.Marks:50

Part – B (4 x 5 = 20 Marks)
Instructions: Answer ANY FOUR Questions
Q.No Question Marks BL CO PO PI Code
11 Explain the issues faced when handling large datasets and suggest suitable 5 2 3 5
techniques to address them.

Answer:
Issues:
 Memory overload: Large datasets exceed available RAM,
causing system slowdown or crashes.
 Slow processing: Algorithms may become inefficient due to data
volume.
 CPU starvation: Inefficient use of processing power leads to idle
CPU time.
 I/O bottlenecks: Reading/writing large data to/from disk is slow.
Techniques:
 Data compression: Use tools like Bcolz to reduce memory usage.
 Chunking: Process data in smaller batches.
 Parallelism: Tools like Dask allow computations across multiple
CPU cores.
 Efficient libraries: Use optimized Python tools like Numexpr,
Numba, and Theano.

12 Illustrate with examples how missing data is handled using pandas in 5 3 3 5


Python.

Answer:
Techniques:
1. Detect missing data:
 df.isnull().sum()

2. Drop missing data:


 df.dropna() – removes rows with any NaNs.
 df.dropna(axis=1) – removes columns with NaNs.
3. Fill missing data:
df.fillna(0)
ex:
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['Salary'].fillna(df['Salary'].median(), inplace=True)

4. Forward/Backward fill:

 df.fillna(method='ffill') # Propagate previous value


 df.fillna(method='bfill') # Propagate next value
13 Explain the difference between merge() and join() functions in pandas 5 2 3 5
with suitable examples.

Feature      | merge()                  | join()
Basis        | Joins on columns         | Joins on index
Flexibility  | SQL-style joins          | Simpler syntax
Use Case     | Dataset joining by keys  | Combining based on index
Example      | merge() on a column      | join() on index

# merge() on a column
df = pd.merge(df1, df2, on='ID', how='inner')

# join() on index
df1.set_index('ID', inplace=True)
df2.set_index('ID', inplace=True)
df1.join(df2, how='outer')

14 Demonstrate how to create multiple subplots using Matplotlib and 5 3 4 5


annotate a point in the plot.

Creating subplots:

import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2)  # 1 row, 2 columns


axs[0].plot([1, 2, 3], [4, 5, 6])
axs[1].plot([1, 2, 3], [6, 5, 4])

Annotating a point:
plt.annotate('Peak', xy=(2, 5), xytext=(2, 6),
arrowprops=dict(facecolor='black', arrowstyle='->'))

15 Explain the purpose and usage of Pair Plots and Joint Plots in Seaborn with 5 3 5 5
example code.

pairplot():
 Displays pairwise relationships in a dataset.
 Useful for exploring patterns and correlations.

Jointplot():

 Combines scatterplot and histograms.


 Shows distribution and relationship of two variables.
Example :

import seaborn as sns


df = sns.load_dataset("iris")

# Pair plot
sns.pairplot(df, hue="species")

# Joint plot
sns.jointplot(x='sepal_length', y='sepal_width', data=df)
Part – C (2 x 10 = 20 Marks)
Instructions: Answer ALL questions.
Q.No Question Marks BL CO PO PI Code
16 a Explain various data wrangling operations such as reshaping, pivoting, 10 2 3 5
and merging in pandas with examples.

(7 Marks)
Reshaping:
 pivot() – converts long to wide format.
 melt() – converts wide to long format.

Merging:
 merge() – combines datasets using key(s).
 join() – merges using index.
 concat() – appends datasets row/column-wise.

Pivot
df.pivot(index='Date', columns='City', values='Sales')

Melt
df.melt(id_vars='Date', var_name='City', value_name='Sales')

Merge
pd.merge(df1, df2, on='ID', how='inner')

Examples to be given (3 marks)


(OR)
16 b Apply different data cleaning techniques such as handling missing data, 10 3 3 5
standardization, and outlier detection using pandas.

Handling Missing Data:


dropna(), fillna() with mean/median/mode.

Standardization:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['Age']])
Outlier Detection:

IQR method:

Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Age'] < Q1 - 1.5*IQR) | (df['Age'] > Q3 + 1.5*IQR)]

17 a Demonstrate how to create multiple subplots, control axes, and 10 2 4 5


customize labels and legends using Matplotlib.

Creating Subplots: (5 marks)


fig, axs = plt.subplots(2, 2)
axs[0, 0].plot([1, 2, 3], [4, 5, 6])
Control Axes:
plt.xlim(0, 5)
plt.ylim(0, 10)

 Add xlabel(), ylabel()



Use legend() with labels

Add title with title()
Annotation:

plt.annotate('Point A', xy=(2, 5), xytext=(3, 6),


arrowprops=dict(facecolor='blue', arrowstyle='->'))

All plots to be drawn (5 marks)

(OR)
17 b Construct different Seaborn visualizations including pair plots, scatter 10 3 5 5
plots, and joint plots, and explain their use in analysis.

(5 marks)
Pair Plot:
Explores multiple variables at once.
sns.pairplot(df)

Scatter Plot:
 Visualizes relationship between two variables.
sns.scatterplot(x='Age', y='Income', data=df)
Joint Plot:
 Combines scatter and histograms for deeper insight.
sns.jointplot(x='Age', y='Income', data=df)

(5marks)
Explain the plots using diagrams.

Use cases:
 Detect correlation
 Identify clusters and trends
 Explore distributions

Course Outcome (CO) and Bloom’s level (BL) Coverage in Questions

CO Coverage (bar chart showing values of 55% and 45%)
Register
Number

SRM Institute of Science and Technology


Set -
College of Engineering and Technology
School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN)

Test: FT4 Date: 29-04-2025


Course Code & Title: 21CSS303T-Data Science Duration: Two periods
Year& Sem: III Year /VI Sem Max.Marks:50

Course Articulation Matrix:


Course PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
Outcome
CO3 - - - - 1 - - - - - - -
CO4 - - - - 1 - - - - - - -
CO5 - - - - 1 - - - - - - -
Note: CO3 – To identify data manipulation and cleaning techniques using pandas
CO4 – To constructs the Graphs and plots to represent the data using python packages
CO5 – To apply the principles of the data science techniques to predict and forecast the outcome of real-
world problem
Part – A (10 x 1 = 10 Marks)
Instructions:
1) Answer ALL questions.
2) The duration for answering Part A is 15 minutes (this sheet will be collected after 15 minutes).
3) Encircle the correct answer.
S.No Question Marks BL CO PO PI Code
1 In data wrangling, what does the term “imputation” refer to? 1 1 3 5
A. Dropping columns
B. Filling in missing values
C. Renaming variables
D. Removing duplicates
2 What does df1.join(df2, how='outer') do? 1 1 3 5
A. Performs an outer join on columns
B. Merges df2 into df1 on index, including all entries from both
C. Merges by common column
D. Appends rows
3 What is the output of the code? 1 1 3 5
s = "abcdefghijk"
result = s[8:2:-2]
print(result)
A. "igec"
B. "igda"
C. "igca"
D. "hfdb"

4 In which scenario would the following code fail to detect outliers? 1 2 3 5


z_scores = stats.zscore(data)
outliers = np.where(np.abs(z_scores) > 3)
A. If data is normally distributed
B. If outliers are beyond ±3 standard deviations
C. If outliers are within ±3 standard deviations
D. If data has no variation
5 What is the output of the code? 1 2 3 5
s = "one,two,three,four"
result = "-".join([word.upper() for word in
s.split(",")])
print(result)
A. "ONE-TWO-THREE-FOUR"
B. "one-two-three-four"
C. "ONE,TWO,THREE,FOUR"
D. An error occurs

6 What does this annotation code do? 1 1 4 5


plt.annotate('Peak', xy=(5, 10), xytext=(6, 12),
arrowprops=dict(facecolor='black', shrink=0.05))
A. Adds a legend with an arrow
B. Labels a point and draws an arrow
C. Adds a title to the figure
D. Plots an arrow without annotation

7 Consider the code below that creates a scatter plot with Seaborn: 1 1 4 5
sns.relplot(x="sepal_length", y="sepal_width",
data=iris, hue="species",
kind="scatter", alpha=0.7)
Which of the following statements best explains the use of alpha=0.7?
A. It reduces the marker size.
B. It adjusts the transparency to help visualize overlapping points.
C. It changes the color palette.
D. It increases the line width for plot boundaries.

8 What does the following Matplotlib code snippet do? 1 1 4 5


plt.text(0.5, 0.5, 'Hello, World!', fontsize=14,
rotation=45,ha='center', va='center', color='red')
A. Places the text at the center of the figure with a 45° clockwise rotation
B. Centers the text at (0.5, 0.5) of the axes coordinate system with 45°
rotation and red color
C. Rotates the text by 45° around the origin and aligns left
D. Places the text at data coordinates (0.5, 0.5) with no rotation

9 In the following code snippet, what is the role of the rstride and 1 2 5 5
cstride parameters?
surf = ax.plot_surface(X, Y, Z, cmap='viridis',
rstride=1, cstride=1)
A. They define the number of rows and columns in the data grid.
B. They control the sampling (row and column stride) of the input data
for rendering the surface.
C. They set the resolution of the color mapping.
D. They adjust the transparency of the surface.

10 Consider the following code snippet. What does it accomplish? 1 2 5 5


fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
X, Y = np.meshgrid(np.linspace(-5, 5, 50),
np.linspace(-5, 5, 50))
Z = np.sin(np.sqrt(X**2 + Y**2))
surf = ax.plot_surface(X, Y, Z, cmap='plasma',
edgecolor='none')
A. It creates a wireframe 3D surface plot of a sine function.
B. It generates a smooth 3D surface plot using a sine function with the
'plasma' colormap and no edge lines.
C. It plots a scatter plot of sine values in 3D space.
D. It creates a contour plot on a 3D axis.
Register
Number

SRM Institute of Science and Technology


College of Engineering and Technology Set -
School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN SEM)

Test: FT4 Date:29-04-2025


Course Code & Title: 21CSS303T-Data Science Duration: Two periods
Year& Sem: III Year /VI Sem Max.Marks:50

Part – B (4 x 5 = 20 Marks)
Instructions: Answer ANY FOUR Questions
Q.No Question Marks BL CO PO PI Code

11 Explain the process of data wrangling. Describe at least three key 5 2 3 5


steps involved, discuss why data wrangling is important in data
analysis, and provide a brief example to illustrate your answer.
• Data Wrangling is one of those technical terms that are more or
less self-descriptive.
• The term "wrangling" refers to rounding up information in a
certain way.

• Discovery: Before starting the wrangling process, it is critical to


think about what may lie beneath your data.
• Organization: After you've gathered your raw data within a
particular dataset, you must structure your data.
• Cleaning: When your data is organized, you can begin cleaning
your data. Data cleaning involves removing outliers, formatting
nulls, and eliminating duplicate data.
• Data enrichment: This step requires that you take a step back from
your data to determine if you have enough data to proceed.
• Validation: After determining you gathered enough data, you will
need to apply validation rules to your data. Validation rules,
performed in repetitive sequences, confirm that data is consistent
throughout your dataset.
• Publishing: The final step of the data munging process is data
publishing. This involves providing notes and documentation of your
wrangling process and creating access for other users and
applications.
Example:
Suppose you have a dataset on customer purchases with the
following columns: customer_id, purchase_date, amount_spent,
and coupon_used. The data may have issues like missing values
in amount_spent, duplicates in customer_id, and inconsistent
date formats.
Steps involved in data wrangling for this example:
1. Remove Duplicates:
data.drop_duplicates(subset='customer_id',
inplace=True)
2. Handle Missing Values:
data['amount_spent'].fillna(data['amount_spent'
].mean(), inplace=True)
3. Convert Date Format:
data['purchase_date'] = pd.to_datetime(data['purchase_date'], format='%Y-%m-%d')

12 Explain how merging using indices differs from merging on 5 3 3 5


columns in pandas. In your answer, describe the key steps and
benefits of merging on an index and provide a brief Python code
example to illustrate this method.
In pandas, merging can be done on column values or on index labels,
depending on how your data is structured.
Merging on Columns
This is the default behavior of pd.merge(), where you specify one or
more columns from both DataFrames to match rows.
Example:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name':


['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [1, 2], 'score': [85,
90]})

merged = pd.merge(df1, df2, on='id')


print(merged)
output:
id name score
0 1 Alice 85
1 2 Bob 90
Rows are matched where values in the id column are equal.

Merging Using Indices


When merging on indices, pandas uses the row labels (index values) to
align and join rows instead of specific columns. This is done with:
 df1.join(df2) — by default joins on index
 pd.merge(df1, df2, left_index=True,
right_index=True)
Benefits of Merging on Index:
1. Simplifies merging when the index holds meaningful identifiers
(like time series data or grouped keys).
2. Avoids resetting indexes or adding redundant ID columns.
3. Supports hierarchical (multi-level) indices in complex datasets.
Example: Merging on Index

import pandas as pd

# Create two DataFrames with custom indices


df1 = pd.DataFrame({'name': ['Alice', 'Bob',
'Charlie']}, index=[101, 102, 103])
df2 = pd.DataFrame({'score': [88, 92]},
index=[101, 102])

# Merge using index


merged = df1.join(df2) # same as df1.join(df2,
how='left')
print(merged)
Output:
name score
101 Alice 88.0
102 Bob 92.0
103 Charlie NaN
The join is done based on the index values, not a column. Index 103 has
no match, so NaN is inserted.

13 Give a credit risk model for a fintech startup. The dataset includes 5 2 3 5
columns: credit_score, income, loan_amount, defaulted (Yes/No), and
age. Perform the following task to prepare the data for modeling.
a. Group credit_score into risk categories: 'Poor', 'Fair', 'Good',
'Excellent'.
b. Standardize income and loan_amount.
c. Summarize the average loan amount and default rate for each
credit risk category.
d. Explain why binning and standardization are important in this
context.
Step-by-Step Data Preparation
a. Group credit_score into risk categories
categorize credit scores into bins:
import pandas as pd
import numpy as np
# Example DataFrame
df = pd.DataFrame({
'credit_score': [580, 660, 710, 780, 620],
'income': [30000, 45000, 60000, 80000, 35000],
'loan_amount': [5000, 7000, 10000, 12000, 6000],
'defaulted': ['Yes', 'No', 'No', 'No', 'Yes'],
'age': [25, 35, 45, 50, 30]
})

# Define credit score bins


bins = [0, 599, 659, 719, 850]
labels = ['Poor', 'Fair', 'Good', 'Excellent']

# Create risk category


df['risk_category'] = pd.cut(df['credit_score'],
bins=bins, labels=labels)
print(df)

Output:
credit_score income loan_amount defaulted age risk_category
0 580 30000 5000 Yes 25 Poor
1 660 45000 7000 No 35 Good
2 710 60000 10000 No 45 Good
3 780 80000 12000 No 50 Excellent
4 620 35000 6000 Yes 30 Fair
b. Standardize income and loan_amount
Standardization centers values to a mean of 0 and a standard deviation of
1:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['income_scaled', 'loan_amount_scaled']] = scaler.fit_transform(df[['income', 'loan_amount']])
print(df)
Output:
credit_score income loan_amount defaulted age risk_category \
0 580 30000 5000 Yes 25 Poor
1 660 45000 7000 No 35 Good
2 710 60000 10000 No 45 Good
3 780 80000 12000 No 50 Excellent
4 620 35000 6000 Yes 30 Fair

income_scaled loan_amount_scaled
0 -1.100964 -1.150447
1 -0.275241 -0.383482
2 0.550482 0.766965
3 1.651446 1.533930
4 -0.825723 -0.766965
c. Summarize average loan and default rate per risk category
# Convert 'defaulted' to binary
df['defaulted_binary'] = df['defaulted'].map({'Yes':
1, 'No': 0})

# Group by credit risk


summary = df.groupby('risk_category').agg({
'loan_amount': 'mean',
'defaulted_binary': 'mean'
}).rename(columns={
'loan_amount': 'avg_loan_amount',
'defaulted_binary': 'default_rate'
})

print(summary)

output:
avg_loan_amount default_rate
risk_category
Poor 5000.0 1.0
Fair 6000.0 1.0
Good 8500.0 0.0
Excellent 12000.0 0.0

d. Why are binning and standardization important?


🔹 Binning (Grouping Credit Scores):
 Simplifies modeling by converting continuous scores into
understandable categories.
 Enables models and stakeholders to easily interpret risk levels
("Fair", "Good", etc.).
 Helps capture non-linear relationships between credit score and
default probability.
🔹 Standardization:
 Ensures numerical features like income and loan amount are on
the same scale.
 Crucial for algorithms sensitive to scale
 Prevents high-magnitude variables from dominating model
weights.

14 Write a Python program using Matplotlib to create a single figure 5 3 4 5


with three subplots arranged in 1 row and 3 columns. Plot the
following functions in each subplot:
1. First subplot: plot y=x
2. Second subplot: plot y = x²
3. Third subplot: plot y = x³
Use the range x=-10 to x=10 for all plots. Add titles to each subplot
and label the x and y axes appropriately.
import matplotlib.pyplot as plt
import numpy as np

# Define the range of x values


x = np.linspace(-10, 10, 400)

# Define y values for each function


y1 = x
y2 = x**2
y3 = x**3

# Create a figure and subplots


fig, axes = plt.subplots(1, 3, figsize=(18, 5))  # 1 row, 3 columns

# First subplot: y = x
axes[0].plot(x, y1, color='blue')
axes[0].set_title('Plot of y = x')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')

# Second subplot: y = x^2


axes[1].plot(x, y2, color='green')
axes[1].set_title('Plot of y = x²')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')

# Third subplot: y = x^3


axes[2].plot(x, y3, color='red')
axes[2].set_title('Plot of y = x³')
axes[2].set_xlabel('x')
axes[2].set_ylabel('y')

# Adjust layout to prevent overlapping


plt.tight_layout()

# Display the plots


plt.show()

Output: (figure showing the three subplots: y = x, y = x², y = x³)
15 Write a Python program that demonstrates the use of 3D plotting by doing 5 3 5 5
the following:
 Create a 3D plot using any mathematical function or parametric
equations of your choice.
 Plot the data using a 3D axis (ax = fig.add_subplot(...,
projection='3d')).
 Customize the plot using color maps, line styles, or markers for
better visualization.
import numpy as np
import matplotlib.pyplot as plt

# Create the figure and 3D axes


fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

# Generate data for x, y


x = np.linspace(-6, 6, 100)
y = np.linspace(-6, 6, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))

# Plot the surface with a color map


surf = ax.plot_surface(X, Y, Z, cmap='plasma',
edgecolor='k', linewidth=0.5, antialiased=True)

# Add a color bar for reference


fig.colorbar(surf, ax=ax, shrink=0.5, aspect=10)

# Customize labels
ax.set_title('3D Surface Plot of z = sin(sqrt(x² + y²))', fontsize=14)
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_zlabel('Z-axis')

# Adjust view angle


ax.view_init(elev=30, azim=45)

# Show the plot


plt.tight_layout()
plt.show()

Output: (3D surface plot of z = sin(√(x² + y²)) with the 'plasma' colormap)
Part – C (2 x 10 = 20 Marks)
Instructions: Answer ALL questions.

Q. Question Marks BL CO PO PI
No Code
16 a Consider the basic dataset that contains student details collected 10 2 3 5
during admissions. The dataset contains errors and inconsistencies
that need to be addressed before it can be used for reporting and
visualization.
student_id Name Age Email grade
1 John Smith 20 john.smith@email.com A
2 SARA -1 sara123@email.com B+
3 Riya Kapoor NaN riya_kapoor@gmail A
4 Tom Brown 19 tom.brown@email.com None
5 22 B
6 alex johnson 0 alex.j@email.com A+
Write Python code to perform the following data cleaning
operations:
a. Identify and remove rows where the name or email is
missing or blank.
b. Replace invalid age values (e.g., 0, -1, or NaN) with the
mean age of valid entries.
c. Strip extra spaces in the name column and convert all
names to proper title case.
d. Standardize grade values by replacing None with
"Incomplete".
e. Remove rows with invalid email addresses (those without
"@" or a "." after the "@").
f. Display a summary of the cleaned dataset using
df.describe() or df.info().
g. Explain two potential risks if this dataset is used in its raw
form for decision-making.

Python code:

import pandas as pd
import numpy as np
data = {
'student_id': [1, 2, 3, 4, 5, 6],
    'Name': ['John Smith', 'SARA', 'Riya Kapoor', 'Tom Brown', '', 'alex johnson'],
'Age': [20, -1, np.nan, 19, 22, 0],
'Email': ['john.smith@email.com',
'sara123@email.com', 'riya_kapoor@gmail',
'tom.brown@email.com', '',
'alex.j@email.com'],
'grade': ['A', 'B+', 'A', None, 'B', 'A+']
}
df = pd.DataFrame(data)
print(df)
Identify and remove rows where the name or email is missing
or blank.
df = df[(df['Name'].notna()) &
(df['Name'].str.strip() != '') &
(df['Email'].notna()) &
(df['Email'].str.strip() != '')]
print(df)
output:
student_id Name Age Email grade
0 1 John Smith 20.0 john.smith@email.com A
1 2 SARA -1.0 sara123@email.com B+
2 3 Riya Kapoor NaN riya_kapoor@gmail A
3 4 Tom Brown 19.0 tom.brown@email.com None
5 6 alex johnson 0.0 alex.j@email.com A+
Replace invalid age values (e.g., 0, -1, or NaN) with the mean
age of valid entries.
valid_ages = df['Age'][df['Age'] > 0]
mean_age = valid_ages.mean()
df['Age'] = df['Age'].apply(lambda x: mean_age
if pd.isna(x) or x <= 0 else x)
print(df)
output:
student_id Name Age Email grade
0 1 John Smith 20.0 john.smith@email.com A
1 2 SARA 19.5 sara123@email.com B+
2 3 Riya Kapoor 19.5 riya_kapoor@gmail A
3 4 Tom Brown 19.0 tom.brown@email.com None
5 6 alex johnson 19.5 alex.j@email.com A+
Strip extra spaces in the name column and convert all names to
proper title case.
df['Name'] = df['Name'].str.strip().str.title()
print(df)
Output:
student_id Name Age Email grade
0 1 John Smith 20.0 john.smith@email.com A
1 2 Sara 19.5 sara123@email.com B+
2 3 Riya Kapoor 19.5 riya_kapoor@gmail A
3 4 Tom Brown 19.0 tom.brown@email.com None
5 6 Alex Johnson 19.5 alex.j@email.com A+
Standardize grade values by replacing None with "Incomplete".
df['grade'] = df['grade'].fillna('Incomplete')
print(df)
Output:
student_id Name Age Email grade
0 1 John Smith 20.0 john.smith@email.com A
1 2 Sara 19.5 sara123@email.com B+
2 3 Riya Kapoor 19.5 riya_kapoor@gmail A
3 4 Tom Brown 19.0 tom.brown@email.com Incomplete
5 6 Alex Johnson 19.5 alex.j@email.com A+
Remove rows with invalid email addresses (those without "@"
or a "." after the "@").
def is_valid_email(email):
    if "@" in email:
        local, _, domain = email.partition("@")
        return "." in domain
    return False
df = df[df['Email'].apply(is_valid_email)]
print(df)
Output:
student_id Name Age Email grade
0 1 John Smith 20.0 john.smith@email.com A
1 2 Sara 19.5 sara123@email.com B+
3 4 Tom Brown 19.0 tom.brown@email.com Incomplete
5 6 Alex Johnson 19.5 alex.j@email.com A+
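For step f (displaying a summary of the cleaned dataset), a minimal sketch continuing
from the cleaned DataFrame df above, using the df.info() and df.describe() calls named
in the question:
df.info()                           # column types and non-null counts
print(df.describe(include='all'))   # summary statistics for all columns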
Explain two potential risks if this dataset is used in its raw
form for decision-making.

 Misleading Insights Due to Invalid or Missing Data


If such data is used to analyze age distributions, assign age-based
benefits, or segment students demographically, it could lead to biased or
incorrect conclusions. For example, a scholarship program for students
over 18 might be inaccurately designed based on the skewed average
age.
 Communication Failures and Operational Errors
Using this data for sending admission decisions or updates could lead
to failed communications or privacy issues (e.g., emails sent to the
wrong recipients). This undermines trust in institutional processes and
may result in lost opportunities or legal liability.
(OR)

16 b Given two datasets: 10 3 3 5


customers.csv
Customer_ID Name Age City
C001 Alice 30 New York
C002 Bob 45 Chicago
C003 Charlie 35 San Diego
transactions.csv
Customer_ID Date Purchase_Amount
C001 2024-10-01 250
C002 2024-10-02 100
C004 2024-10-02 300
a. Write the code to merge customers.csv with transactions.csv
using Customer_ID.
b. Explain the difference between inner, left, and outer joins in
this context.
c. Use pd.concat() to vertically combine the customers and a new
small DataFrame with more customer entries.
d. Explain how .combine_first() works and when it is useful.
e. Briefly explain the use of .stack() and .unstack() in reshaping
hierarchical indexes

a. Code to merge customers.csv with transactions.csv using


Customer_ID:
import pandas as pd

# Simulating the datasets


customers = pd.DataFrame({
'Customer_ID': ['C001', 'C002', 'C003'],
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [30, 45, 35],
'City': ['New York', 'Chicago', 'San Diego']
})

transactions = pd.DataFrame({
'Customer_ID': ['C001', 'C002', 'C004'],
    'Date': ['2024-10-01', '2024-10-02', '2024-10-02'],
'Purchase_Amount': [250, 100, 300]
})

# Merging on Customer_ID
merged_df = pd.merge(customers, transactions,
on='Customer_ID')
print(merged_df)

b. Difference between inner, left, and outer joins in this context:


• Inner Join (how='inner'): Only includes rows with matching Customer_ID in both
DataFrames. Result: drops C003 (no transaction) and C004 (not in customers).
• Left Join (how='left'): Keeps all rows from customers and adds matching transactions
if available. Result: keeps C001, C002, C003; C003 will have NaNs for the transaction
columns.
• Outer Join (how='outer'): Includes all rows from both DataFrames, matching where
possible. Result: keeps all customer and transaction entries (C001, C002, C003, C004);
unmatched parts get NaNs.

c. Combine customers with new customers using pd.concat():


new_customers = pd.DataFrame({
'Customer_ID': ['C005', 'C006'],
'Name': ['David', 'Eva'],
'Age': [29, 41],
'City': ['Houston', 'Seattle']
})
all_customers = pd.concat([customers,
new_customers], ignore_index=True)
print(all_customers)
d. Explanation of .combine_first():
.combine_first() is used to fill missing values in a DataFrame
with values from another DataFrame with the same index and
columns.
If df1 has missing values and df2 has some overlapping
rows/columns with non-null values, you can write:
df_combined = df1.combine_first(df2)
It fills in missing values in df1 with corresponding values from
df2.
Useful for: filling gaps in incomplete data from a backup or
fallback dataset.
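A minimal sketch of .combine_first() (the values below are hypothetical, used only for
illustration):
import pandas as pd
import numpy as np

# df1 has gaps; df2 acts as a fallback with the same index and columns
df1 = pd.DataFrame({'score': [85, np.nan], 'city': [np.nan, 'Chicago']},
                   index=['C001', 'C002'])
df2 = pd.DataFrame({'score': [80, 90], 'city': ['New York', 'Chicago']},
                   index=['C001', 'C002'])

# Missing entries in df1 are filled from df2; existing df1 values are kept
print(df1.combine_first(df2))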

e. Brief explanation of .stack() and .unstack() for reshaping:


 .stack(): Converts columns into rows; it moves the inner level
of columns to rows, producing a Series with a MultiIndex.
o Useful to long-form reshape a DataFrame.
 .unstack(): Does the reverse—it pivots the inner row index
level to columns.
o Converts a hierarchical index DataFrame into a wide
format.
Example:
df = pd.DataFrame({
'Category': ['A', 'A', 'B', 'B'],
'Type': ['X', 'Y', 'X', 'Y'],
'Value': [10, 20, 30, 40]
}).set_index(['Category', 'Type'])

# Stack moves 'Value' to inner row index


stacked = df.stack()
# Unstack moves 'Type' to column level
unstacked = df.unstack()

17 a Explain the functionalities and plotting techniques provided by the 10 2 4 5


Seaborn library in Python. Discuss its advantages over Matplotlib
and describe in detail at least three major types of plots with
appropriate code examples and use cases. Also, explain how
Seaborn handles datasets using built-in functions and how it
integrates with Pandas for effective data visualization.

Seaborn is a high-level Python data visualization library built on top of


Matplotlib and tightly integrated with Pandas. It provides an interface
for drawing attractive and informative statistical graphics with just a
few lines of code.

Key Functionalities of Seaborn


1. Statistical Plotting: Supports regression, distribution,
categorical, and matrix plots.
2. Automatic Aesthetics: Uses beautiful default themes and
color palettes.
3. Pandas Integration: Accepts DataFrames directly and uses
column names for axes, hue, style, etc.
4. Built-in Datasets: Offers sample datasets for practice (e.g.,
tips, iris, penguins).
5. Faceting: Easily creates subplots by category (with FacetGrid,
catplot, etc.).
6. Aggregation: Aggregates data behind the scenes for
meaningful summaries (e.g., barplot shows mean by default).

Advantages Over Matplotlib


• Ease of Use: Seaborn offers a high-level API requiring less code; Matplotlib is
low-level and needs more manual configuration.
• Built-in Aggregation: Seaborn aggregates automatically (e.g., mean, CI); Matplotlib
does not.
• Aesthetics: Seaborn provides better default styling and themes; Matplotlib requires
manual customization.
• Pandas Integration: Seaborn is seamless (pass the DataFrame and column names);
Matplotlib requires conversion or manual mapping.
• Statistical Tools: Seaborn has built-in regression, KDE, and violin plots; Matplotlib
needs manual setup or SciPy.

Three Major Plot Types with Code and Use Cases


1. Distribution Plot (sns.histplot, sns.kdeplot)
Used for analyzing the distribution of a numeric variable.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Sample data
df = pd.DataFrame({'Age': [22, 25, 30, 30, 35,
40, 45, 50, 55, 60]})

# Histogram with KDE


sns.histplot(df['Age'], kde=True, bins=5)
plt.title("Age Distribution with KDE")
plt.show()
2. Categorical Plot (sns.boxplot, sns.violinplot, sns.barplot)
Used for comparing distributions or aggregated values across
categories.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Sample data with outliers


data = {
    "A": [1, 2, 3, 4, 5, 30],   # 30 is an outlier
    "B": [2, 4, 6, 8, 7, 28],   # 28 is an outlier
    "C": [3, 6, 9, 5, 2, 7]
}

# Convert data to DataFrame for better visualization
df = pd.DataFrame(data)

# Create a box plot with outliers explicitly shown
sns.boxplot(data=df, showmeans=True, whis=1.5)

# Add a title and labels


plt.title("Box Plot with Outliers")
plt.xlabel("Columns")
plt.ylabel("Values")

# Show the plot


plt.show()

3. Relational Plot (sns.scatterplot, sns.lineplot)


Visualizes relationships between two numeric variables.
# Load the built-in tips dataset and draw a scatterplot
tips = sns.load_dataset('tips')
sns.scatterplot(data=tips, x='total_bill',
                y='tip', hue='sex', style='smoker')
plt.title("Tip vs Total Bill")
plt.show()
 hue adds color for a third variable.
 style changes markers for different categories.

Built-in Dataset Handling


Seaborn provides a variety of built-in datasets for practice, accessible via:
sns.get_dataset_names()          # List available datasets
df = sns.load_dataset('iris')    # Load a dataset as a DataFrame
These datasets are automatically returned as Pandas DataFrames,
making them easy to explore and plot without extra loading steps.

Integration with Pandas


Seaborn is pandas-aware, meaning:
 You can pass entire DataFrames to functions.
 Specify variables with column names (x='col1', y='col2').
 Use groupby-like semantics via hue, col, row for easy faceting.
 Automatically handles missing values and categorical data.
Example: Multiple plots with Pandas-style semantics
sns.catplot(data=tips, x='day',
y='total_bill', hue='sex', kind='box')
plt.show()

(OR)
17 b Describe annotation techniques used in data visualization using 10 3 5 5
Python. Explain the importance of annotations in plots and
demonstrate how annotations can be added using Matplotlib and
Seaborn with appropriate code examples. Include different types of
annotations such as text, arrows, and labels on bar charts, line plots,
and scatter plots.
Annotations are crucial in data visualization as they help highlight
important information, clarify data points, and guide interpretation. In
Python, both Matplotlib and Seaborn support annotation techniques—
since Seaborn builds on Matplotlib, annotations typically use
Matplotlib's functions under the hood.

Importance of Annotations in Plots


 Emphasize key data points (e.g., max/min values, outliers).
 Explain trends in time series or correlations.
 Label elements in bar or scatter plots.
 Make plots more informative and presentation-ready.

Annotation Techniques in Matplotlib


1. Adding Text with plt.text()
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

plt.plot(x, y, marker='o')
plt.text(2, 20, 'Second Point', fontsize=12,
color='red')
plt.title("Text Annotation Example")
plt.show()

2. Using plt.annotate() with Arrows


plt.plot(x, y, marker='o')
plt.annotate(
'Highest Point',
xy=(4, 30), # Point to annotate
xytext=(2.5, 35), # Text location
arrowprops=dict(facecolor='black',
arrowstyle='->'),
fontsize=12
)
plt.title("Arrow Annotation Example")
plt.show()

3.Annotations in Bar Charts


Bar Chart with Text Labels
categories = ['A', 'B', 'C']
values = [10, 15, 7]

plt.bar(categories, values)
for i, v in enumerate(values):
    plt.text(i, v + 0.5, str(v), ha='center', fontweight='bold')
plt.title("Bar Chart with Value Labels")
plt.show()

Annotations in Scatter Plots using Seaborn


import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Sample data
df = sns.load_dataset('tips')
sns.scatterplot(data=df, x='total_bill',
y='tip')

# Annotate a specific point


max_tip = df.loc[df['tip'].idxmax()]
plt.annotate(
f"Max Tip: {max_tip['tip']}",
xy=(max_tip['total_bill'],
max_tip['tip']),
xytext=(max_tip['total_bill'] + 5,
max_tip['tip'] + 2),
arrowprops=dict(facecolor='green',
shrink=0.05)
)
plt.title("Scatter Plot with Annotation")
plt.show()

Annotations in Line Plots


days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
sales = [200, 220, 180, 260, 300]

plt.plot(days, sales, marker='o')


plt.title("Sales Over a Week")

# Annotate peak
plt.annotate(
'Peak Sales',
xy=('Fri', 300),
xytext=('Wed', 310),
arrowprops=dict(arrowstyle='->',
color='red'),
color='red'
)
plt.show()

Annotation Techniques
• plt.text() – adds static text; used for labeling bars or points.
• plt.annotate() – text plus arrows; used for highlighting specific features.
• ax.bar_label() – bar-label shortcut; used for labeling each bar.
• Seaborn + annotate – same as Matplotlib, applied after plotting; used for highlights in plots.

Course Outcome (CO) and Bloom’s level (BL) Coverage in Questions

CO Coverage (bar chart omitted)
Register
Number

SRM Institute of Science and Technology


Set -
College of Engineering and Technology
School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN)

Test: FT4 Date: 29-04-2025


Course Code & Title: 21CSS303T-Data Science Duration: Two periods
Year& Sem: III Year /VI Sem Max.Marks:50

Course Articulation Matrix:


Course PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
Outcome
CO3 - - - - 1 - - - - - - -
CO4 - - - - 1 - - - - - - -
CO5 - - - - 1 - - - - - - -
Note: CO3 – To identify data manipulation and cleaning techniques using pandas
CO4 – To constructs the Graphs and plots to represent the data using python packages
CO5 – To apply the principles of the data science techniques to predict and forecast the outcome of real-
world problem
Part – A (10 x 1 = 10 Marks)
Instructions:
1) Answer ALL questions.
2) The duration for answering Part A is 15 minutes (this sheet will be collected after 15 minutes).
3) Encircle the correct answer.
Question Marks BL CO PO PI
S.No
Code
1 Which of the following methods is used to remove duplicate rows from a 1 1 3 5
Data Frame in pandas?
a) drop()
b) drop_duplicates()
c) unique()
d) remove_duplicates()

2 What function is used to fill missing values in a pandas Data Frame? 1 1 3 5


a) fillna()
b) replace_null()
c) na_fill()
d) fill()

3 Which of the following is NOT a method for handling missing data? 1 1 3 5


a) Deletion
b) Imputation
c) Forward/Backward fill
d) Duplicating

4 When preparing data for modeling, why is scaling important? 1 2 3 5


a) To hide patterns
b) To reduce memory
c) To ensure equal importance of features
d) To convert text to numbers

5 Which of these is NOT a standard data cleaning step? 1 2 3 5


a) Handling missing values
b) Removing duplicates
c) Building machine learning models
d) Correcting data types

6 Which function is used to set the x-axis label in matplotlib? 1 1 4 5


a) plt.labelx()
b) plt.xlabel()
c) plt.xaxis()
d) plt.set_xlabel()
7 Which method is used to add a legend to the plot? 1 1 4 5
a) plt.add_legend()
b) plt.show_legend()
c) plt.legend()
d) plt.make_legend()
8 Which function is used to display multiple plots in one figure? 1 1 4 5
a) plt.split()
b) plt.multi_plot()
c) plt.subplot()
d) plt.div()
9 What function is used to add custom text annotations to a plot? 1 2 5 5
a) plt.comment()
b) plt.annotate()
c) plt.tag()
d) plt.label()
10 Which function sets the size of the overall figure in matplotlib? 1 2 5 5
a) plt.resize()
b) plt.figure(figsize=(w, h))
c) plt.set_size()
d) plt.figsize()
Register
Number

SRM Institute of Science and Technology


College of Engineering and Technology Set -
School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN SEM)

Test: FT4 Date:29-04-2025


Course Code & Title: 21CSS303T-Data Science Duration: Two periods
Year& Sem: III Year /VI Sem Max.Marks:50

Part – B (4 x 5 = 20 Marks)
Instructions: Answer ANY FOUR Questions
Q. Question Marks BL CO PO PI
No Code

5 2 3 5
11 Discuss the various methods of handling missing data in the dataset.
1. Deletion
Listwise Deletion: Remove entire rows or columns containing missing
values. This method is simple but can result in a significant loss of data,
especially if there are many missing values.

Pairwise Deletion: Remove pairs of observations where at least one value


is missing. This is less wasteful than listwise deletion but can introduce
bias if missingness is not random.
2. Imputation:
Mean/Median/Mode Imputation: Replace missing values with the mean,
median, or mode of the respective column. This is a simple approach but
can introduce bias if the distribution is skewed.

K-Nearest Neighbors (KNN) Imputation: Impute missing values using the


average values of the k nearest neighbors. This method can be effective
for numerical data.

Regression Imputation: Use regression models to predict missing values


based on other features. This is suitable for numerical data with strong
relationships between features.

Multiple Imputation: Create multiple imputed datasets by filling in


missing values with different plausible values. This method can help to
account for uncertainty in the imputation process.
3. Using a "Missing" Category (For Categorical Data)
In cases of categorical variables, instead of filling in missing values with
a mode or using imputation, you can create a new category or label
indicating that the data is missing.

For example, for a column like Color, if there are missing values, you can
replace them with "Unknown" or "Missing".
Pandas method: df['Color'].fillna('Unknown').
Preserves information about the missingness.
This method can potentially introduce noise, as the new category may not
represent an actual value.

4. Using Algorithms That Handle Missing Data


Some machine learning algorithms, such as XGBoost, Random Forests,
and CatBoost, can handle missing data internally during training without
requiring explicit imputation.
These models can work directly with missing values by learning from the
patterns of the data.

Not all algorithms can handle missing data natively.

Results may vary depending on the implementation and how the algorithm
handles the missing values.

5. Multiple Imputation
This technique involves creating multiple datasets with different imputed
values and then combining the results to account for uncertainty in the
imputation process.

Typically used when data are missing in a random or non-random fashion.

Methods like Multiple Imputation by Chained Equations (MICE) are


available in libraries like statsmodels and fancyimpute.

6. Predictive Modeling (Advanced Imputation)


Use machine learning algorithms (e.g., regression, decision trees) to
predict missing values based on other features.

A model is trained using the non-missing data and then used to predict
missing values.
7. Leave Missing Values As-Is (For Some Models)
In some cases, particularly when using deep learning models, it may be
acceptable to leave missing values as they are and let the model learn how
to handle them during training.

Models like neural networks can handle missing data if they are explicitly
designed to do so.

May lead to poor model performance if the model does not handle missing
values well.
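A minimal sketch (data values assumed for illustration) showing a few of the above
methods with pandas:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [25, np.nan, 32, 40],
                   'Salary': [50000, 60000, np.nan, 65000],
                   'Color': ['Red', None, 'Blue', 'Green']})

dropped = df.dropna()                              # listwise deletion
df['Age'] = df['Age'].fillna(df['Age'].mean())     # mean imputation
df['Salary'] = df['Salary'].ffill()                # forward fill
df['Color'] = df['Color'].fillna('Unknown')        # "Missing" category
print(df)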

5 3 3 5
12 Explain various data transformation techniques used in data
preprocessing.

Data smoothing is a process that is used to remove noise from the


dataset using some algorithms.
It allows for highlighting important features present in the dataset.
It helps in predicting the patterns.
When collecting data, it can be manipulated to eliminate or reduce any
variance or any other noise form.
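One common smoothing technique is a moving (rolling) average; a minimal pandas sketch
(the readings below are hypothetical):
import pandas as pd

readings = pd.Series([10, 12, 45, 11, 13, 12, 40, 14])    # noisy values
smoothed = readings.rolling(window=3, center=True).mean() # 3-point rolling mean
print(smoothed)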
Attribute Construction:
In the attribute construction method, the new attributes consult the
existing attributes to construct a new data set that eases data mining.
New attributes are created and applied to assist the mining process from
the given attributes. This simplifies the original data and makes the
mining more efficient.

Data Generalization:
Data generalization is the process of converting detailed data into a more
abstract, higher-level representation while retaining essential
information.
It is commonly used in data mining, privacy preservation, and machine
learning to reduce complexity and improve model generalization.
Types
Attribute Generalization
Hierarchical Generalization
Numeric Generalization
Text Generalization

Data collection or aggregation is the method of storing and presenting


data in a summary format.
The data may be obtained from multiple data sources to integrate these
data sources into a data analysis description. This is a crucial step since
the accuracy of data analysis insights is highly dependent on the quantity
and quality of the data used.
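A minimal aggregation sketch using pandas groupby (the column names are assumptions for
illustration):
import pandas as pd

sales = pd.DataFrame({'region': ['North', 'North', 'South', 'South'],
                      'amount': [100, 150, 200, 50]})

# Summarise total and average sales per region
summary = sales.groupby('region')['amount'].agg(['sum', 'mean'])
print(summary)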

Data discretization is the process of converting continuous numerical


data into discrete categories (bins).
It is commonly used in machine learning, data mining, and feature
engineering to simplify models and improve Interpretability.
import pandas as pd

# Sample dataset
data = {'Age': [22, 25, 30, 35, 40, 45, 50, 55, 60]}
df = pd.DataFrame(data)

# Equal-width binning into 3 categories
df['Age_Binned'] = pd.cut(df['Age'], bins=3, labels=['Young', 'Middle-aged', 'Old'])
print(df)

Data normalization is a preprocessing technique used to scale numerical


data into a specific range, usually [0,1] or [-1,1].
It ensures that features contribute equally to a model, preventing bias due
to different scales.
Why Normalize Data?
✅ Improves Machine Learning Performance – Many algorithms (e.g.,
KNN, SVM, Neural Networks) perform better with normalized data.
✅ Speeds Up Convergence – Gradient descent optimizes faster when
features are scaled.
✅ Prevents Dominance of Large-Scale Features – Avoids a situation
where one feature overpowers others.
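A minimal min-max normalization sketch, scaling values to the range [0, 1] (data assumed
for illustration):
import pandas as pd

df = pd.DataFrame({'income': [30000, 45000, 60000, 80000]})

# Min-max normalization: (x - min) / (max - min)
rng = df['income'].max() - df['income'].min()
df['income_norm'] = (df['income'] - df['income'].min()) / rng
print(df)
# sklearn.preprocessing.MinMaxScaler performs the same transformation.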

5 2 3 5
13 Write a Python program that accepts a sentence from the user and
performs the following string operations:
1. Display the total number of words in the sentence.
2. Convert the entire sentence to title case (first letter capitalized).
3. Find and display the number of times the word 'the' appears
(case insensitive).
4. Replace all occurrences of the word 'and' with '&'.
# Accept sentence from the user
sentence = input("Enter a sentence: ")

# 1. Display the total number of words in the sentence


words = sentence.split()
num_words = len(words)
print(f"Total number of words: {num_words}")

# 2. Convert the entire sentence to title case


title_case_sentence = sentence.title()
print(f"Sentence in title case: {title_case_sentence}")

# 3. Find and display the number of times the word 'the' appears (case
insensitive)
word_count_the = sentence.lower().split().count('the')
print(f"Number of times the word 'the' appears: {word_count_the}")

# 4. Replace all occurrences of the word 'and' with '&'


replaced_sentence = sentence.replace('and', '&').replace('AND', '&')
# Note: this simple replace also affects substrings such as 'band'; a regular
# expression with word boundaries (re.sub(r'\band\b', '&', ...)) restricts it to whole words.
print(f"Sentence with 'and' replaced by '&': {replaced_sentence}")

5 3 4 5
14 Explain the concept of subplots in Matplotlib with suitable examples.

In **Matplotlib**, **subplots** are a way of organizing multiple plots


in a single figure. This is useful when you want to display more than one
plot side by side or in a grid, making it easier to compare data and results
visually. The concept of subplots allows you to create a grid layout of
multiple axes (individual plots) within a single figure.

### `plt.subplot()` vs `plt.subplots()`

Matplotlib provides two main functions for creating subplots:


- **`plt.subplot()`**: This function creates a single subplot in a specific
position within a grid.
- **`plt.subplots()`**: This function creates multiple subplots at once,
returning both the figure and axes objects.

#### 1. **`plt.subplot()`**
The `subplot()` function divides the figure into a grid and places a
subplot in a specific position within that grid.

**Syntax**:
```python
plt.subplot(nrows, ncols, index)
```

- `nrows`: Number of rows in the grid.


- `ncols`: Number of columns in the grid.
- `index`: Index of the subplot to create (counting starts from 1).

**Example**:
```python
import matplotlib.pyplot as plt

# Create a 2x2 grid of subplots and plot different graphs in each.


plt.subplot(2, 2, 1) # Row 1, Column 1
plt.plot([1, 2, 3], [1, 4, 9])
plt.title('Plot 1')
plt.subplot(2, 2, 2) # Row 1, Column 2
plt.plot([1, 2, 3], [9, 4, 1])
plt.title('Plot 2')

plt.subplot(2, 2, 3) # Row 2, Column 1


plt.plot([1, 2, 3], [1, 2, 3])
plt.title('Plot 3')

plt.subplot(2, 2, 4) # Row 2, Column 2


plt.plot([1, 2, 3], [3, 2, 1])
plt.title('Plot 4')

plt.tight_layout() # Adjusts layout to avoid overlap


plt.show()
```

**Output**: A 2x2 grid with four different plots.

#### 2. **`plt.subplots()`**
The `subplots()` function creates a grid of subplots and returns both the
**figure** and **axes** objects. This is a more flexible and modern
approach compared to `plt.subplot()`, especially when working with
multiple subplots.

**Syntax**:
```python
fig, axes = plt.subplots(nrows, ncols)
```

- `nrows`: Number of rows in the grid.


- `ncols`: Number of columns in the grid.
- `fig`: The figure object.
- `axes`: A 2D array of axes objects (or a 1D array if there's only one
row or column).

**Example**:
```python
import matplotlib.pyplot as plt

# Create a 2x2 grid of subplots


fig, axes = plt.subplots(2, 2)

# Plot on the first subplot


axes[0, 0].plot([1, 2, 3], [1, 4, 9])
axes[0, 0].set_title('Plot 1')

# Plot on the second subplot


axes[0, 1].plot([1, 2, 3], [9, 4, 1])
axes[0, 1].set_title('Plot 2')

# Plot on the third subplot


axes[1, 0].plot([1, 2, 3], [1, 2, 3])
axes[1, 0].set_title('Plot 3')

# Plot on the fourth subplot


axes[1, 1].plot([1, 2, 3], [3, 2, 1])
axes[1, 1].set_title('Plot 4')

plt.tight_layout() # Adjusts layout to avoid overlap


plt.show()
```

**Output**: A 2x2 grid with four different plots.

### Benefits of `plt.subplots()` over `plt.subplot()`


- **Better Organization**: With `plt.subplots()`, the axes are returned as
a 2D array, making it easy to access each subplot programmatically.
- **Flexibility**: You can create more complex subplot layouts, such as
grids of different sizes.
- **Cleaner Code**: `plt.subplots()` automatically handles figure
creation and axes layout, reducing the need to call `plt.figure()` and
`plt.subplot()` repeatedly.

### 3. **Advanced Customizations**

You can further customize the appearance of subplots using:


- **`figsize`**: Set the size of the figure (width, height) when calling
`plt.subplots()`.
- **`tight_layout()`**: Adjusts spacing between subplots to avoid
overlap.
- **Sharing Axes**: You can share axes between subplots using the
`sharex` and `sharey` parameters in `plt.subplots()`.

#### Example with `figsize`, `tight_layout()`, and `sharex`/`sharey`:


```python
import matplotlib.pyplot as plt

# Create a 2x2 grid of subplots with shared x and y axes


fig, axes = plt.subplots(2, 2, figsize=(10, 6), sharex=True, sharey=True)

# Plot on the first subplot


axes[0, 0].plot([1, 2, 3], [1, 4, 9])
axes[0, 0].set_title('Plot 1')

# Plot on the second subplot


axes[0, 1].plot([1, 2, 3], [9, 4, 1])
axes[0, 1].set_title('Plot 2')

# Plot on the third subplot


axes[1, 0].plot([1, 2, 3], [1, 2, 3])
axes[1, 0].set_title('Plot 3')

# Plot on the fourth subplot


axes[1, 1].plot([1, 2, 3], [3, 2, 1])
axes[1, 1].set_title('Plot 4')

plt.tight_layout() # Adjust layout to avoid overlap


plt.show()
```

5 3 5 5
15 Define annotations in the context of data visualization using Matplotlib
and briefly explain the types of annotations used.

In the context of data visualization, annotations in Matplotlib are


used to add explanatory text, labels, arrows, or other elements to
the plot to provide additional context or emphasize important
points. Annotations help to make the plot more informative,
guiding the audience's attention to specific details or key data
points.
Annotations can be used to:

 Explain the meaning of data points


 Highlight specific trends or patterns
 Add contextual information (e.g., labels, titles, or
descriptions)
 Provide insight into outliers or unusual data points

Key Types of Annotations in Matplotlib

1. Text Annotations: Text annotations are used to place text


at specific locations within the plot to describe points,
trends, or any other important aspect.

Syntax:
plt.text(x, y, 'Text', fontsize=12,
color='red', ha='center', va='center')

o x, y: Coordinates where the text will be placed.


o 'Text': The actual text to display.
o fontsize: Font size of the text.
o color: Color of the text.
o ha, va: Horizontal and vertical alignment of the
text.

Example:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)
plt.text(3, 20, 'This is a point', fontsize=12,
color='blue', ha='left')
plt.show()

2. Arrow Annotations: Arrow annotations help direct


attention to a specific point or region of the plot. These are
useful when you want to point out a specific feature in the
graph.

Syntax:
plt.annotate('Text', xy=(x, y),
xytext=(x_offset, y_offset),
arrowprops=dict(facecolor='blue', arrowstyle='->'))

o xy: The coordinates of the point to annotate.


o xytext: The coordinates of the text.
o arrowprops: A dictionary that specifies properties
of the arrow (e.g., color, style).

Example:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)
plt.annotate('Point (3, 9)', xy=(3, 9),
xytext=(4, 10),
arrowprops=dict(facecolor='red',
arrowstyle='->'))
plt.show()

3. Bounding Box Annotations: A bounding box is a box


drawn around the annotation text to highlight it. This is
helpful to make sure the text stands out clearly against the
background.

Syntax:
plt.text(x, y, 'Text',
bbox=dict(facecolor='yellow', alpha=0.5))

o bbox: A dictionary specifying the box properties


(e.g., facecolor, alpha for transparency).

Example:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)
plt.text(3, 20, 'This is a point', fontsize=12,
color='blue', bbox=dict(facecolor='yellow',
alpha=0.5))
plt.show()

4. Multiple Annotations with plt.annotate(): The


annotate() function is versatile and can also be used to
annotate multiple points on the same plot. You can specify
the text and coordinates dynamically, creating a more
detailed and interactive plot.

Example:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)

# Annotating multiple points


for i in range(len(x)):
    plt.annotate(f'({x[i]}, {y[i]})', xy=(x[i], y[i]), xytext=(x[i]+0.1, y[i]+1),
                 arrowprops=dict(facecolor='green', arrowstyle='->'))

plt.show()

5. Highlighting Specific Points: Annotations can also be


used to highlight specific points with different markers or
styles (e.g., circles, squares, etc.).

Example:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)
plt.scatter([3], [9], color='red', s=100)  # Highlight a specific point
plt.text(3, 9, 'Highlighted Point',
fontsize=12, color='black', ha='center',
va='center')
plt.show()

Part – C (2 x 10 = 20 Marks)
Instructions: Answer ALL questions.

Q. Question Marks BL CO PO PI
No Code
10 2 3 5
16 a Discuss the major challenges encountered while working with large
datasets and how these challenges impact data preprocessing, storage,
and analysis.

Working with large datasets presents several challenges that can


significantly impact the stages of data preprocessing, storage,
and analysis. These challenges can arise from the sheer volume
of data, the variety of data types, and the complexity of
processing that data. Below are the major challenges encountered
when working with large datasets and their impact on the
different stages of data handling:

1. Data Preprocessing Challenges


a) Memory Limitations

 Challenge: Large datasets can easily exceed the available


memory (RAM) on a typical machine. Loading the entire
dataset into memory may lead to crashes or significant
slowdowns in the system.
 Impact: This directly affects preprocessing tasks such as
cleaning, transforming, and normalizing data. For
example, operations like removing duplicates, filling
missing values, and encoding categorical features may
become difficult or impossible to execute on the full
dataset.
 Solution:
o Chunking: Process data in smaller chunks that fit
into memory.
o Dask or Vaex: Use out-of-core libraries designed
to handle large datasets that don't fit in memory.
o Efficient Data Types: Reduce memory
consumption by using more compact data types
(e.g., float32 instead of float64).
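A minimal sketch of the chunking approach mentioned above (the file name and column are
assumptions for illustration):
import pandas as pd

total = 0
# Read the CSV in 100,000-row chunks instead of loading it all at once
for chunk in pd.read_csv('large_dataset.csv', chunksize=100_000):
    total += chunk['amount'].sum()   # process each chunk, keep a running result
print(total)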
b) Data Quality and Inconsistencies

 Challenge: Large datasets often contain missing values,


outliers, duplicate entries, or inconsistent formats.
 Impact: Cleaning and handling missing or inconsistent
data can be very time-consuming and complex in large
datasets. Inconsistent data may affect data preprocessing
tasks such as imputation, encoding, and scaling.
 Solution:
o Automated Data Cleaning: Use automated scripts
or libraries (e.g., pandas, numpy) to handle missing
data, duplicates, and outliers in bulk.
o Distributed Processing: Utilize distributed
frameworks (e.g., Apache Spark, Dask) to clean
data across multiple nodes.

c) Data Transformation Complexity

 Challenge: Transforming large datasets—such as scaling,


normalizing, encoding, or feature engineering—can be
resource-intensive and slow.
 Impact: Time-consuming transformations on large
datasets can delay analysis and modeling processes.
Complex transformations may also require more
computational power.
 Solution:
o Parallel Processing: Use libraries that support
parallel processing (e.g., joblib, Dask).
o Incremental Learning: Use algorithms that
support incremental learning or mini-batch
processing (e.g., Stochastic Gradient Descent
(SGD), Naive Bayes).

2. Storage Challenges
a) Storage Capacity

 Challenge: Large datasets can take up significant storage


space, which might exceed local storage capacity.
 Impact: Storing large amounts of data can be costly,
particularly if data needs to be stored in high-performance
formats for quick access. It may also slow down
read/write operations.
 Solution:
o Data Compression: Compress data using formats
like Parquet, ORC, or HDF5, which reduce the
storage size without losing data.
o Distributed Storage: Use cloud-based storage
systems (e.g., Amazon S3, Google Cloud
Storage) or distributed file systems like HDFS
(Hadoop Distributed File System) for large-scale
storage.
o Efficient Data Formats: Use binary file formats
like Parquet and Feather for efficient storage and
fast access.
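A minimal sketch of writing and reading a compressed columnar file with pandas (requires
the pyarrow or fastparquet package; the file name is an assumption):
import pandas as pd

df = pd.DataFrame({'id': range(5), 'value': [1.0, 2.5, 3.1, 4.7, 5.2]})

df.to_parquet('data.parquet')          # column-wise, compressed storage
df2 = pd.read_parquet('data.parquet')  # read it back
print(df2.head())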

b) Data Integration and Formats

 Challenge: Large datasets often come from multiple


sources and in various formats, such as CSV, JSON,
XML, or databases.
 Impact: Merging or integrating data from heterogeneous
sources can introduce additional complexity, and working
with multiple formats may require additional
preprocessing steps like parsing, converting, or
standardizing formats.
 Solution:
o Data Lakes: Use data lakes to store large
volumes of raw, unstructured data and then
process it as needed.
o ETL (Extract, Transform, Load): Implement
ETL processes to transform data into a consistent
format for analysis.

3. Analysis Challenges
a) Slow Computation and Processing Time

 Challenge: Analyzing large datasets (e.g., performing


complex calculations, aggregations, or machine learning
model training) requires considerable computational
power.
 Impact: Data analysis can take a long time and may result
in bottlenecks, especially if the data cannot be processed
in parallel or distributed across multiple nodes.
 Solution:
o Distributed Computing: Use distributed
frameworks like Apache Spark, Dask, or Hadoop
that can distribute tasks across multiple nodes and
process the data in parallel.
o Sampling: If full analysis isn't feasible, use
sampling techniques to work with a subset of the
data.
o GPU Acceleration: Use GPUs to speed up
computation for large-scale machine learning and
deep learning tasks.
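A minimal sampling sketch with pandas (the DataFrame and sampling fraction are chosen
arbitrarily for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({'value': np.random.rand(1_000_000)})   # hypothetical large dataset

# Work with a 10% random sample instead of the full data
sample_df = df.sample(frac=0.10, random_state=42)
print(len(sample_df))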

b) Modeling and Scalability

 Challenge: Machine learning models might struggle with


very large datasets in terms of both training time and the
ability to scale.
 Impact: The time to train machine learning models on
large datasets can become prohibitive. Furthermore, not
all machine learning algorithms are optimized for large-
scale data, and some may require adjustments to handle
them.
 Solution:
o Mini-Batch Processing: Use algorithms that
support mini-batch training (e.g., Stochastic
Gradient Descent, Neural Networks).
o Distributed Machine Learning: Use frameworks
like Apache Spark MLlib, TensorFlow, or
PyTorch with distributed training capabilities.

c) Data Shuffling and Random Access

 Challenge: For machine learning, random access to large


datasets and the ability to shuffle data efficiently are
important for model training (to avoid overfitting).
 Impact: Large datasets are difficult to shuffle efficiently
in memory, and loading random samples may require
specialized techniques.
 Solution:
o Data Generators: Use data generators that allow
you to load data in batches (e.g., Keras data
generators).
o Indexing and Preprocessing: Preprocess and
index data in a way that enables fast access during
training.

4. Security and Privacy Concerns


a) Data Security

 Challenge: Large datasets may contain sensitive personal


or business data. Managing access and ensuring secure
storage is vital.
 Impact: Handling sensitive data in large quantities
increases the risk of breaches, which can have legal and
ethical consequences.
 Solution:
o Encryption: Encrypt sensitive data both at rest
and in transit.
o Access Control: Implement strict access control
policies, including role-based access and auditing.

b) Privacy Issues

 Challenge: Large datasets often involve personal or


confidential data, making it difficult to ensure compliance
with privacy regulations (e.g., GDPR, HIPAA).
 Impact: Processing and storing data in compliance with
privacy laws can be complex, especially when working
with large datasets.
 Solution:
o Data Anonymization: Anonymize or
pseudonymize sensitive data before processing or
analysis.
o Compliance Frameworks: Implement
frameworks and policies to ensure compliance
with data protection laws

(OR)

10 3 3 5
16 b Explain the concept of data wrangling and discuss the key steps
involved in the data wrangling process and the importance of each step.
Data wrangling (also known as data munging) is the process of
transforming and mapping raw data into a more useful and
accessible format for analysis. It involves cleaning, restructuring,
and enriching raw data from various sources to make it suitable
for analysis and decision-making. Data wrangling is often
considered one of the most time-consuming and important tasks
in the data analysis pipeline.

Key Steps in the Data Wrangling Process

1. Data Collection
o Description: Gathering data from various sources
(e.g., databases, flat files like CSV, JSON, XML,
APIs, web scraping, or sensor data).
o Importance: This is the foundational step where
the raw data is gathered. It sets the stage for all
subsequent steps in the wrangling process.
o Challenges: Data could be incomplete, in
inconsistent formats, or in a form that is difficult to
analyze.
o Tools: APIs, web scraping tools (e.g.,
BeautifulSoup), SQL queries, data import
functions in libraries (e.g., pandas.read_csv()).
2. Data Inspection/Exploration
o Description: Inspecting the dataset to understand
its structure, content, and identify any problems
such as missing values, duplicates, or incorrect
formats.
o Importance: This step helps to get a feel for the
data and ensures that any issues or anomalies are
identified before any transformations are done.
o Challenges: Data might be large, unstructured, or
might contain inconsistencies that are hard to
detect manually.
o Tools: pandas (e.g., df.info(), df.describe(),
df.head()), matplotlib, seaborn (for
visualization), or any other exploratory data
analysis (EDA) tool.
3. Data Cleaning
o Description: Removing or correcting any errors in
the data, such as missing values, duplicates,
inconsistent data types, or outliers.
o Importance: Cleaning ensures the accuracy and
quality of the data. Poor-quality data can lead to
misleading results in analysis or modeling.
o Challenges: Dealing with missing values,
correcting inconsistent data entries, handling noisy
data.
o Tools: pandas (fillna(), dropna(),
drop_duplicates(), astype()), numpy (e.g.,
np.nan for missing values).
4. Data Transformation
o Description: Transforming the data into a more
suitable format for analysis. This may involve
normalizing or scaling numerical values,
converting categorical variables to numerical ones,
or reshaping the data.
o Importance: Transformations help prepare the
data for various types of analysis or modeling.
Some algorithms require data to be in a specific
format (e.g., scaling for neural networks).
o Challenges: Applying the right transformations
can be complex, especially with heterogeneous
data types (e.g., mixing categorical and numerical
data).
o Tools: pandas (e.g., pd.get_dummies() for one-
hot encoding, StandardScaler from sklearn for
scaling), numpy for mathematical transformations.
5. Data Integration
o Description: Combining data from multiple
sources or datasets, ensuring that the combined
data is consistent and compatible.
o Importance: Many datasets are spread across
different sources. Integration allows data from
these sources to be merged into a single dataset for
analysis.
o Challenges: Merging datasets may introduce
discrepancies (e.g., mismatched keys, inconsistent
formats) that need to be resolved.
o Tools: pandas (e.g., merge(), concat()), SQL
join operations, or using ETL tools for larger
datasets.
6. Data Enrichment
o Description: Enhancing the dataset with additional
information, such as external data sources or
creating new features.
o Importance: Enriching the data helps improve the
quality and comprehensiveness of the dataset,
allowing for more insightful analysis.
o Challenges: Adding external data can introduce its
own inconsistencies or issues like missing values.
o Tools: APIs, web scraping, and additional datasets
from open data repositories.
7. Data Formatting
o Description: Converting data into the required
format, such as ensuring that numerical columns
are numeric and categorical columns are properly
labeled.
o Importance: Correct formatting is essential for the
subsequent steps in the analysis or modeling
pipeline.
o Challenges: Ensuring all columns are consistently
formatted, especially when dealing with large
datasets with diverse data types.
o Tools: pandas for type casting (e.g.,
df['column'].astype(int)), str functions for
string manipulation.
8. Data Sampling/Resampling (if needed)
o Description: Reducing the dataset size by
sampling a subset of data (if the dataset is too
large) or balancing the dataset (e.g., in
classification problems with imbalanced classes).
o Importance: Sampling can reduce the
computational complexity and speed up the
analysis, while resampling ensures that models are
not biased due to class imbalances.
o Challenges: Ensuring that the sample is
representative of the full dataset and that
resampling does not distort the underlying
patterns.
o Tools: pandas (e.g., df.sample()), imblearn for
oversampling/undersampling.
9. Data Validation
o Description: Ensuring that the cleaned,
transformed, and integrated data meets the
requirements of the analysis or machine learning
models.
o Importance: Validation ensures that the dataset is
accurate, complete, and ready for use in the next
stage of analysis or modeling.
o Challenges: Performing robust validation,
especially with large datasets, can be difficult and
time-consuming.
o Tools: Manual checks, statistical methods, or
automated validation scripts.

The Importance of Each Step in Data Wrangling

1. Data Collection: Ensures the right data is obtained for


analysis.
2. Data Inspection/Exploration: Helps uncover patterns
and potential issues with the data early on.
3. Data Cleaning: Improves data quality, preventing errors
from affecting the analysis.
4. Data Transformation: Prepares the data in a format
suitable for modeling and analysis.
5. Data Integration: Combines disparate data sources,
enabling a more comprehensive analysis.
6. Data Enrichment: Enhances the dataset with additional
useful information, improving the depth of the analysis.
7. Data Formatting: Ensures the data is in a consistent
format, which is essential for correct processing and
analysis.
8. Data Sampling/Resampling: Helps manage large
datasets and deal with class imbalances for more accurate
and efficient analysis.
9. Data Validation: Ensures the data meets quality and
consistency requirements before further processing.

Challenges in Data Wrangling

 Data Volume: Large datasets can make it difficult to


perform operations like cleaning, transformation, and
validation efficiently.
 Data Quality: Handling missing values, duplicates, and
inconsistent data can be time-consuming.
 Data Compatibility: Different datasets may have
different formats or schemas, requiring substantial work to
merge or integrate them.

17 a i) Write a Python program to Create a pie chart using 10 2 4 5


Matplotlib showing the percentage distribution of students
enrolled in different courses (e.g., Python, Java, C++, AI).

import matplotlib.pyplot as plt

# Data for the pie chart: course names and the number of students
courses = ['Python', 'Java', 'C++', 'AI']
students = [150, 120, 90, 60]

# Create a pie chart


plt.figure(figsize=(7,7))
plt.pie(students, labels=courses, autopct='%1.1f%%',
startangle=140,
colors=['#ff9999','#66b3ff','#99ff99','#ffcc99'])

# Title of the pie chart


plt.title('Percentage Distribution of Students Enrolled in Different Courses')

# Display the pie chart


plt.show()

Explanation:

 Data: The students list represents the number of


students enrolled in each course (Python, Java, C++, AI).
 Pie Chart: The plt.pie() function is used to create the
pie chart. The autopct='%1.1f%%' argument displays the
percentage values on the chart with one decimal point.
 Colors: The colors argument is used to customize the
colors of each segment of the pie chart.
 Title: plt.title() sets the title of the chart.

Output:
This will display a pie chart showing the percentage distribution
of students in the Python, Java, C++, and AI courses.

ii) Write a Python program to draw a simple line graph using Matplotlib to represent the number of visitors to a website over 7 days.

Here's a Python program that draws a simple line graph using Matplotlib to represent the number of visitors to a website over 7 days:
import matplotlib.pyplot as plt

# Data for the line graph: Days of the week and number of visitors
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
visitors = [120, 150, 170, 140, 160, 190, 200]

# Create a line graph
plt.plot(days, visitors, marker='o', linestyle='-', color='b')

# Title and labels
plt.title('Website Visitors Over 7 Days')
plt.xlabel('Days of the Week')
plt.ylabel('Number of Visitors')

# Display the line graph
plt.grid(True)
plt.show()

Explanation:

 Data: The days list represents the days of the week, and
the visitors list represents the number of visitors to the
website for each corresponding day.
 Line Graph: The plt.plot() function is used to plot the
line graph.
o marker='o' adds a marker at each data point (a
circle in this case).
o linestyle='-' ensures that the points are
connected with a line.
o color='b' sets the line color to blue.
 Title and Labels: plt.title(), plt.xlabel(), and
plt.ylabel() are used to set the title and axis labels.
 Grid: plt.grid(True) adds a grid to the graph to make it
easier to read the values.

Output:
This will display a line graph representing the number of visitors
to a website over the span of 7 days (Monday to Sunday).

(OR)
17 b (Marks: 10, BL: 3, CO: 5, PO: 5)
i) Define Seaborn. How does it differ from Matplotlib?
Write a Python program to draw a scatter plot using Seaborn
showing the relationship between height and weight of
individuals.

What is Seaborn?
Seaborn is a Python visualization library built on top of
Matplotlib. It provides a high-level interface for drawing
attractive and informative statistical graphics. Seaborn comes
with several built-in themes and color palettes that make it easy to
generate aesthetically pleasing plots with minimal code.

Differences between Seaborn and Matplotlib

1. Ease of Use:
o Matplotlib: While powerful and highly
customizable, Matplotlib requires more lines of
code to generate common statistical plots. It is
great for creating basic and complex plots but can
be verbose.
o Seaborn: It is built to simplify the process of
creating complex visualizations, especially for
statistical data. It provides high-level functions that
automatically handle many details, such as axes
labels, legends, color schemes, etc.
2. Style and Aesthetics:
o Matplotlib: While Matplotlib can generate a wide
range of plots, the default style is relatively basic.
Customizing the appearance (e.g., changing colors,
themes) requires extra work.
o Seaborn: It comes with built-in themes, color
palettes, and automatic formatting, making it much
easier to generate more visually appealing plots
with minimal customization.
3. Statistical Plotting:
o Matplotlib: It is primarily focused on general
plotting but does not offer built-in support for
statistical visualizations (e.g., heatmaps, regression
plots).
o Seaborn: It includes specialized functions for
creating statistical plots like regression plots, box
plots, violin plots, and heatmaps, making it ideal
for exploratory data analysis (a one-line example is
sketched after this list).
4. Integration with Pandas:
o Matplotlib: While it can work with Pandas
DataFrames, it doesn't provide direct support for
DataFrame operations.
o Seaborn: It works seamlessly with Pandas
DataFrames and provides functions that directly
accept DataFrame columns as input.
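To make point 3 above concrete, here is a brief hedged sketch (added for clarity, not part of the original answer); the small x/y values are assumed. A single Seaborn call draws the scatter points, a fitted regression line, and a confidence band, which would take noticeably more code with Matplotlib alone.

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Assumed toy data for the illustration
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]})

# One call: scatter points, fitted regression line, and confidence band
sns.regplot(x='x', y='y', data=df)
plt.title('Seaborn regplot: statistical plot in one line')
plt.show()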

Python Program: Scatter Plot using Seaborn


Here’s a Python program to draw a scatter plot using Seaborn to
show the relationship between the height and weight of
individuals:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Sample data for height and weight
data = {
    'Height': [5.5, 6.1, 5.8, 5.9, 6.0, 5.4, 5.7, 6.2, 5.6, 5.8],
    'Weight': [150, 180, 165, 170, 175, 160, 155, 185, 168, 162]
}

# Create a DataFrame from the data
df = pd.DataFrame(data)

# Create a scatter plot using Seaborn
sns.scatterplot(x='Height', y='Weight', data=df)

# Title and labels
plt.title('Scatter Plot: Height vs Weight')
plt.xlabel('Height (in feet)')
plt.ylabel('Weight (in lbs)')

# Display the plot
plt.show()

Explanation:

1. Data: The data dictionary contains two lists: Height and Weight, which represent the height (in feet) and weight (in pounds) of individuals.
2. DataFrame: The pd.DataFrame(data) converts the
dictionary into a Pandas DataFrame, making it easy to
work with Seaborn.
3. Scatter Plot: The sns.scatterplot() function is used to
create a scatter plot. The x and y arguments specify which
columns to plot on the x and y axes, and the data
argument specifies the DataFrame to use.

ii) Write a Python program using Seaborn to create a histogram that displays the distribution of students' exam scores. Customize the bin size and add color.

Here's a Python program using Seaborn to create a histogram that displays the distribution of students' exam scores, with customized bin size and color:
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data: Exam scores of students
exam_scores = [95, 88, 76, 85, 92, 99, 78, 85, 93,
               89, 84, 91, 77, 80, 94, 79, 87, 82, 83, 90]

# Create a histogram using Seaborn
sns.histplot(exam_scores, bins=8, kde=False, color='skyblue', edgecolor='black')

# Title and labels
plt.title('Distribution of Students\' Exam Scores')
plt.xlabel('Exam Scores')
plt.ylabel('Frequency')

# Display the plot
plt.show()

Explanation:

1. Data: The list exam_scores contains the exam scores of 20 students.
2. sns.histplot():
o bins=8: This customizes the number of bins in the
histogram to 8.
o kde=False: Disables the Kernel Density Estimate
(KDE) plot, which would otherwise show a
smooth curve over the histogram (a kde=True
variant is sketched after this list).
o color='skyblue': Sets the color of the bars to a
light blue.
o edgecolor='black': Adds a black border to the
histogram bars for better visibility.
3. Title and Labels: The plt.title(), plt.xlabel(), and
plt.ylabel() functions are used to set the title and axis
labels for the plot.
4. Display: plt.show() displays the histogram.
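As a small illustrative variant (an addition, not part of the original answer), setting kde=True overlays the smooth density curve described above on the same histogram:

import seaborn as sns
import matplotlib.pyplot as plt

exam_scores = [95, 88, 76, 85, 92, 99, 78, 85, 93,
               89, 84, 91, 77, 80, 94, 79, 87, 82, 83, 90]

# Same histogram as above, but with a KDE curve drawn over the bars
sns.histplot(exam_scores, bins=8, kde=True, color='skyblue', edgecolor='black')
plt.title('Distribution of Students\' Exam Scores (with KDE)')
plt.xlabel('Exam Scores')
plt.ylabel('Frequency')
plt.show()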

Course Outcome (CO) and Bloom’s level (BL) Coverage in Questions:

[Bar chart: CO Coverage — 53%, 26%, and 21% across CO 1, CO 2, and CO 3]
