Data Science Papers
Course Articulation Matrix:
Course Outcome  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1             -   -   -   -   1   -   -   -   -   -    -    -
CO2             -   -   -   -   1   -   -   -   -   -    -    -
Note: CO1 - To understand the relationship between data
CO2 - Identify the different data structures to represent data
Part – A
(5 x 2 = 10 Marks)
Answer ALL the questions
Q.No Question Marks BL CO PO PI.Code
1 How do you concatenate two NumPy arrays along a specified axis? 2 2 1 5 5.4.1
Use numpy.concatenate() to concatenate two
NumPy arrays along a specified axis.
Example code:
import numpy as np
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6]])
result = np.concatenate((arr1, arr2), axis=0)
# Concatenates along rows
print(result)
[Charts: BL Coverage Percentage (BL1 16%, BL2 48%, BL3 36%) and CO Coverage bar chart for CO1 and CO2]
Course Articulation Matrix:
Course Outcome  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1             -   -   -   -   1   -   -   -   -   -    -    -
CO2             -   -   -   -   1   -   -   -   -   -    -    -
Note: CO1 - To understand the relationship between data
CO2 - Identify the different data structures to represent data
Part – A
(5 x 2 = 10 Marks)
Answer ALL the questions
Q.No Question Marks BL CO PO PI.Code
1 Given the NumPy array arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), write the code to extract the second column as a 1D array. 2 3 2 5 5.4.2
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
second_column = arr[:, 1]  # all rows, column index 1
print(second_column)
Output:
[2 5 8]
2 How do you select a column from a Pandas DataFrame? Write the code. 2 1 2 5 5.4.1
import pandas as pd
# Create a DataFrame
data = {'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]}
df = pd.DataFrame(data)
column_b = df['B']  # select column 'B' as a Series
print(column_b)
Output
0 2
1 5
2 8
Name: B, dtype: int64
3 Mention two sources from which data can be acquired for analysis. 2 1 1 5 5.5.1
Two common sources from which data can be acquired for
analysis are:
1. Web APIs
o Many online services provide APIs to
fetch structured data in formats like JSON
or XML.
o Example: Twitter API for social media
analysis, OpenWeather API for weather
data, and financial APIs for stock market
data.
2. Public Datasets and Open Data Portals
o Governments, research organizations, and
companies provide free datasets for public
use.
o Example: Kaggle
(https://www.kaggle.com/datasets),
Google Dataset Search, and UCI Machine
Learning Repository
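For illustration, a minimal sketch of API-based acquisition using the requests library; the URL below is a placeholder, not a real endpoint:
import requests

# Hypothetical endpoint returning JSON (placeholder URL)
url = "https://api.example.com/weather?city=Chennai"
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on HTTP errors
data = response.json()       # parse the JSON payload into a dict
print(data)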
4 Write a Python program to add, subtract, multiply and divide two Pandas Series. 2 2 1 5 5.4.2
Sample Series: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]
import pandas as pd
s1 = pd.Series([2, 4, 6, 8, 10])
s2 = pd.Series([1, 3, 5, 7, 9])
addition = s1 + s2
subtraction = s1 - s2
multiplication = s1 * s2
division = s1 / s2
# Display results
print("Addition:\n", addition)
print("\nSubtraction:\n", subtraction)
print("\nMultiplication:\n", multiplication)
print("\nDivision:\n", division)
Output
Addition:
0 3
1 7
2 11
3 15
4 19
dtype: int64
Subtraction:
0 1
1 1
2 1
3 1
4 1
dtype: int64
Multiplication:
0 2
1 12
2 30
3 56
4 90
dtype: int64
Division:
0 2.000000
1 1.333333
2 1.200000
3 1.142857
4 1.111111
dtype: float64
5 What are Web APIs and how are they used in Data Acquisition? 2 2 1 5 5.4.1
Part – B
(3 x 5 = 15 Marks)
(Diagram - 1 mark)
Explanation of each stage (4 marks)
2 You're tasked with exploring a large dataset using Pandas. You suspect there might be a relationship between two columns: 'age' (numerical) and 'purchase_category' (categorical). Describe how you would use Pandas to investigate this potential relationship. Mention TWO specific Pandas functions you would use and explain their purpose in this context. 5 2 2 5 5.5.1
import pandas as pd

# Sample DataFrame
data = {'age': [25, 34, 45, 23, 41, 36, 29, 50],
        'purchase_category': ['Electronics', 'Clothing', 'Electronics',
                              'Books', 'Books', 'Clothing', 'Electronics', 'Books']}
df = pd.DataFrame(data)
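The key stops at the sample data; a hedged sketch of one way to investigate the relationship, using groupby() (compare the age distribution per category) and pd.cut() with pd.crosstab() (bucket ages and tabulate them against categories):
# Function 1: groupby() summarizes 'age' within each purchase category
print(df.groupby('purchase_category')['age'].describe())

# Function 2: pd.cut() + pd.crosstab() bins ages, then cross-tabulates
age_bins = pd.cut(df['age'], bins=[20, 30, 40, 50])
print(pd.crosstab(age_bins, df['purchase_category']))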
[Charts: BL Coverage Percentage (BL1 16%, BL2 48%, BL3 36%) and CO Coverage bar chart for CO1 and CO2]
Course Articulation Matrix:
Course Outcome  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1             -   -   -   -   1   -   -   -   -   -    -    -
CO2             -   -   -   -   1   -   -   -   -   -    -    -
Note: CO1 - To understand the relationship between data
CO2 - Identify the different data structures to represent data
Part – A
(5 x 2 = 10 Marks)
Answer ALL the questions
Q.No Question Marks BL CO PO PI.Code
1 What is the goal of the "exploratory data analysis" phase? 2 1 1 5 5.6.1
2 Write the syntax to create a 1D NumPy array from a Python list. 2 1 1 5 5.6.1
3 Why are NumPy arrays more efficient than Python lists for numerical operations? 2 2 1 5 5.4.1
4 Compare a Python list and a Pandas Series. 2 2 2 5 5.4.1
5 How would you display the first five rows of a DataFrame? 2 2 2 5 5.4.1
Part – B
(3 x 5 = 15 Marks)
[Charts: BL Coverage Percentage (BL2 48%, BL3 36%) and CO Coverage bar chart for CO1 and CO2]
Key:
2. Write the syntax to create a 1D NumPy array from a Python list.
import numpy as np
# Creating a 1D NumPy array from a Python list
my_list = [1, 2, 3, 4, 5]  # 1 Mark
np_array = np.array(my_list)  # 1 Mark
print(np_array)
3. Why are NumPy arrays more efficient than Python lists for numerical operations?
NumPy is faster and more memory-efficient than Python lists because of contiguous memory storage, vectorized operations (operations are applied to all elements of an array without explicit Python loops), broadcasting, and an optimized low-level backend (NumPy uses BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage), which are highly optimized libraries). Any two explanations, 1 Mark each.
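To make the vectorization point concrete, a small sketch comparing a Python loop with the equivalent NumPy operation (timings will vary by machine):
import time
import numpy as np

data = list(range(1_000_000))
arr = np.arange(1_000_000)

# Pure-Python loop: one interpreted operation per element
start = time.perf_counter()
squared_list = [x * x for x in data]
print("list loop:", time.perf_counter() - start, "s")

# NumPy vectorized: one call, the loop runs in compiled code
start = time.perf_counter()
squared_arr = arr * arr
print("numpy vectorized:", time.perf_counter() - start, "s")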
5. The first five rows of a Pandas DataFrame can be displayed using the .head() method, e.g. df.head(). 2 Marks
Part B
1. Explain the different facets of data in Data Science with suitable examples.
Very large amounts of data are generated in big data and data science. This data comes in various types, and the main categories are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
(Any five, 1 Mark each with appropriate explanation)
2.
Ans: Data Science primarily involves methods to collect raw data from various sources, including sensors, databases, APIs, and manual inputs.
Data collection methods include primary data and secondary data.
Primary data:
Direct Personal Investigation:
Indirect Oral Investigation:
Information from Local Sources or Correspondents
Information through Questionnaires and Schedules
Mailing Method
Enumerator’s Method
(Any 3 methods with explanation: 3 x 1 = 3 Marks)
Secondary data
Course Articulation Matrix:
Course Outcome  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1             -   -   -   -   1   -   -   -   -   -    -    -
CO2             -   -   -   -   1   -   -   -   -   -    -    -
Note: CO1 - To understand the relationship between data
CO2 - Identify the different data structures to represent data
Part – A
(5 x 2 = 10 Marks)
Answer ALL the questions
Q.No Question Marks BL CO PO PI.Code
What are the uses of NumPy?
(Q3 — 2 2 2 5 5.4.1)
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
result = np.concatenate((a, b), axis=0)
print(result)
Alternatively, hstack() and vstack() can be used, as sketched below.
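A brief sketch of the stacking helpers, using the same a and b shapes as above:
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

# vstack: stack row-wise (equivalent to concatenate with axis=0)
print(np.vstack((a, b)))

# hstack: stack column-wise; shapes must agree along the rows
c = np.array([[5], [6]])
print(np.hstack((a, c)))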
(Q5 — 2 3 2 5 5.4.1)
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)
sorted_df = df.sort_values(by=['Age', 'Salary'], ascending=[True, False])
print(sorted_df)
Part – B
(3 x 5 = 15 Marks)
from bs4 import BeautifulSoup  # correct import for BeautifulSoup
import requests

url = "https://www.example.com/product"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
price = soup.find('span', {'class': 'price'}).text
print(price)
(Q9 — 5 4 2 5 5.5.1)
import pandas as pd
df = pd.DataFrame({'ID': [101, 102, 103, 104],
                   'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40],
                   'Salary': [50000, 60000, 70000, 80000]})
3 What change should be made to the following code to perform column-wise concatenation? concat_df = pd.concat([df1, df2], -----------) 1 1 3 5 1.4.1
A. concat_df = pd.concat([df1, df2], axis=2)
B. concat_df = pd.concat([df1, df2], axis=1)
C. concat_df = pd.concat([df1, df2], axis=0)
D. concat_df = pd.concat([df1, df2], axis='TRUE')
4 Which of the following libraries is not primarily involved in handling large volumes of data? 1 2 3 5 1.4.1
A. Cython
B. Numexpr
C. Numba
D. Seaborn
5 Which of the following statements is true regarding data structures? 1 2 3 5 1.4.1
A) Data structures have the same storage requirements for all types.
B) Data structures influence the performance of CRUD operations (create,
read, update, and delete).
C) Data structures only affect the storage and not the performance of
operations.
D) Data structures do not affect the performance of CRUD operations.
6 Which of the following is the correct syntax for creating a subplot with 2 rows and 3 columns in the first position? 1 1 4 5 1.4.1
A) plt.subplot(2, 3, 0)
B) plt.subplot(3, 2, 1)
C) plt.subplot(2, 3, 1)
D) plt.subplot(1, 2, 3)
7 Which of the following creates 100 evenly spaced values between 0 and 10? 1 1 4 5 1.4.1
A) np.linspace(0, 10, 100)
B) np.linspace(0, 100, 10)
C) np.linspace(0, 10, 100)
D) np.linspace(10, 100, 0)
9 Which of the following best describes the purpose of GridSpec in data visualization? 1 2 5 5 1.4.1
A) Group data by a categorical variable and create subplots for each
category.
B) Visualize the relationship between two variables along with their
distributions.
C) Create custom grid layouts for organizing multiple subplots.
D) Plot the relationships between all numeric column pairs in a DataFrame.
Part – B (4 x 5 = 20 Marks)
Instructions: Answer ANY FOUR Questions
Q.No Question Marks BL CO PO PI.Code
11 Write a Python program to do the following: 5 2 3 5 1.4.1
a. Replace all missing (NaN) values in the Name column with the
string 'Unknown'.
b. Replace all missing (NaN) values in the Age column with the
mean of the available age values.
c. Add a new column named City and fill it with any default or
custom city names for each student.
d. Print the final cleaned DataFrame.
Name    Age
Bob     24
NaN     25
Sweety  NaN
Rita    26
import pandas as pd
import numpy as np
data = {'Name': ['Bob', np.nan, 'Sweety', 'Rita'], 'Age': [24, 25, np.nan, 26]}
df = pd.DataFrame(data)
df['Name'] = df['Name'].fillna('Unknown')       # Step 1: replace missing names
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Step 2: fill ages with the mean
# Step 3: Add a new column 'City' with default values (e.g., 'Delhi')
df['City'] = ['Delhi', 'Mumbai', 'Pune', 'Chennai']  # You can customize this
print(df)  # Step 4: print the final cleaned DataFrame
13 What is reshaping in pandas, and what are the main methods used for reshaping a DataFrame? 5 2 3 5 1.4.1
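The key gives no code for this question; a minimal sketch of the usual reshaping methods (pivot, melt, stack/unstack) on a small assumed DataFrame:
import pandas as pd

df = pd.DataFrame({'Date': ['D1', 'D1', 'D2', 'D2'],
                   'City': ['X', 'Y', 'X', 'Y'],
                   'Sales': [10, 20, 30, 40]})

# pivot: long -> wide (unique 'City' values become columns)
wide = df.pivot(index='Date', columns='City', values='Sales')
print(wide)

# melt: wide -> long
long_df = wide.reset_index().melt(id_vars='Date', var_name='City', value_name='Sales')
print(long_df)

# stack/unstack: move a column level into the row index and back
print(wide.stack())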
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 40]
plt.plot(x, y)
# Set labels
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.title('Rotating Tick Labels Example')
# Rotate x-axis tick labels by 45 degrees
plt.xticks(rotation=45)
plt.show()
15 What are the various types of annotations in Matplotlib? Give the syntax of annotation. 5 3 5 5 1.4.1
Part – C (2 x 10 = 20 Marks)
Instructions: Answer ALL questions.
Q.No Question Marks BL CO PO PI.Code
16 (i) Explain outliers and their types. (5 marks) 10 3 3 5 1.4.1
Outliers (noise) are data points which deviate significantly from the norm.
Outliers can be single data points, or a subset of observations called a collective outlier.
Outlier data points can greatly impact the accuracy and reliability of statistical analyses and machine learning models.
Outliers are also called abnormalities, discordants, deviants, or anomalies.
Types of outlier:
Global outliers
• Global outliers are isolated data points that are far away from the main body of the data.
• They are often easy to identify and remove.
Contextual outliers
• Contextual outliers are data points that are unusual in a specific context but may not be outliers in a different context.
• They are often more difficult to identify and may require additional information or domain knowledge to determine their significance.
(ii) We create a pandas DataFrame from a dictionary that holds the student data: the student's ID, first name, last name, and grade. (5 Marks)
a. Combine First Name and Last Name into a new column called Full
Name.
b. Display only the First Name and Grade columns.
c. Identify and display students who received a grade 'A'.
d. Create a new column Updated Grade, where every 'B' grade is
replaced with 'A'.
import pandas as pd

# Sample dataset
data = {
    'ID': [1, 2, 3, 4],
    'First Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Last Name': ['Smith', 'Jones', 'Brown', 'Taylor'],
    'Grade': ['A', 'B', 'B', 'C']
}
# Create DataFrame
students = pd.DataFrame(data)

# a. Combine First Name and Last Name into a new column called Full Name
students['Full Name'] = students['First Name'] + ' ' + students['Last Name']
print("DataFrame with Full Name:\n", students, "\n")
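The key only shows part (a); a sketch of the remaining parts on the same students DataFrame:
# b. Display only the First Name and Grade columns
print(students[['First Name', 'Grade']])

# c. Identify and display students who received a grade 'A'
print(students[students['Grade'] == 'A'])

# d. Create Updated Grade, replacing every 'B' with 'A'
students['Updated Grade'] = students['Grade'].replace('B', 'A')
print(students)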
17 a Explain the features of Seaborn. (5 Marks) 10 2 4 5 1.4.1
• Statistical Graphics: Seaborn is specifically designed for
creating statistical graphics, providing built-in functions for
common visualizations like scatter plots, line plots, histograms,
and more. This makes it easier to create visually appealing and
informative plots for data analysis.
• Data Visualization Themes: Seaborn offers pre-defined styles
and themes that can quickly change the overall appearance of
your plots. This helps create consistent and aesthetically
pleasing visualizations without requiring extensive
customization.
• Integration with Pandas and NumPy: Seaborn seamlessly
integrates with Pandas and NumPy, making it easy to work
with dataframes and arrays directly. This simplifies the
workflow and reduces the amount of code needed for data
analysis and visualization.
• FacetGrid and Pair Plots: Seaborn provides FacetGrid for
grouping data and creating subplots based on categorical
variables. This is useful for comparing distributions or
relationships across different groups. Pair plots allow you to
visualize the relationships between all pairs of numeric
columns in a DataFrame, helping you identify correlations and
patterns.
• Customization and Flexibility: While Seaborn provides a
high-level interface, it's built on top of Matplotlib, giving you
access to its extensive customization options. This allows you
to fine-tune your plots to meet your specific needs.
• Ease of Use: Seaborn's API is designed to be user-friendly and
intuitive, making it easier to learn and use compared to
Matplotlib. Its documentation is also well-written and provides
clear examples.
import matplotlib.pyplot as plt

# Sample data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
product_a_sales = [20, 35, 30, 35, 27, 40]
product_b_sales = [25, 32, 34, 20, 25, 30]
plt.plot(months, product_a_sales, label='Product A')
plt.plot(months, product_b_sales, label='Product B')
# Add a legend
plt.legend()
plt.show()
(OR)
17 b Give your own Seaborn library example for a 3D line plot, 3D scatter plot, and 3D surface plot. Draw the output for each example. 10 3 5 5 1.4.1
import seaborn as sns  # Not directly used for surface plots
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

# Sample data (ensure x and y are 2D for a surface plot)
x = np.linspace(0, 5, 10)  # 10 equally spaced points from 0 to 5
y = np.linspace(0, 5, 10)
X, Y = np.meshgrid(x, y)   # 2D grid from x and y for surface evaluation

def f(x, y):
    return x**2 + y**2  # Replace with your desired function

# Calculate z values based on the function
z = f(X, Y)

# Create a 3D figure and axes
fig = plt.figure(figsize=(8, 6))  # Adjust figure size as needed
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(X, Y, z, cmap='viridis', linewidth=0, antialiased=True)  # Adjust colormap
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
plt.title('3D Surface Plot')

# Customize viewing angle (optional)
ax.view_init(elev=20, azim=45)  # Adjust elevation and azimuth angles

# Show the plot
plt.show()
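The key only covers the surface plot; a sketch of the 3D line and scatter plots (Seaborn has no native 3D API, so these use Matplotlib's 3D axes with a Seaborn style):
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set_style('whitegrid')  # Seaborn styling for the Matplotlib 3D axes
t = np.linspace(0, 4 * np.pi, 100)

fig = plt.figure(figsize=(10, 4))

# 3D line plot: a helix
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot(np.cos(t), np.sin(t), t)
ax1.set_title('3D Line Plot')

# 3D scatter plot: random points
rng = np.random.default_rng(0)
ax2 = fig.add_subplot(122, projection='3d')
ax2.scatter(rng.random(50), rng.random(50), rng.random(50))
ax2.set_title('3D Scatter Plot')

plt.show()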
Course Outcome (CO) and Bloom's level (BL) Coverage in Questions
[Chart: CO Coverage (55% and 45%)]
a) It allows the analyst to ignore one of the variables and focus only
on two.
b) A 3D plot helps display the relationship between all three variables
simultaneously.
c) 3D plots are only used for representing time series data.
d) It makes the data look more attractive, even if it doesn’t add any
analytical value.
9 A school wants to compare the math test scores of students from three different classes (Class A, Class B, and Class C). The data science teacher uses Matplotlib to create a box plot for each class. What is the main reason for using a box plot in this situation? 1 2 5 5 2.1.3
Part – B (4 x 5 = 20 Marks)
Instructions: Answer ANY FOUR Questions
Q.No Question Marks BL CO PO PI.Code
11 List out different approaches used to combine different datasets with example. 5 2 3 5 2.1.2
1. Concatenation (Vertical/Horizontal)
o Example: pd.concat([df1, df2], axis=0) (vertical) or
pd.concat([df1, df2], axis=1) (horizontal)
2. Merging (SQL-style joins)
o Example: pd.merge(df1, df2, on='common_column',
how='inner')
3. Joining
oExample: df1.join(df2, on='common_column', how='left')
4. Appending
o Example: df1.append(df2, ignore_index=True) (note: DataFrame.append was removed in pandas 2.0; use pd.concat([df1, df2], ignore_index=True) instead)
5. Union
o Example: Combining rows from two datasets with the
same columns: pd.concat([df1, df2], axis=0,
ignore_index=True)
6. Cross Join
o Example: Using a Cartesian product to combine
datasets: df1.merge(df2, how='cross')
7. Concatenation by Index
o Example: df1.append(df2, ignore_index=False) (or, in pandas 2.x, pd.concat([df1, df2]))
12 What are the conditions used to choose the data binning techniques with example? 5 3 3 5 2.1.2
1. Nature of Data
o Uniform Data: Equal-width Binning
o Skewed Data: Equal-frequency Binning
2. Number of Bins
o Fixed Number of Bins: Equal-width or Equal-frequency
Binning
o Adaptive Binning: Custom Binning or Clustering-based
Binning
3. Distribution of Data
o Normal Distribution: Equal-width Binning
o Non-Normal Distribution: Equal-frequency Binning
4. Handling Outliers
o Outlier-prone Data: Adaptive Binning or Clustering-
based Binning
5. Interpretability of Bins
o Interpretable Bins: Custom Binning based on Domain
Knowledge
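A sketch contrasting the two most common choices, equal-width (pd.cut) and equal-frequency (pd.qcut) binning:
import pandas as pd

ages = pd.Series([18, 22, 25, 27, 30, 35, 42, 60, 75])

# Equal-width binning: bins span equal ranges of the value axis
print(pd.cut(ages, bins=3))

# Equal-frequency binning: each bin holds (roughly) the same count
print(pd.qcut(ages, q=3))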
13 What are the methods used to categorize the Noise and Outliers in the dataset? 5 2 3 5 2.2.3
1. Statistical Methods:
o Z-Score (Standard Deviation Method)
o IQR (Interquartile Range) Method
o Modified Z-Score
2. Visual Methods:
o Box Plot
o Scatter Plot
o Histogram
3. Machine Learning Methods:
o DBSCAN (Density-Based Spatial Clustering of
Applications with Noise)
o Isolation Forest
o One-Class SVM
4. Domain Knowledge:
o Expert-defined thresholds or rules for outlier detection
5. Proximity-Based Methods:
o k-Nearest Neighbors (k-NN)
o Local Outlier Factor (LOF)
These methods help identify and categorize noise and outliers based on
statistical properties, clustering, or domain-specific rules.
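A sketch of the two statistical methods named above (Z-score and IQR) on a small sample; 3 and 1.5 are the conventional thresholds:
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

# Z-score method: flag points far from the mean in standard-deviation units
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 2])  # threshold lowered from 3 to 2 for this tiny sample

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])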
14 Write the Python code to plot a 3D plot and a scatter plot using Matplotlib. 5 3 4 5 2.2.3
3D plot code:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

# Create data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
x, y = np.meshgrid(x, y)
z = np.sin(np.sqrt(x**2 + y**2))

# Create 3D plot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x, y, z, cmap='viridis')  # draw the surface

# Show plot
plt.show()
Scatter plot code:
import matplotlib.pyplot as plt

# Create data for scatter plot
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
plt.scatter(x, y)  # draw the scatter plot

# Show plot
plt.show()
15 What are the different annotations used to plot the subplots in Matplotlib? Give an example. 5 3 4 5 2.2.3
plt.annotate()
ax.text()
(Explanation of each with an example expected.)
Part – C (2 x 10 = 20 Marks)
Instructions: Answer ALL questions.
Q.No Question Marks BL CO PO PI.Code
16 a Discuss various data transformation techniques in detail with example. 10 2 3 5 2.2.3
Normalization
Standardization
Log Transformation
Power Transformation
Binning (Discretization)
Encoding Categorical Variables
One-Hot Encoding
Label Encoding
Feature Scaling
Quantile Transformation
PCA (Principal Component Analysis)
Polynomial Transformation
Handling Skewed Data
Text Vectorization (TF-IDF, Count Vectorizer)
Date-Time Feature Extraction
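A sketch of a few of the listed techniques (min-max normalization, log transformation, one-hot encoding) using pandas and NumPy only:
import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [20000, 35000, 50000, 400000],
                   'city': ['A', 'B', 'A', 'C']})

# Min-max normalization to [0, 1]
col = df['income']
df['income_norm'] = (col - col.min()) / (col.max() - col.min())

# Log transformation to compress the skewed tail
df['income_log'] = np.log1p(df['income'])

# One-hot encoding of the categorical column
df = pd.get_dummies(df, columns=['city'], prefix='city')
print(df)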
(OR)
16 b Consider you are a data analyst for a smart city initiative that monitors Electric Vehicle (EV) charging station usage across different locations. Your goal is to clean, transform, and analyze the data to optimize charging station efficiency, reduce waiting times, and improve user experience. The dataset contains EV charging session logs collected from multiple charging stations and includes the following attributes: Session ID, User ID, Station ID, Location, Charging Start Time, Charging End Time, Charging Duration, Energy Consumed (kWh), Cost ($), Payment Method, etc. 10 3 5 5 3.3.1
Answer Key:
(OR)
17 b Consider a healthcare data analyst at a research institute studying the connection between dietary habits and common lifestyle-related diseases. A survey was conducted across different age groups, and the collected data includes: 10 3 5 5 3.3.1
Participant_ID
Age_Group (e.g., Teen, Adult, Senior)
Diet_Type (e.g., Vegetarian, Non-Vegetarian, Vegan,
Junk Food)
Common_Disease (e.g., Obesity, Diabetes, Hypertension,
Heart Disease, None)
Exercise_Hours_per_Week
Answer Key:
Part – B (4 x 5 = 20 Marks)
Instructions: Answer ANY FOUR Questions
Q.No Question Marks BL CO PO PI.Code
import numpy as np
import matplotlib.pyplot as plt

# Create data
X = np.linspace(-5, 5, 100)
Y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(X, Y)
Z = np.sin(np.sqrt(X**2 + Y**2))

# Create the 3D axes and draw the surface
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis')

# Customize labels
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')

# Show plot
plt.show()
Customization options:
cmap: colour map for the surface (e.g., 'viridis', 'plasma').
ax.plot_surface(): further options such as edgecolor and alpha (transparency) can be added.
15 Use Seaborn to create a pairplot and customize its style using sns.set_style() on the iris dataset. What insights can a pairplot provide? 5 3 5 5
Ans:
import seaborn as sns
import matplotlib.pyplot as plt
iris = sns.load_dataset('iris')  # load the built-in iris dataset
sns.set_style('whitegrid')       # customize the style
# Create a pairplot
sns.pairplot(iris, hue='species')
plt.show()
Cluster Patterns: Helps detect if species clusters are separable based on the
features (e.g., the species may be visually separable in certain feature
combinations).
Part – C (2 x 10 = 20 Marks)
Instructions: Answer ALL questions.
Q.No Question Marks BL CO PO PI.Code
16 a Describe and compare various techniques used to clean and prepare raw datasets for analysis. Include examples of handling missing data, standardization, string cleaning, and binning. Give Python code examples of each. 10 2 3 5
import pandas as pd
df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})
df_cleaned = df.dropna()  # Remove rows with any missing values
# Impute missing data, e.g. with the column mean:
df_filled = df.fillna(df.mean())
(OR)
import pandas as pd
import numpy as np

# Compute IQR bounds for 'Age' (assumes df has an 'Age' column)
Q1, Q3 = df['Age'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower_bound_age = Q1 - 1.5 * IQR
upper_bound_age = Q3 + 1.5 * IQR

# Remove outliers
df = df[(df['Age'] >= lower_bound_age) & (df['Age'] <= upper_bound_age)]

4. Numeric Scaling (Standardization):
Standardize numeric columns like 'Age' and 'Salary' to have zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])
Final Dataframe:
print(df)
1. Setting Axis Limits:
xlim() and ylim() set the visible range of each axis.
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)
plt.xlim(0, 5)  # Set x-axis limit
plt.ylim(0, 20)  # Set y-axis limit
plt.show()
2. Adding Labels and Title:
xlabel(), ylabel(), and title() are used to add labels and titles.
plt.plot(x, y)
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Plot Title')
plt.show()
3. Legends:
Use legend() to add a legend to the plot. You can label your
plots during plotting and then call legend().
plt.plot(x, y, label='y = x^2')
plt.legend()
plt.show()
4. Annotations:
Use annotate() to add text or markers to specific points on
the plot.
plt.plot(x, y)
plt.annotate('Peak', xy=(2, 4), xytext=(3, 5),
arrowprops=dict(facecolor='red', arrowstyle="->"))
plt.show()
5. Applying Plot Styles:
Use plt.style.use() to apply predefined styles such as ggplot,
seaborn, etc.
plt.style.use('ggplot')
plt.plot(x, y)
plt.show()
4. Box Plot:
o Use-case: Summarizes a distribution (median, quartiles, outliers).
plt.boxplot([1, 2, 3, 4, 5, 6, 7])
plt.show()
5. Scatter Plot:
o Use-case: Displays relationships between two
variables, useful for correlation analysis.
o Example: Visualizing the relationship between
height and weight.
(OR)
17 b Apply advanced Seaborn visualizations to explore patterns in a real dataset. Include pair plots, heatmaps, and style settings. Write a Python program to visualize a 3D surface plot. Explain each component used in the plot. 10 3 5 5
Ans:
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
iris = sns.load_dataset('iris')
# Set style
sns.set_style("whitegrid")
# Pair plot
sns.pairplot(iris, hue='species')
plt.show()
Explanation:
sns.set_style(): Sets plot background style.
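The question also asks for a heatmap, which the key omits; a sketch of a correlation heatmap on the same iris data:
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris')

# Correlation heatmap of the numeric columns
corr = iris.drop(columns='species').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()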
# Create data and the 3D plot (X, Y, Z were not defined in the key)
import numpy as np
X, Y = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
Z = np.sin(np.sqrt(X**2 + Y**2))
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Plot surface
surf = ax.plot_surface(X, Y, Z, cmap='viridis')
# Add labels
ax.set_xlabel('X Axis')
ax.set_ylabel('Y Axis')
ax.set_zlabel('Z Axis')
plt.title('3D Surface Plot')
plt.show()
Explanation:
Axes3D: Enables 3D plotting.
[Chart: CO Coverage (CO1 53%, CO2 26%, CO3 21%)]
1 State the data wrangling operation that handles errors, missing data and inconsistencies 1 1 3 5 5.4.1
a. Validation
b. Data enrichment
c. Cleaning
d. Organization
2 Name the pandas method that can be used to combine DataFrames using one or more keys, as in database join operations 1 1 3 5 5.4.1
a. pandas.concat
b. pandas.merge
c. DataFrame.combine_first
d. DataFrame.join
3 Define the objective of imputation process 1 1 3 5 5.4.1
a. Remove entire rows or columns containing missing values
b. Remove pairs of observations where at least one value is missing
c. Replacing missing data with estimated values
d. Remove noise from the dataset using some algorithms
4 Identify the reshape process among the following that turns unique values from one column into new column headers, effectively transforming long-form data to wide-form 1 2 3 5 5.4.1
a. Melting
b. Stacking
c. Pivoting
d. Unstacking
Part – B (4 x 5 = 20 Marks)
Instructions: Answer ANY FOUR Questions
Q.No Question Marks BL CO PO PI.Code
11 Discuss different data structures that help optimize memory and computation while handling large data volumes. Briefly review their strengths and weaknesses. 5 2 3 5 5.6.1
Ans:
Data structures have different storage requirements, but also
influence the performance of CRUD (create, read, update, and
delete) and other operations on the data set
Ans:
1. Convert datasets to DataFrames
df_a = pd.DataFrame(data_a)
df_b = pd.DataFrame(data_b)
Ans:
Z-score normalization is a data preprocessing technique that
transforms numerical data to have a mean of 0 and a standard
deviation of 1. This is particularly useful when dealing with features
that have different scales or units, as it ensures that all features
contribute equally to the model.
Advantages:
1. Handles different Scales
2. Improves Machine Learning Models
3. Reduce Bias
4. Helps with outliers
Min-max normalization is a data preprocessing technique that
scales numerical data to a specific range, typically between 0 and 1.
It's useful when you want to preserve the relative distances between
data points while ensuring that all features have a similar scale
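A sketch computing both normalizations by hand with pandas (no sklearn needed):
import pandas as pd

s = pd.Series([10, 20, 30, 40, 100])

# Z-score normalization: mean 0, standard deviation 1
z = (s - s.mean()) / s.std()
print(z)

# Min-max normalization: rescale to the [0, 1] range
mm = (s - s.min()) / (s.max() - s.min())
print(mm)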
14 Write the Python code for creating a 2 x 2 grid of plots with the following subplots using matplotlib.pyplot 5 3 4 5 5.5.2
1. Grid 1 – Line plot
2. Grid 2 – Scatter plot
3. Grid 3 – Bar plot
4. Grid 4 – Histogram
Ans:
import matplotlib.pyplot as plt
import numpy as np
#Data
x = np.arange(1, 6)
y = x ** 2
categories = ['A', 'B', 'C', 'D', 'E']
values = [5, 7, 3, 8, 6]
hist_data = np.random.randn(1000)
#Plotting
plt.figure(figsize=(10, 8))
plt.subplot(2, 2, 1)
plt.plot(x, y, marker='o')
plt.title('Line Plot')
plt.subplot(2, 2, 2)
plt.scatter(x, y, color='green')
plt.title('Scatter Plot')
plt.subplot(2, 2, 3)
plt.bar(categories, values, color='orange')
plt.title('Bar Plot')
plt.subplot(2, 2, 4)
plt.hist(hist_data, bins=20, color='purple')
plt.title('Histogram')
plt.tight_layout()
plt.show()
15 You are given a dataset that contains the daily temperature (Temp), humidity (Humidity), and air quality index (AQI) recorded over 5 days. 5 3 5 5 5.5.2
Days = [1, 2, 3, 4, 5]
Temperature = [23, 25, 28, 32, 35]
AQI = [3, 5, 4, 2, 5]
Write Python code using Seaborn and Matplotlib to visualize the relationship among these three variables using a 3D line plot, where:
• X-axis → Day (as a sequence)
• Y-axis → Temperature
• Z-axis → AQI
Ans:
import matplotlib.pyplot as plt
import seaborn as sns
# Data
Days = [1, 2, 3, 4, 5]
Temperature = [23, 25, 28, 32, 35]
AQI = [3, 5, 4, 2, 5]
# Create 3D plot
sns.set(style="whitegrid")
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.plot(Days, Temperature, AQI, marker='o', label='Temp vs AQI')  # draw the 3D line
# Label axes
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (°C)')
ax.set_zlabel('AQI')
ax.set_title('3D Line Plot of Day vs Temperature vs AQI')
# Show plot
plt.legend()
plt.show()
Part – C (2 x 10 = 20 Marks)
Instructions: Answer ALL questions.
Q.No Question Marks BL CO PO PI.Code
16 a How are missing values represented in a dataset? With examples, describe the various imputation techniques used for handling missing values so that there is minimum loss of information. 10 2 3 5 5.5.1
Ans:
Imputation is the process of replacing missing data with estimated
values to maintain dataset integrity.
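A sketch of common imputation techniques on a small frame with NaN gaps:
import pandas as pd
import numpy as np

df = pd.DataFrame({'score': [10.0, np.nan, 30.0, np.nan, 50.0]})

print(df['score'].fillna(df['score'].mean()))    # mean imputation
print(df['score'].fillna(df['score'].median()))  # median imputation
print(df['score'].ffill())                       # forward fill from the previous value
print(df['score'].interpolate())                 # linear interpolation between neighbours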
(OR)
Ans:
import pandas as pd
# Create dataframe
data = {
'Customer_Info': [
" Mr. Ramesh K , Chennai - 600001 ",
"Ms. PRIYA D,COIMBATORE-641002",
"Dr. Arjun,Madurai - 625001",
"Mrs. Leela S , Chennai - 6251 "
]
}
df = pd.DataFrame(data)
df['Customer_Info'] = df['Customer_Info'].str.strip()
17 a Explain the features of the Seaborn library. Also describe the importance of FacetGrid, joint plot and pair plot with example implementation. 10 2 4 5 5.5.1
Ans:
• Seaborn is a library mostly used for statistical plotting in
Python.
• It is built on top of Matplotlib and provides beautiful default
styles and color palettes to make statistical plots more
attractive.
Features of Seaborn: statistical graphics, built-in themes, Pandas/NumPy integration, FacetGrid and pair plots, customization, ease of use.
Example: sns.pairplot(df)
(OR)
17 b You are provided with a sample dataset of product sales in a CSV file named product_sales.csv. The dataset contains the following columns: 10 3 5 5 5.5.2
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
[Chart: CO Coverage (55% and 45%)]
SRM Institute of Science and Technology
Set -
College of Engineering and Technology
School of Computing
SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamil Nadu
Academic Year: 2024-25 (EVEN)
7 c) Pair plot 1
8 a) plt.style.use('seaborn-darkgrid') 1
9 b) sns.histplot() 1
11 Discuss the general programming tips to deal with large data sets. 5
Don't reinvent the wheel: use tools and libraries developed by others.
Get the most out of your hardware: your machine is never used to its full potential; with simple adaptations you can make it work harder.
Reduce the computing need: slim down your memory and processing needs as much as possible.
12 When merging two DataFrames in pandas that have columns with the same name, 5
how can you ensure the column names are distinguishable?
Use the suffixes parameter in the merge() function to add distinguishing suffixes to
overlapping column names.
import pandas as pd
df1 = pd.DataFrame({'ID': [1, 2], 'Value': [10, 20]})
df2 = pd.DataFrame({'ID': [1, 2], 'Value': [30, 40]})
merged_df = pd.merge(df1, df2, on='ID', suffixes=('_left', '_right'))
print(merged_df)
13 Given the dataset data = {'Ages': [3, 18, 22, 10, 25, 29, 34, 14, 40, 45, 50, 55, 60, 12, 65, 70, 75, 80, 85]}, categorize the continuous Ages values into the groups of children, young, middle, and elder. Define appropriate age ranges for each category and implement the conversion. 5
import pandas as pd
data = {'Ages': [3, 18, 22, 10, 25, 29, 34, 14, 40, 45, 50, 55, 60, 12, 65, 70, 75, 80,
85]}
df = pd.DataFrame(data)
bins = [0, 12, 24, 59, 100]
labels = ['Child', 'Young', 'Middle', 'Elder']
df['Category'] = pd.cut(df['Ages'], bins=bins, labels=labels)
print(df)
14 Compare a box plot and a histogram, highlighting their use cases and strengths. 5
Box Plot:
Displays the distribution of data and highlights outliers.
Ideal for comparing multiple datasets.
Histogram:
Shows the frequency distribution of data values.
Useful for understanding the shape of the data (e.g., skewness).
15 How can you control the line properties (e.g., color, style, and width) of a chart in Matplotlib? Write the Python code and explain. 5
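No answer is given in the key; a minimal sketch of the usual keyword arguments:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

# color, linestyle, linewidth and marker control the line's appearance
plt.plot(x, y, color='red', linestyle='--', linewidth=2, marker='o')
plt.show()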
16 a Missing Data: 10
Fill missing sales values with the median (robust to outliers). Drop rows
if there are very few missing values.
Example Code:
df['Sales'] = df['Sales'].fillna(df['Sales'].median())
Irregular Formats:
Convert all dates into a uniform format (YYYY-MM-DD) using
pd.to_datetime.
Example Code:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
Duplicate Records:
Remove rows where Product, Region, and Date are duplicated, keeping the
first occurrence:
Example Code:
df = df.drop_duplicates(subset=['Product', 'Region', 'Date'], keep='first')
Irrelevant Data:
Drop unnecessary or irrelevant columns like Transaction ID
Example Code:
df = df.drop(columns=['Transaction ID'])
Outliers:
Identify outliers in Sales using the interquartile range (IQR)
Example Code:
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df['Sales'] >= lower_bound) & (df['Sales'] <= upper_bound)]
Categorical Inconsistencies:
Standardize inconsistent product names using a mapping dictionary:
Example Code:
product_mapping = {'Appl': 'Apple', 'Bananaa': 'Banana'}
df['Product'] = df['Product'].replace(product_mapping)
Merging:
Load the profit margins dataset and merge with the sales data on Product and
Region
Example Code:
profit_data = pd.read_csv('profit_margins.csv')
df = pd.merge(df, profit_data, on=['Product', 'Region'], how='inner')
Final Quality Checks
Ensure all columns have consistent data types:
df['Sales'] = df['Sales'].astype(float)
df['Date'] = pd.to_datetime(df['Date'])
print(df.isnull().sum())
16 b Output of pivot_df = df.pivot(index='Date', columns='Product', 10
values='Sales')
The pivot function reshapes the DataFrame by specifying:
index: Rows of the resulting DataFrame (Date here).
columns: Columns of the resulting DataFrame (Product here).
values: Data to fill the cells (Sales here).
Output:
# Example Dataset
import seaborn as sns
data = sns.load_dataset('iris')
# Pair Plot
sns.pairplot(data, hue='species')
Functionality:
Displays scatter plots between every pair of numerical columns.
Includes diagonal histograms to visualize the distribution of each feature.
Uses hue to color the data points based on a categorical column (species).
2. Box Plot
A box plot summarizes the distribution of a dataset through five-number summary
statistics: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
It also highlights potential outliers.
Code Example:
# Box Plot
sns.boxplot(x='species', y='sepal_width', data=data)
Functionality:
Displays distributions and compares groups (e.g., species) for a numerical
column (sepal_width).
Identifies outliers as points outside the whiskers.
Can be enhanced with swarm plots to overlay individual data points.
3. Histogram
A histogram visualizes the distribution of a single numerical variable by grouping
data into bins.
Code Example:
# Histogram
sns.histplot(data['sepal_length'], kde=True, bins=20)
Functionality:
Shows the frequency of data points within specified bins.
Optionally overlays a kernel density estimate (KDE) curve for a smoothed
representation of the distribution.
Parameters like bins control the granularity of the visualization.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Generate data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))
# Draw the surface (completing the fragment)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis')
plt.show()
A) NumPy
B) Matplotlib
C) Bcolz
D) Seaborn
2 What does the method combine_first () do in data wrangling? 1 1 3 5
A) Data compression
B) Parallel processing
C) Data visualization
D) Batch learning
4 Which Python library enables parallel execution and optimization of computation 1 2 3 5
flow?
A) Pandas
B) Matplotlib
C) Dask
D) Numexpr
A) To remove duplicates
B) To convert wide data into long format
C) To merge datasets
D) To perform statistical analysis
6 Which function is used to create a histogram in Matplotlib? 1 1 4 5
A) plot()
B) hist()
C) bar()
D) scatter()
7 What does plt.legend() do in a Matplotlib plot? 1 1 4 5
A) jointplot()
B) pairplot()
C) distplot()
D) catplot()
9 Which parameter controls the resolution of the saved figure using savefig()? 1 2 5 5
A) dpi
B) bbox_inches
C) pad_inches
D) format
10 Which type of annotation includes text with arrows to highlight specific points? 1 2 5 5
A) Tick
B) Title
C) Callout
D) Label
Register
Number
Part – B (4 x 5 = 20 Marks)
Instructions: Answer ANY FOUR Questions
Q.No Question Marks BL CO PO PI.Code
11 Explain the issues faced when handling large datasets and suggest suitable techniques to address them. 5 2 3 5
Answer:
Issues:
Memory overload: Large datasets exceed available RAM, causing system slowdown or crashes.
Slow processing: Algorithms may become inefficient due to data volume.
CPU starvation: Inefficient use of processing power leads to idle CPU time.
I/O bottlenecks: Reading/writing large data to/from disk is slow.
Techniques:
Data compression: Use tools like Bcolz to reduce memory usage.
Chunking: Process data in smaller batches.
Parallelism: Tools like Dask allow computations across multiple
CPU cores.
Efficient libraries: Use optimized Python tools like Numexpr,
Numba, and Theano.
Answer:
Techniques:
1. Detect missing data:
df.isnull().sum()
4. Forward/Backward fill:
df = df.ffill()  # or df.bfill() for backward fill
# Merge on a key column:
df = pd.merge(df1, df2, on='ID', how='inner')
# Join on the index:
df1.set_index('ID', inplace=True)
df2.set_index('ID', inplace=True)
df1.join(df2, how='outer')
Creating subplots:
fig, axes = plt.subplots(2, 2)
Annotating a point:
plt.annotate('Peak', xy=(2, 5), xytext=(2, 6),
             arrowprops=dict(facecolor='black', arrowstyle='->'))
Explain the purpose and usage of Pair Plots and Joint Plots in Seaborn with example code. 5 3 5 5
pairplot():
Displays pairwise relationships in a dataset.
Useful for exploring patterns and correlations.
jointplot():
Displays the relationship between two variables together with their marginal distributions.
# Pair plot
sns.pairplot(df, hue="species")
# Joint plot
sns.jointplot(x='sepal_length', y='sepal_width', data=df)
Part – C (2 x 10 = 20 Marks)
Instructions: Answer ALL questions.
Q.No Question Marks BL CO PO PI.Code
16 a Explain various data wrangling operations such as reshaping, pivoting, and merging in pandas with examples. 10 2 3 5
(7 Marks)
Reshaping:
pivot() – converts long to wide format.
melt() – converts wide to long format.
Merging:
merge() – combines datasets using key(s).
join() – merges using index.
concat() – appends datasets row/column-wise.
Pivot:
df.pivot(index='Date', columns='City', values='Sales')
Melt:
df.melt(id_vars='Date', var_name='City', value_name='Sales')
Merge:
pd.merge(df1, df2, on='ID', how='inner')
Standardization:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['Age']])
16 b 10 3 3 5
Outlier Detection:
IQR method:
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Age'] < Q1 - 1.5*IQR) | (df['Age'] > Q3 + 1.5*IQR)]
(OR)
17 b Construct different Seaborn visualizations including pair plots, scatter plots, and joint plots, and explain their use in analysis. 10 3 5 5
(5 marks)
Pair Plot:
Explores multiple variables at once.
sns.pairplot(df)
Scatter Plot:
Visualizes relationship between two variables.
sns.scatterplot(x='Age', y='Income', data=df)
Joint Plot:
Combines scatter and histograms for deeper insight.
sns.jointplot(x='Age', y='Income', data=df)
(5 marks)
Explain the plots using diagrams.
Use cases:
Detect correlation
Identify clusters and trends
Explore distributions
[Chart: CO Coverage (55% and 45%)]
7 Consider the code below that creates a scatter plot with Seaborn: 1 1 4 5
sns.relplot(x="sepal_length", y="sepal_width",
data=iris, hue="species",
kind="scatter", alpha=0.7)
Which of the following statements best explains the use of alpha=0.7?
A. It reduces the marker size.
B. It adjusts the transparency to help visualize overlapping points.
C. It changes the color palette.
D. It increases the line width for plot boundaries.
9 In the following code snippet, what is the role of the rstride and cstride parameters? 1 2 5 5
surf = ax.plot_surface(X, Y, Z, cmap='viridis',
rstride=1, cstride=1)
A. They define the number of rows and columns in the data grid.
B. They control the sampling (row and column stride) of the input data
for rendering the surface.
C. They set the resolution of the color mapping.
D. They adjust the transparency of the surface.
Part – B (4 x 5 = 20 Marks)
Instructions: Answer ANY FOUR Questions
Q.No Question Marks BL CO PO PI.Code
13 Give a credit risk model for a fintech startup. The dataset includes columns: credit_score, income, loan_amount, defaulted (Yes/No), and age. Perform the following tasks to prepare the data for modeling. 5 2 3 5
a. Group credit_score into risk categories: 'Poor', 'Fair', 'Good',
'Excellent'.
b. Standardize income and loan_amount.
c. Summarize the average loan amount and default rate for each
credit risk category.
d. Explain why binning and standardization are important in this
context.
Step-by-Step Data Preparation
a. Group credit_score into risk categories
Categorize credit scores into bins (the cut-offs below are assumed so as to match the sample output):
import pandas as pd
import numpy as np

# Example DataFrame
df = pd.DataFrame({
    'credit_score': [580, 660, 710, 780, 620],
    'income': [30000, 45000, 60000, 80000, 35000],
    'loan_amount': [5000, 7000, 10000, 12000, 6000],
    'defaulted': ['Yes', 'No', 'No', 'No', 'Yes'],
    'age': [25, 35, 45, 50, 30]
})
bins = [0, 599, 659, 739, 850]  # assumed thresholds
labels = ['Poor', 'Fair', 'Good', 'Excellent']
df['risk_category'] = pd.cut(df['credit_score'], bins=bins, labels=labels)
print(df)
Output:
credit_score income loan_amount defaulted age risk_category
0 580 30000 5000 Yes 25 Poor
1 660 45000 7000 No 35 Good
2 710 60000 10000 No 45 Good
3 780 80000 12000 No 50 Excellent
4 620 35000 6000 Yes 30 Fair
b. Standardize income and loan_amount
Standardization centers values to a mean of 0 and a standard deviation of
1:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['income_scaled', 'loan_amount_scaled']] =
scaler.fit_transform(df[['income', 'loan_amount']])
print(df)
Output:
credit_score income loan_amount defaulted age risk_category \
0 580 30000 5000 Yes 25 Poor
1 660 45000 7000 No 35 Good
2 710 60000 10000 No 45 Good
3 780 80000 12000 No 50 Excellent
4 620 35000 6000 Yes 30 Fair
income_scaled loan_amount_scaled
0 -1.100964 -1.150447
1 -0.275241 -0.383482
2 0.550482 0.766965
3 1.651446 1.533930
4 -0.825723 -0.766965
c. Summarize average loan and default rate per risk category
# Convert 'defaulted' to binary
df['defaulted_binary'] = df['defaulted'].map({'Yes': 1, 'No': 0})
# Aggregate per risk category (reconstructed to match the output below)
summary = df.groupby('risk_category').agg(
    avg_loan_amount=('loan_amount', 'mean'),
    default_rate=('defaulted_binary', 'mean'))
print(summary)
output:
avg_loan_amount default_rate
risk_category
Poor 5000.0 1.0
Fair 6000.0 1.0
Good 8500.0 0.0
Excellent 12000.0 0.0
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 5, 50)
y1 = x
fig, axes = plt.subplots(1, 2)  # assumed layout; only the first panel is shown in the key
# First subplot: y = x
axes[0].plot(x, y1, color='blue')
axes[0].set_title('Plot of y = x')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
plt.show()
Output: (figure not reproduced)
15 Write a Python program that demonstrates the use of 3D plotting by doing the following: 5 3 5 5
• Create a 3D plot using any mathematical function or parametric equations of your choice.
• Plot the data using a 3D axis (ax = fig.add_subplot(..., projection='3d')).
• Customize the plot using color maps, line styles, or markers for better visualization.
import numpy as np
import matplotlib.pyplot as plt

# Build the surface z = sin(sqrt(x² + y²))
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis')

# Customize labels
ax.set_title('3D Surface Plot of z = sin(sqrt(x² + y²))', fontsize=14)
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_zlabel('Z-axis')
plt.show()
Output: (figure not reproduced)
Part – C (2 x 10 = 20 Marks)
Instructions: Answer ALL questions.
Q.No Question Marks BL CO PO PI.Code
16 a Consider the basic dataset that contains student details collected during admissions. The dataset contains errors and inconsistencies that need to be addressed before it can be used for reporting and visualization. 10 2 3 5
student_id  Name          Age  Email                 grade
1           John Smith    20   john.smith@email.com  A
2           SARA          -1   sara123@email.com     B+
3           Riya Kapoor   NaN  riya_kapoor@gmail     A
4           Tom Brown     19   tom.brown@email.com   None
5                         22                         B
6           alex johnson  0    alex.j@email.com      A+
Write Python code to perform the following data cleaning
operations:
a. Identify and remove rows where the name or email is
missing or blank.
b. Replace invalid age values (e.g., 0, -1, or NaN) with the
mean age of valid entries.
c. Strip extra spaces in the name column and convert all
names to proper title case.
d. Standardize grade values by replacing None with
"Incomplete".
e. Remove rows with invalid email addresses (those without
"@" or a "." after the "@").
f. Display a summary of the cleaned dataset using
df.describe() or df.info().
g. Explain two potential risks if this dataset is used in its raw
form for decision-making.
Python code:
import pandas as pd
import numpy as np
data = {
'student_id': [1, 2, 3, 4, 5, 6],
'Name': ['John Smith', 'SARA', 'Riya
Kapoor', 'Tom Brown', '', 'alex johnson'],
'Age': [20, -1, np.nan, 19, 22, 0],
'Email': ['john.smith@email.com',
'sara123@email.com', 'riya_kapoor@gmail',
'tom.brown@email.com', '',
'alex.j@email.com'],
'grade': ['A', 'B+', 'A', None, 'B', 'A+']
}
df = pd.DataFrame(data)
print(df)
Identify and remove rows where the name or email is missing
or blank.
df = df[(df['Name'].notna()) &
(df['Name'].str.strip() != '') &
(df['Email'].notna()) &
(df['Email'].str.strip() != '')]
print(df)
output:
student_id Name Age Email grade
0 1 John Smith 20.0 john.smith@email.com A
1 2 SARA -1.0 sara123@email.com B+
2 3 Riya Kapoor NaN riya_kapoor@gmail A
3 4 Tom Brown 19.0 tom.brown@email.com None
5 6 alex johnson 0.0 alex.j@email.com A+
Replace invalid age values (e.g., 0, -1, or NaN) with the mean
age of valid entries.
valid_ages = df['Age'][df['Age'] > 0]
mean_age = valid_ages.mean()
df['Age'] = df['Age'].apply(lambda x: mean_age
if pd.isna(x) or x <= 0 else x)
print(df)
output:
student_id Name Age Email grade
0 1 John Smith 20.0 john.smith@email.com A
1 2 SARA 19.5 sara123@email.com B+
2 3 Riya Kapoor 19.5 riya_kapoor@gmail A
3 4 Tom Brown 19.0 tom.brown@email.com None
5 6 alex johnson 19.5 alex.j@email.com A+
Strip extra spaces in the name column and convert all names to
proper title case.
df['Name'] =
df['Name'].str.strip().str.title()
print(df)
Output:
student_id Name Age Email grade
0 1 John Smith 20.0 john.smith@email.com A
1 2 Sara 19.5 sara123@email.com B+
2 3 Riya Kapoor 19.5 riya_kapoor@gmail A
3 4 Tom Brown 19.0 tom.brown@email.com None
5 6 Alex Johnson 19.5 alex.j@email.com A+
Standardize grade values by replacing None with "Incomplete".
df['grade'] = df['grade'].fillna('Incomplete')
print(df)
Output:
student_id Name Age Email grade
0 1 John Smith 20.0 john.smith@email.com A
1 2 Sara 19.5 sara123@email.com B+
2 3 Riya Kapoor 19.5 riya_kapoor@gmail A
3 4 Tom Brown 19.0 tom.brown@email.com Incomplete
5 6 Alex Johnson 19.5 alex.j@email.com A+
Remove rows with invalid email addresses (those without "@"
or a "." after the "@").
def is_valid_email(email):
if "@" in email:
local, _, domain =
email.partition("@")
return "." in domain
return False
df = df[df['Email'].apply(is_valid_email)]
print(df)
Output:
student_id Name Age Email grade
0 1 John Smith 20.0 john.smith@email.com A
1 2 Sara 19.5 sara123@email.com B+
3 4 Tom Brown 19.0 tom.brown@email.com Incomplete
5 6 Alex Johnson 19.5 alex.j@email.com A+
Explain two potential risks if this dataset is used in its raw
form for decision-making.
import pandas as pd
# Assumed sample customers table (the original is not shown in the key)
customers = pd.DataFrame({
    'Customer_ID': ['C001', 'C002', 'C003'],
    'Name': ['Asha', 'Ben', 'Chitra']  # hypothetical values
})
transactions = pd.DataFrame({
    'Customer_ID': ['C001', 'C002', 'C004'],
    'Date': ['2024-10-01', '2024-10-02', '2024-10-02'],
    'Purchase_Amount': [250, 100, 300]
})
# Merging on Customer_ID
merged_df = pd.merge(customers, transactions, on='Customer_ID')
print(merged_df)
# Sample data
df = pd.DataFrame({'Age': [22, 25, 30, 30, 35, 40, 45, 50, 55, 60]})
(OR)
17 b Describe annotation techniques used in data visualization using Python. Explain the importance of annotations in plots and demonstrate how annotations can be added using Matplotlib and Seaborn with appropriate code examples. Include different types of annotations such as text, arrows, and labels on bar charts, line plots, and scatter plots. 10 3 5 5
Annotations are crucial in data visualization as they help highlight
important information, clarify data points, and guide interpretation. In
Python, both Matplotlib and Seaborn support annotation techniques—
since Seaborn builds on Matplotlib, annotations typically use
Matplotlib's functions under the hood.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]
plt.plot(x, y, marker='o')
plt.text(2, 20, 'Second Point', fontsize=12, color='red')
plt.title("Text Annotation Example")
plt.show()
categories = ['A', 'B', 'C']  # assumed sample data
values = [5, 8, 3]
plt.bar(categories, values)
for i, v in enumerate(values):
    plt.text(i, v + 0.5, str(v), ha='center', fontweight='bold')
plt.title("Bar Chart with Value Labels")
plt.show()
# Sample data
import seaborn as sns
import matplotlib.pyplot as plt
df = sns.load_dataset('tips')
sns.scatterplot(data=df, x='total_bill', y='tip')
plt.show()

# Arrow annotation (note: this example assumes a plot whose x-axis
# holds day categories, e.g. a weekly sales line plot)
plt.annotate(
    'Peak Sales',
    xy=('Fri', 300),
    xytext=('Wed', 310),
    arrowprops=dict(arrowstyle='->', color='red'),
    color='red'
)
plt.show()
Annotation Techniques
Technique            Function                        Use Case
plt.text()           Add static text                 Labeling bars or points
plt.annotate()       Text + arrows                   Highlighting specific features
ax.bar_label()       Bar label shortcut              Labeling each bar
Seaborn + annotate   Same as Matplotlib, post-plot   Highlights in plots
[Chart: CO Coverage (55% and 45%)]
Part – B (4 x 5 = 20 Marks)
Instructions: Answer ANY FOUR Questions
Q.No Question Marks BL CO PO PI.Code
11 Discuss the various methods of handling missing data in the dataset. 5 2 3 5
Listwise Deletion: Remove entire rows or columns containing missing
values. This method is simple but can result in a significant loss of data,
especially if there are many missing values.
For example, for a column like Color, if there are missing values, you can
replace them with "Unknown" or "Missing".
Pandas method: df['Color'].fillna('Unknown').
Preserves information about the missingness.
This method can potentially introduce noise, as the new category may not
represent an actual value.
Results may vary depending on the implementation and how the algorithm
handles the missing values.
5. Multiple Imputation
This technique involves creating multiple datasets with different imputed
values and then combining the results to account for uncertainty in the
imputation process.
A model is trained using the non-missing data and then used to predict
missing values.
7. Leave Missing Values As-Is (For Some Models)
In some cases, particularly when using deep learning models, it may be
acceptable to leave missing values as they are and let the model learn how
to handle them during training.
Models like neural networks can handle missing data if they are explicitly
designed to do so.
May lead to poor model performance if the model does not handle missing
values well.
12 Explain various data transformation techniques used in data preprocessing. 5 3 3 5
Data Generalization:
Data generalization is the process of converting detailed data into a more
abstract, higher-level representation while retaining essential
information.
It is commonly used in data mining, privacy preservation, and machine
learning to reduce complexity and improve model generalization.
Types
Attribute Generalization
Hierarchical Generalization
Numeric Generalization
Text Generalization
13 Write a Python program that accepts a sentence from the user and performs the following string operations: 5 2 3 5
1. Display the total number of words in the sentence.
2. Convert the entire sentence to title case (first letter capitalized).
3. Find and display the number of times the word 'the' appears
(case insensitive).
4. Replace all occurrences of the word 'and' with '&'.
# Accept sentence from the user
sentence = input("Enter a sentence: ")
print("Total words:", len(sentence.split()))  # 1. word count
print("Title case:", sentence.title())        # 2. title case
# 3. Count 'the' (case insensitive)
word_count_the = sentence.lower().split().count('the')
print(f"Number of times the word 'the' appears: {word_count_the}")
# 4. Replace the word 'and' with '&'
print(' '.join('&' if w.lower() == 'and' else w for w in sentence.split()))
14 Explain the concept of subplots in Matplotlib with suitable examples. 5 3 4 5
1. plt.subplot()
The subplot() function divides the figure into a grid and places a subplot in a specific position within that grid.
Syntax:
plt.subplot(nrows, ncols, index)
Example:
import matplotlib.pyplot as plt
plt.subplot(1, 2, 1)            # first panel of a 1 x 2 grid
plt.plot([1, 2, 3], [1, 4, 9])
plt.subplot(1, 2, 2)            # second panel
plt.plot([1, 2, 3], [9, 4, 1])
plt.show()

2. plt.subplots()
The subplots() function creates a grid of subplots and returns both the figure and axes objects. This is a more flexible and modern approach compared to plt.subplot(), especially when working with multiple subplots.
Syntax:
fig, axes = plt.subplots(nrows, ncols)
Example:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2)
axes[0].plot([1, 2, 3], [1, 4, 9])
axes[1].plot([1, 2, 3], [9, 4, 1])
plt.show()
15 Define annotations in the context of data visualization using Matplotlib and briefly explain the types of annotations used. 5 3 5 5
Syntax:
plt.text(x, y, 'Text', fontsize=12,
color='red', ha='center', va='center')
Example:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)
plt.text(3, 20, 'This is a point', fontsize=12,
color='blue', ha='left')
plt.show()
Syntax:
plt.annotate('Text', xy=(x, y), xytext=(x_offset, y_offset),
             arrowprops=dict(facecolor='blue', arrowstyle='->'))
Example:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)
plt.annotate('Point (3, 9)', xy=(3, 9),
xytext=(4, 10),
arrowprops=dict(facecolor='red',
arrowstyle='->'))
plt.show()
Syntax:
plt.text(x, y, 'Text',
bbox=dict(facecolor='yellow', alpha=0.5))
Example:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)
plt.text(3, 20, 'This is a point', fontsize=12,
color='blue', bbox=dict(facecolor='yellow',
alpha=0.5))
plt.show()
Example:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)
# The annotate call was truncated in the key; an assumed reconstruction:
plt.annotate('Point (4, 16)', xy=(4, 16), xytext=(2, 20),
             arrowprops=dict(facecolor='green', arrowstyle='->'))
plt.show()
Example:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)
plt.scatter([3], [9], color='red', s=100) #
Highlight a specific point
plt.text(3, 9, 'Highlighted Point',
fontsize=12, color='black', ha='center',
va='center')
plt.show()
Part – C (2 x 10 = 20 Marks)
Instructions: Answer ALL questions.
Q.No Question Marks BL CO PO PI.Code
16 a Discuss the major challenges encountered while working with large datasets and how these challenges impact data preprocessing, storage, and analysis. 10 2 3 5
2. Storage Challenges
a) Storage Capacity
3. Analysis Challenges
a) Slow Computation and Processing Time
b) Privacy Issues
(OR)
16 b Explain the concept of data wrangling and discuss the key steps involved in the data wrangling process and the importance of each step. 10 3 3 5
Data wrangling (also known as data munging) is the process of
transforming and mapping raw data into a more useful and
accessible format for analysis. It involves cleaning, restructuring,
and enriching raw data from various sources to make it suitable
for analysis and decision-making. Data wrangling is often
considered one of the most time-consuming and important tasks
in the data analysis pipeline.
1. Data Collection
o Description: Gathering data from various sources
(e.g., databases, flat files like CSV, JSON, XML,
APIs, web scraping, or sensor data).
o Importance: This is the foundational step where
the raw data is gathered. It sets the stage for all
subsequent steps in the wrangling process.
o Challenges: Data could be incomplete, in
inconsistent formats, or in a form that is difficult to
analyze.
o Tools: APIs, web scraping tools (e.g.,
BeautifulSoup), SQL queries, data import
functions in libraries (e.g., pandas.read_csv()).
2. Data Inspection/Exploration
o Description: Inspecting the dataset to understand
its structure, content, and identify any problems
such as missing values, duplicates, or incorrect
formats.
o Importance: This step helps to get a feel for the
data and ensures that any issues or anomalies are
identified before any transformations are done.
o Challenges: Data might be large, unstructured, or
might contain inconsistencies that are hard to
detect manually.
o Tools: pandas (e.g., df.info(), df.describe(),
df.head()), matplotlib, seaborn (for
visualization), or any other exploratory data
analysis (EDA) tool.
3. Data Cleaning
o Description: Removing or correcting any errors in
the data, such as missing values, duplicates,
inconsistent data types, or outliers.
o Importance: Cleaning ensures the accuracy and
quality of the data. Poor-quality data can lead to
misleading results in analysis or modeling.
o Challenges: Dealing with missing values,
correcting inconsistent data entries, handling noisy
data.
o Tools: pandas (fillna(), dropna(),
drop_duplicates(), astype()), numpy (e.g.,
np.nan for missing values).
4. Data Transformation
o Description: Transforming the data into a more
suitable format for analysis. This may involve
normalizing or scaling numerical values,
converting categorical variables to numerical ones,
or reshaping the data.
o Importance: Transformations help prepare the
data for various types of analysis or modeling.
Some algorithms require data to be in a specific
format (e.g., scaling for neural networks).
o Challenges: Applying the right transformations
can be complex, especially with heterogeneous
data types (e.g., mixing categorical and numerical
data).
o Tools: pandas (e.g., pd.get_dummies() for one-
hot encoding, StandardScaler from sklearn for
scaling), numpy for mathematical transformations.
5. Data Integration
o Description: Combining data from multiple
sources or datasets, ensuring that the combined
data is consistent and compatible.
o Importance: Many datasets are spread across
different sources. Integration allows data from
these sources to be merged into a single dataset for
analysis.
o Challenges: Merging datasets may introduce
discrepancies (e.g., mismatched keys, inconsistent
formats) that need to be resolved.
o Tools: pandas (e.g., merge(), concat()), SQL
join operations, or using ETL tools for larger
datasets.
6. Data Enrichment
o Description: Enhancing the dataset with additional
information, such as external data sources or
creating new features.
o Importance: Enriching the data helps improve the
quality and comprehensiveness of the dataset,
allowing for more insightful analysis.
o Challenges: Adding external data can introduce its
own inconsistencies or issues like missing values.
o Tools: APIs, web scraping, and additional datasets
from open data repositories.
7. Data Formatting
o Description: Converting data into the required
format, such as ensuring that numerical columns
are numeric and categorical columns are properly
labeled.
o Importance: Correct formatting is essential for the
subsequent steps in the analysis or modeling
pipeline.
o Challenges: Ensuring all columns are consistently
formatted, especially when dealing with large
datasets with diverse data types.
o Tools: pandas for type casting (e.g.,
df['column'].astype(int)), str functions for
string manipulation.
8. Data Sampling/Resampling (if needed)
o Description: Reducing the dataset size by
sampling a subset of data (if the dataset is too
large) or balancing the dataset (e.g., in
classification problems with imbalanced classes).
o Importance: Sampling can reduce the
computational complexity and speed up the
analysis, while resampling ensures that models are
not biased due to class imbalances.
o Challenges: Ensuring that the sample is
representative of the full dataset and that
resampling does not distort the underlying
patterns.
o Tools: pandas (e.g., df.sample()), imblearn for
oversampling/undersampling.
9. Data Validation
o Description: Ensuring that the cleaned,
transformed, and integrated data meets the
requirements of the analysis or machine learning
models.
o Importance: Validation ensures that the dataset is
accurate, complete, and ready for use in the next
stage of analysis or modeling.
o Challenges: Performing robust validation,
especially with large datasets, can be difficult and
time-consuming.
o Tools: Manual checks, statistical methods, or
automated validation scripts.
# Data for the pie chart: course names and the number of students
import matplotlib.pyplot as plt
courses = ['Python', 'Java', 'C++', 'AI']
students = [150, 120, 90, 60]
plt.pie(students, labels=courses, autopct='%1.1f%%')  # completed: draw the pie with percentage labels
plt.title('Student Distribution by Course')
plt.show()
Output:
This will display a pie chart showing the percentage distribution of students in the Python, Java, C++, and AI courses.
Explanation:
Data: The days list represents the days of the week, and
the visitors list represents the number of visitors to the
website for each corresponding day.
Line Graph: The plt.plot() function is used to plot the
line graph.
o marker='o' adds a marker at each data point (a
circle in this case).
o linestyle='-' ensures that the points are
connected with a line.
o color='b' sets the line color to blue.
Title and Labels: plt.title(), plt.xlabel(), and
plt.ylabel() are used to set the title and axis labels.
Grid: plt.grid(True) adds a grid to the graph to make it
easier to read the values.
Output:
This will display a line graph representing the number of visitors
to a website over the span of 7 days (Monday to Sunday).
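The code this explanation refers to is not reproduced in the key; a sketch consistent with the description:
import matplotlib.pyplot as plt

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
visitors = [120, 150, 170, 160, 180, 210, 190]  # assumed sample values

plt.plot(days, visitors, marker='o', linestyle='-', color='b')
plt.title('Website Visitors Over a Week')
plt.xlabel('Day')
plt.ylabel('Number of Visitors')
plt.grid(True)
plt.show()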
(OR)
17 b i) Define Seaborn. How does it differ from Matplotlib? 10 3 5 5
Write a Python program to draw a scatter plot using Seaborn showing the relationship between height and weight of individuals. (A sketch follows the comparison below.)
What is Seaborn?
Seaborn is a Python visualization library built on top of
Matplotlib. It provides a high-level interface for drawing
attractive and informative statistical graphics. Seaborn comes
with several built-in themes and color palettes that make it easy to
generate aesthetically pleasing plots with minimal code.
1. Ease of Use:
o Matplotlib: While powerful and highly
customizable, Matplotlib requires more lines of
code to generate common statistical plots. It is
great for creating basic and complex plots but can
be verbose.
o Seaborn: It is built to simplify the process of
creating complex visualizations, especially for
statistical data. It provides high-level functions that
automatically handle many details, such as axes
labels, legends, color schemes, etc.
2. Style and Aesthetics:
o Matplotlib: While Matplotlib can generate a wide
range of plots, the default style is relatively basic.
Customizing the appearance (e.g., changing colors,
themes) requires extra work.
o Seaborn: It comes with built-in themes, color
palettes, and automatic formatting, making it much
easier to generate more visually appealing plots
with minimal customization.
3. Statistical Plotting:
o Matplotlib: It is primarily focused on general
plotting but does not offer built-in support for
statistical visualizations (e.g., heatmaps, regression
plots).
o Seaborn: It includes specialized functions for
creating statistical plots like regression plots, box
plots, violin plots, and heatmaps, making it ideal
for exploratory data analysis.
4. Integration with Pandas:
o Matplotlib: While it can work with Pandas
DataFrames, it doesn't provide direct support for
DataFrame operations.
o Seaborn: It works seamlessly with Pandas
DataFrames and provides functions that directly
accept DataFrame columns as input.
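The requested scatter-plot program does not appear in the key; a sketch with assumed sample data:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Assumed sample height (cm) and weight (kg) values
df = pd.DataFrame({'height': [150, 160, 165, 170, 175, 180],
                   'weight': [50, 58, 63, 68, 74, 80]})

sns.scatterplot(data=df, x='height', y='weight')
plt.title('Height vs Weight')
plt.show()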
[Chart: CO Coverage (CO1 53%, CO2 26%, CO3 21%)]