
Statistical Methods in AI - Monsoon 2025

Assignment 1
Deadline: 26 August 2025, 11:59 P.M.
Instructors: Prof. Ravi Kiran Sarvadevabhatla, Prof. Saikiran Bulusu

General Instructions
• Your assignment must be implemented in Python.
• Clearly label and organize your code, including comments that explain the purpose of
each section and key steps in your implementation. Ensure that your files are
well-structured, with headings, subheadings, and explanations as necessary.
• Make sure to test your code thoroughly before submission to avoid any runtime errors
or unexpected behavior.
• Your assignment will be evaluated not only based on correctness but also on the quality
of code, the clarity of explanations, and the extent to which you’ve understood and
applied the concepts covered in the course.
• We are aware that various hacks make it possible to submit the assignment late in
GitHub Classroom. We have measures in place to detect this, and anyone caught
attempting it will receive a straight zero on the assignment.

AI Tools Usage Instructions (Mandatory if Applicable)


We are aware of how easy it is to write code and solve questions with the help of LLM
services, but we strongly encourage you to figure out the answers yourself. If you use any
AI tools such as ChatGPT, Gemini, Claude, etc., to assist in solving any part of this
assignment:
• If you are unable to explain any part of the solution/code during evaluations,
that solution/code will be considered plagiarized and you will be penalized. You
must also be able to briefly explain how you modified or verified the AI-generated
content.
• You must include a shareable link to the AI Tool chat history or a screenshot
of the relevant conversation. Do NOT share any private or sensitive personal
information in the AI Tool conversations you include.

Submission Instructions
• Submit a self-contained Jupyter notebook containing all code, plots, and printed tables.

• Report all analysis, comparisons, and metrics in the notebook or in a separate report
that is part of the submission itself. No external links to cloud storage, wandb logs,
or any other alternatives will be accepted as part of your submission. Only the values
and visualizations committed to the repository will be graded.

• Use your institute email ID to generate a personalized random seed:

– Get the part before @ in your IIITH email, e.g., username from
username@program.iiit.ac.in.

– Use the SHA-256 hash of this string to ensure uniqueness:

import hashlib
seed = int(hashlib.sha256(username.encode()).hexdigest(), 16) % (2**32)

– Use this seed in all random number generators (e.g., np.random.default_rng(seed)).

• All plots must include your email username in the title or filename:

plt.text(
    0.95, 0.95, "username",
    ha='right', va='top',
    transform=plt.gca().transAxes,
    fontsize=10, color='gray', alpha=0.7
)
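Putting the two requirements above together, here is a minimal end-to-end sketch; the
username jane.doe and the example histogram are illustrative placeholders only:

import hashlib

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical username: use the part of your IIITH email before '@'.
username = "jane.doe"

# Personalized 32-bit seed from the SHA-256 hash of the username.
seed = int(hashlib.sha256(username.encode()).hexdigest(), 16) % (2**32)
rng = np.random.default_rng(seed)

# Example plot carrying the required username watermark.
plt.hist(rng.normal(size=1000), bins=30)
plt.title("Example histogram")
plt.xlabel("Value")
plt.ylabel("Count")
plt.text(0.95, 0.95, username, ha='right', va='top',
         transform=plt.gca().transAxes, fontsize=10, color='gray', alpha=0.7)
plt.show()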

Submission Policy (GitHub Classroom Assignments)


To encourage consistent progress and discourage last-minute submissions, the following
policy applies:

• Minimum Progress Requirement: You must push at least two meaningful commits
on different days prior to the deadline.

• Commit Timing Check: If over 80% of commits are made within the last 24 hours
before the deadline, a 10% penalty will be applied.

• Commit Quality: Commits must reflect actual progress. Non-informative or
placeholder commits will not count.

• Final Submission: Your final grade is based on the latest commit before the deadline,
but commit history will be reviewed.

Guidelines for Implementation and Code Design
1. Use object-oriented programming. For Q1, all sampling and analysis must use the
original dataset stored in the StudentDataset object.

2. Use appropriate visualization libraries such as matplotlib, seaborn, or plotly. Each
plot must include a title, x-label, y-label, and legend (if applicable). Answers
with missing labels or legends will receive 0 marks.

3. Keep visualization and computation logic in separate functions.

4. Add docstrings to methods explaining what they do.

5. Use reproducible random sampling using the given seed.

Q1.0 Dataset Generation [6 marks]


Generate 10,000 student records with the following attributes:
• gender: Male (65%), Female (33%), Other (2%) [1]

• major: B.Tech (70%), MS (20%), PhD (10%) [1]

• program: distribution conditioned on major: [2]

Major    CSE    ECE    CHD    CND
B.Tech   40%    40%    10%    10%
MS       30%    30%    20%    20%
PhD      25%    25%    25%    25%

• GPA: Normally distributed by major, clipped to [4.0, 10.0] [2]

Major    GPA Distribution
B.Tech   N(7.0, 1.0)
MS       N(8.0, 0.7)
PhD      N(8.3, 0.5)

You must implement a class as follows.


class StudentDataset:
    def __init__(self, num_students: int, seed: int):
        # Generates the full dataset during initialization using the
        # specified number of students and seed.
        ...

    def get_full_dataframe(self) -> pd.DataFrame:
        # Do not regenerate the dataset in different methods or cells.
        # Use this method to access the full dataset consistently.
        ...

    def generate_gender(self) -> list[str]: ...
    def generate_major(self) -> list[str]: ...
    def generate_program(self, majors: list[str]) -> list[str]: ...
    def generate_gpa(self, majors: list[str]) -> list[float]: ...

    # Assemble the full dataset from gender, major, program, and GPA.
    def assemble_dataframe(self) -> pd.DataFrame: ...
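To illustrate the intended generation logic, here is a minimal sketch of two of the
methods; the probabilities come from the tables above, while the class name
StudentDatasetSketch and storing a seeded numpy Generator as self.rng are assumptions of
this sketch, not requirements:

import numpy as np

class StudentDatasetSketch:
    """Illustrative sketch only, not the full required implementation."""

    GPA_PARAMS = {"B.Tech": (7.0, 1.0), "MS": (8.0, 0.7), "PhD": (8.3, 0.5)}

    def __init__(self, num_students: int, seed: int):
        self.num_students = num_students
        self.rng = np.random.default_rng(seed)  # single seeded generator

    def generate_major(self) -> list[str]:
        # Sample majors with the specified marginal probabilities.
        return list(self.rng.choice(["B.Tech", "MS", "PhD"],
                                    size=self.num_students,
                                    p=[0.70, 0.20, 0.10]))

    def generate_gpa(self, majors: list[str]) -> list[float]:
        # Draw a normal GPA per student conditioned on major, then clip.
        gpas = [self.rng.normal(*self.GPA_PARAMS[m]) for m in majors]
        return [float(g) for g in np.clip(gpas, 4.0, 10.0)]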

Q1.1 Dataset Analysis


(a) Visualizations [15 marks]
Create suitable visualizations for the following distributions. The visualizations should
convey meaningful information about the data.

• gender [1]

• major [1]

• program [1]

• GPA [1]

• program conditioned on major (e.g., a grouped bar chart) [1]

• GPA conditioned on major [1]

• GPA conditioned on program [1]

• GPA conditioned on program and major [2]

• gender, major, program and GPA of 100 randomly sampled students [3]

• Summary of the entire dataset (e.g., pairplots) [3]

You may implement the following methods:


def plot_gender_distribution(self) -> None: ...
def plot_major_distribution(self) -> None: ...
def plot_program_distribution(self) -> None: ...
def plot_gpa_distribution(self, bins: int = 20) -> None: ...
def plot_program_by_major(self) -> None: ...
def plot_gpa_by_major(self) -> None: ...
def plot_gpa_by_program(self) -> None: ...
def plot_gpa_by_program_and_major(self) -> None: ...
def plot_sampled_dataset(self) -> None: ...
def plot_entire_dataset_summary(self) -> None: ...
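As one example of what such a method might look like, here is a minimal sketch of the
grouped bar chart for program conditioned on major, written as a standalone function; it
assumes a DataFrame df with program and major columns and your email username:

import matplotlib.pyplot as plt
import pandas as pd

def plot_program_by_major(df: pd.DataFrame, username: str) -> None:
    """Grouped bar chart of program counts within each major (sketch)."""
    counts = df.groupby(["major", "program"]).size().unstack(fill_value=0)
    counts.plot(kind="bar")  # one group of bars per major
    plt.title("Program distribution conditioned on major")
    plt.xlabel("Major")
    plt.ylabel("Number of students")
    plt.legend(title="Program")
    plt.text(0.95, 0.95, username, ha='right', va='top',
             transform=plt.gca().transAxes, fontsize=10, color='gray', alpha=0.7)
    plt.show()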

(b) GPA Summary Statistics [1 mark]
Define a method to compute the mean and standard deviation of GPA:
def gpa_mean_std(self) -> tuple[float, float]: ...

Report the results and briefly comment on any observations.

(c) Program-Major Combinations [2 marks]


Define a method to count the number of students for each unique (program, major) pair.
Also write a method to visualize the counts with a heatmap.

def count_students_per_program_major_pair(self) -> pd.DataFrame: ...
def visualize_students_per_program_major_pair(self, counts_df: pd.DataFrame) -> None: ...

Report the counts and describe any patterns you observe.
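A minimal sketch of this pair of methods, written as standalone functions; the use of
seaborn and a DataFrame df with program and major columns are assumptions of this sketch:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def count_students_per_program_major_pair(df: pd.DataFrame) -> pd.DataFrame:
    """Counts per unique (program, major) pair as a program x major table (sketch)."""
    return df.groupby(["program", "major"]).size().unstack(fill_value=0)

def visualize_students_per_program_major_pair(counts_df: pd.DataFrame) -> None:
    """Annotated heatmap of the (program, major) counts (sketch)."""
    sns.heatmap(counts_df, annot=True, fmt="d", cmap="viridis")
    plt.title("Students per (program, major) pair")
    plt.xlabel("Major")
    plt.ylabel("Program")
    plt.show()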

Q1.2 Simple vs Stratified Sampling [5 marks]


• Sample 500 students uniformly at random and estimate the mean GPA; repeat 50 times
and report the mean and standard deviation of the estimates. [2]

• Repeat using stratified sampling by major. Compare results. [2]

• Which method has lower std deviation? Why? [1]

def get_gpa_mean_std_random(self, n: int = 500, repeats: int = 50) -> tuple[float, float]: ...
def get_gpa_mean_std_stratified(self, n: int = 500, repeats: int = 50) -> tuple[float, float]: ...
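One possible shape for the stratified variant, sketched as a standalone function; the
arguments df and rng (a seeded numpy Generator) and the proportional allocation across
majors are assumptions of this sketch:

import numpy as np
import pandas as pd

def get_gpa_mean_std_stratified(df: pd.DataFrame, rng: np.random.Generator,
                                n: int = 500, repeats: int = 50) -> tuple[float, float]:
    """Mean and std of the GPA sample means under stratified sampling by major (sketch)."""
    means = []
    for _ in range(repeats):
        # Each major stratum contributes in proportion to its share of the dataset.
        sample = df.groupby("major").sample(frac=n / len(df),
                                            random_state=int(rng.integers(2**32)))
        means.append(sample["GPA"].mean())
    return float(np.mean(means)), float(np.std(means))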

Q1.3 Gender-Balanced Cohort [5 marks]


• Sample 300 students with exactly equal representation across genders. Repeat 5 times.
Report gender counts. [1]

• Consider the following Sampling Strategy A: Randomly pick a value from a discrete
set of categories with equal probability (here, gender). Randomly pick a student from
that category. Sample 300 students using this sampling strategy. Repeat 5 times.
Report gender counts. [2]

• Repeat the above sampling process with the number of students set to 300, 600, 900,
1200, and 1500. Plot a histogram of the average maximum relative difference in gender
counts vs. the number of students sampled, averaged across 10 repeats. [2]

def get_gender_balanced_counts(self, n: int = 300, repeats: int = 5) -> list[dict[str, int]]: ...
def sample_gender_uniform_random(self, n: int = 300, repeats: int = 5) -> list[dict[str, int]]: ...
def plot_avg_max_gender_diff_vs_sample_size(self, sample_sizes: list[int], repeats: int = 10) -> None: ...
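A minimal sketch of Sampling Strategy A applied to gender, written as a standalone
function; df and rng (a seeded numpy Generator) are assumptions of this sketch:

import numpy as np
import pandas as pd

def sample_gender_uniform_random(df: pd.DataFrame, rng: np.random.Generator,
                                 n: int = 300) -> pd.DataFrame:
    """Strategy A (sketch): pick a gender uniformly at random, then a random
    student of that gender; repeat n times (here, with replacement)."""
    genders = df["gender"].unique()
    rows = []
    for _ in range(n):
        g = rng.choice(genders)                 # step 1: uniform over categories
        group = df[df["gender"] == g]
        rows.append(group.iloc[int(rng.integers(len(group)))])  # step 2: uniform student
    return pd.DataFrame(rows)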

Q1.4 GPA-Uniform Cohort [3 marks]


• Using Sampling Strategy A, select 100 students such that their GPA values are
approximately uniformly distributed across 10 bins. [1]

• Plot the GPA histogram of the sample and compare it to the original dataset's histogram. [1]

• Did you sample with or without replacement? Why? [1]

def sample_gpa_uniform(self, n: int = 100, bins: int = 10) -> pd.DataFrame: ...
def plot_gpa_histogram_comparison(self, sampled_df: pd.DataFrame) -> None: ...
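A minimal sketch of the binning idea behind sample_gpa_uniform, written as a standalone
function; equal-width bins via pd.cut, an equal per-bin quota, and the df/rng arguments
are assumptions of this sketch:

import numpy as np
import pandas as pd

def sample_gpa_uniform(df: pd.DataFrame, rng: np.random.Generator,
                       n: int = 100, bins: int = 10) -> pd.DataFrame:
    """Sample so GPA is roughly uniform across equal-width bins (sketch)."""
    per_bin = n // bins
    labels = pd.cut(df["GPA"], bins=bins)  # equal-width bins over the observed range
    pieces = []
    for _, group in df.groupby(labels, observed=True):
        # Sparse bins may hold fewer than per_bin students; take what exists.
        k = min(per_bin, len(group))
        pieces.append(group.sample(n=k, random_state=int(rng.integers(2**32))))
    return pd.concat(pieces)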

Q1.5 Program-Major Balanced Cohort [3 marks]


• Using Sampling Strategy A, select 60 students such that all valid (program, major)
combinations are represented approximately equally. [1]

• Show counts and heatmap. [1]

• Were any groups too small? How did you handle them? [1]

def sample_program_major_balanced(self, n: int) -> pd.DataFrame: ...
def show_program_major_counts_and_heatmap(self, sampled_df: pd.DataFrame) -> None: ...
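A minimal sketch of the balancing idea, written as a standalone function; sampling with
replacement inside groups smaller than their quota is one possible answer to the last
bullet, stated here as an assumption rather than the required choice:

import numpy as np
import pandas as pd

def sample_program_major_balanced(df: pd.DataFrame, rng: np.random.Generator,
                                  n: int) -> pd.DataFrame:
    """Roughly equal quota per valid (program, major) pair (sketch)."""
    groups = df.groupby(["program", "major"], observed=True)
    per_group = max(1, n // groups.ngroups)
    pieces = []
    for _, group in groups:
        # If a group is smaller than its quota, sample it with replacement.
        replace = len(group) < per_group
        pieces.append(group.sample(n=per_group, replace=replace,
                                   random_state=int(rng.integers(2**32))))
    return pd.concat(pieces)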

Q2.0 k-Nearest Neighbors [30 marks]


Use k-NN (from sklearn) to predict gender based on student features. First, implement the
following helper class for feature transformations.
class PerFeatureTransformer:
    def __init__(self):
        """Initializes memory for per-feature transformers."""
        ...

    def fit(self, df: pd.DataFrame, params: dict[str, str]) -> None:
        """Fits transformers for each feature based on the given type.

        Parameters:
            df: The dataframe containing features to be transformed.
            params: A dictionary mapping feature name to transformation type,
                e.g., {"GPA": "standard", "major": "ordinal", "program": "onehot"}.
        """
        ...

    def transform(self, df: pd.DataFrame) -> np.ndarray:
        """Applies the fitted transformers to the corresponding features
        and returns a NumPy array."""
        ...

    def fit_transform(self, df: pd.DataFrame, params: dict[str, str]) -> np.ndarray:
        """Fits and transforms all features in one step using the given
        transformation parameters."""
        ...
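One plausible realization of this class, sketched with standard sklearn preprocessing
transformers; the dispatch-by-name scheme and the choice of StandardScaler,
OrdinalEncoder, and OneHotEncoder are assumptions of this sketch (sparse_output requires
scikit-learn >= 1.2):

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

class PerFeatureTransformer:
    def __init__(self):
        """Initializes memory for per-feature transformers (sketch)."""
        self.transformers: dict = {}

    def fit(self, df: pd.DataFrame, params: dict[str, str]) -> None:
        # Map each requested transformation name to a fitted sklearn object.
        factory = {"standard": StandardScaler,
                   "ordinal": OrdinalEncoder,
                   "onehot": lambda: OneHotEncoder(sparse_output=False)}
        for feature, kind in params.items():
            self.transformers[feature] = factory[kind]().fit(df[[feature]])

    def transform(self, df: pd.DataFrame) -> np.ndarray:
        # Transform each feature with its fitted transformer, then concatenate.
        blocks = [t.transform(df[[f]]) for f, t in self.transformers.items()]
        return np.hstack(blocks)

    def fit_transform(self, df: pd.DataFrame, params: dict[str, str]) -> np.ndarray:
        self.fit(df, params)
        return self.transform(df)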

Now, implement the following class for predicting gender using KNN.
class KNNGenderPredictor:
    def __init__(self, student_df: pd.DataFrame, username: str):
        """Initializes the predictor with the full student dataset.
        Use the username for plots."""
        ...

    def train_val_test_split(self, test_size: float = 0.2, val_size: float = 0.2,
                             seed: int = 42) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
        ...

    def get_feature_matrix_and_labels(self, df: pd.DataFrame,
                                      features: list[str]) -> tuple[np.ndarray, np.ndarray]:
        """Extract selected features and gender labels from the DataFrame.
        Applies encoding to categorical variables and normalizes numeric
        features. Do not fit encoders or scalers on test data. Only
        transform using previously fitted ones."""
        ...

    def get_knn_accuracy_vs_k(self, k_values: list[int],
                              distance: str = "euclidean") -> list[float]:
        """Calculates accuracy scores for various k values on the validation set."""
        ...

    def plot_knn_accuracy_vs_k(self, k_values: list[int],
                               distance: str = "euclidean") -> None:
        """Plots accuracy scores against k values on the validation set."""
        ...

    def get_knn_f1_heatmap(self, k_values: list[int],
                           distances: list[str]) -> pd.DataFrame:
        """Returns a dataframe with the F1 score for each combination
        on the validation set."""
        ...

    def plot_knn_f1_heatmap(self, f1_scores_df: pd.DataFrame) -> None: ...

    def get_knn_f1_single_feature_table(self, k_values: list[int], features: list[str],
                                        distance: str = "euclidean") -> pd.DataFrame:
        """Creates a table of F1 scores on the test set using only a
        single feature for prediction."""
        ...
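For instance, the accuracy-vs-k computation can reduce to a short loop, sketched here as
a standalone function; KNeighborsClassifier and accuracy_score are standard sklearn APIs,
and the already-transformed arrays passed as arguments are assumptions of this sketch:

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy_vs_k(X_train: np.ndarray, y_train: np.ndarray,
                      X_val: np.ndarray, y_val: np.ndarray,
                      k_values: list[int], distance: str = "euclidean") -> list[float]:
    """Validation accuracy for each k under the given distance metric (sketch)."""
    scores = []
    for k in k_values:
        # sklearn accepts metric names such as "euclidean", "manhattan", "cosine".
        clf = KNeighborsClassifier(n_neighbors=k, metric=distance)
        clf.fit(X_train, y_train)
        scores.append(accuracy_score(y_val, clf.predict(X_val)))
    return scores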

Perform the following tasks.


• Split the dataset into train/val/test sets and apply the data transforms. [4]

• What value of k (odd values from 1 to 21) gave the highest accuracy on the validation
set with the Euclidean distance metric? Justify with a plot. [2]

• Repeat the above for other distance metrics such as Manhattan and cosine similarity. [4]

• Report the validation F1 score vs. k for all three distance metrics. [4]

• Plot a heatmap of F1 score over k × distance metric. [4]

• Which distance metric performs better? Why might that be? [2]

• Instead of using all student features, an alternative is to use a single feature for
prediction. Create an F1 score table where rows are the various values of k and columns
are the single features used. Report values on the test set for all the distance metrics. [6]

• Which single feature performed the best? How does it compare with the result using
all the features? Why? [4]

Q3.0 Linear Regression with Regularization [30 marks]
You will predict GPA using student features. Use a validation set to select the
hyperparameters.
Start with a function of the following form:
def run_poly_regression(X_train, y_train,
                        X_val, y_val,
                        X_test, y_test,
                        degree=1,
                        regularizer=None,
                        reg_strength=0.0):
    """
    Fit a polynomial regression model with optional regularization.

    Parameters:
        degree (int): Degree of the polynomial to fit
        regularizer (str or None): 'l1', 'l2', or None
        reg_strength (float): Regularization coefficient (alpha)

    Returns:
        dict with train, val, and test MSEs, and learned coefficients
    """

Perform the following tasks.

• For three setups (no regularization, L1 regularization, and L2 regularization), repeat
the steps below: [8×3=24]

– Fit polynomial regression models across degrees 1 to 6. [2]


– Plot polynomial degree vs MSE (on train and validation sets). Describe the trend
you observe as degree increases. [3]
– For each degree, use val MSE to choose the best regularization strength. [1]
– Plot regularization strength (log scale) vs val MSE for best degree. [2]

• Comment on performance improvement (if any) from regularization. Which overall
experimental setup (degree, regularizer) yielded the best test performance? [3]

• For the best setup using L1 regularization, which features had non-zero weights?
List the most important predictors for GPA. Repeat the same with L2 regularization.
Comment on the differences. [3]
