
Statistical Methods in AI - Monsoon 2025

Assignment 1
Deadline: 26 August 2025, 11:59 P.M.
Instructors: Prof. Ravi Kiran Sarvadevabhatla, Prof. Saikiran Bulusu

General Instructions
• Your assignment must be implemented in Python.
• Clearly label and organize your code, including comments that explain the purpose of
each section and key steps in your implementation. Ensure that your files are
well-structured, with headings, subheadings, and explanations as necessary.
• Make sure to test your code thoroughly before submission to avoid any runtime errors
or unexpected behavior.
• Your assignment will be evaluated not only based on correctness but also on the quality
of code, the clarity of explanations, and the extent to which you’ve understood and
applied the concepts covered in the course.
• We are aware that various hacks make it possible to submit the assignment late in
GitHub Classroom. We have measures in place to detect this, and anyone caught
attempting it will receive a straight zero on the assignment.

AI Tools Usage Instructions (Mandatory if Applicable)


We are aware of how easy it is to write code and solve questions with the help of LLM
services, but we strongly encourage you to figure out the answers yourself. If you use any
AI tools such as ChatGPT, Gemini, Claude, etc., to assist in solving any part of this
assignment:
• If you are unable to explain any part of the solution/code during evaluations,
that solution/code will be considered plagiarized and you will be penalized. You
must also be able to briefly explain how you modified or verified the AI-generated
content.
• You must include a shareable link to the AI Tool chat history or a screenshot
of the relevant conversation. Do NOT share any private or sensitive personal
information in the AI Tool conversations you include.

Submission Instructions
• Submit a self-contained Jupyter notebook containing all code, plots, and printed tables.

• Report all analysis, comparisons, and metrics in the notebook or in a separate report
that is part of the submission itself. No external links to cloud storage, wandb logs,
or any other alternatives will be accepted as part of your submission. Only the values
and visualizations committed to the repository will be graded.

• Use your institute email ID to generate a personalized random seed:

– Get the part before @ in your IIITH email, e.g., username from
username@program.iiit.ac.in.

– Use the SHA-256 hash of this string to ensure uniqueness:

import hashlib
seed = int(hashlib.sha256(username.encode()).hexdigest(), 16) % (2**32)

– Use this seed in all random number generators (e.g., np.random.default_rng(seed)).

• All plots must include your email username in the title or filename:

plt.text(
    0.95, 0.95, "username",
    ha='right', va='top',
    transform=plt.gca().transAxes,
    fontsize=10, color='gray', alpha=0.7
)
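Putting the two requirements above together, here is a minimal end-to-end sketch; the
username jane.doe and the example histogram are illustrative placeholders only:

import hashlib

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical username: use the part of your IIITH email before '@'.
username = "jane.doe"

# Personalized 32-bit seed from the SHA-256 hash of the username.
seed = int(hashlib.sha256(username.encode()).hexdigest(), 16) % (2**32)
rng = np.random.default_rng(seed)

# Example plot carrying the required username watermark.
plt.hist(rng.normal(size=1000), bins=30)
plt.title("Example histogram")
plt.xlabel("Value")
plt.ylabel("Count")
plt.text(0.95, 0.95, username, ha='right', va='top',
         transform=plt.gca().transAxes, fontsize=10, color='gray', alpha=0.7)
plt.show()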

Submission Policy (GitHub Classroom Assignments)


To encourage consistent progress and discourage last-minute submissions, the following
policy applies:

• Minimum Progress Requirement: You must push at least two meaningful commits
on different days prior to the deadline.

• Commit Timing Check: If over 80% of commits are made within the last 24 hours
before the deadline, a 10% penalty will be applied.

• Commit Quality: Commits must reflect actual progress. Non-informative or
placeholder commits will not count.

• Final Submission: Your final grade is based on the latest commit before the deadline,
but commit history will be reviewed.

Guidelines for Implementation and Code Design
1. Use object-oriented programming. For Q1, all sampling and analysis must use the
original dataset stored in the StudentDataset object.

2. Use appropriate visualization libraries such as matplotlib, seaborn, or plotly. Each
plot must include a title, x-label, y-label, and legend (if applicable). Answers
with missing labels or legends will receive 0 marks.

3. Keep visualization and computation logic in separate functions.

4. Add docstrings to methods explaining what they do.

5. Use reproducible random sampling using the given seed.

Q1.0 Dataset Generation [6 marks]


Generate 10,000 student records with the following attributes:
• gender: Male (65%), Female (33%), Other (2%) [1]

• major: B.Tech (70%), MS (20%), PhD (10%) [1]

• program: distribution conditioned on major: [2]

Major    CSE    ECE    CHD    CND
B.Tech   40%    40%    10%    10%
MS       30%    30%    20%    20%
PhD      25%    25%    25%    25%

• GPA: Normally distributed by major, clipped to [4.0, 10.0] [2]

Major    GPA Distribution
B.Tech   N(7.0, 1.0)
MS       N(8.0, 0.7)
PhD      N(8.3, 0.5)

You must implement a class as follows.


class StudentDataset:
    def __init__(self, num_students: int, seed: int):
        # Generates the full dataset during initialization using the
        # specified number of students and seed.
        ...

    def get_full_dataframe(self) -> pd.DataFrame:
        # Do not regenerate the dataset in different methods or cells.
        # Use this method to access the full dataset consistently.
        ...

    def generate_gender(self) -> list[str]: ...
    def generate_major(self) -> list[str]: ...
    def generate_program(self, majors: list[str]) -> list[str]: ...
    def generate_gpa(self, majors: list[str]) -> list[float]: ...

    # Assemble the full dataset from gender, major, program, and GPA.
    def assemble_dataframe(self) -> pd.DataFrame: ...
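To illustrate the intended generation logic, here is a minimal sketch of two of the
methods; the probabilities come from the tables above, while the class name
StudentDatasetSketch and storing a seeded numpy Generator as self.rng are assumptions of
this sketch, not requirements:

import numpy as np

class StudentDatasetSketch:
    """Illustrative sketch only, not the full required implementation."""

    GPA_PARAMS = {"B.Tech": (7.0, 1.0), "MS": (8.0, 0.7), "PhD": (8.3, 0.5)}

    def __init__(self, num_students: int, seed: int):
        self.num_students = num_students
        self.rng = np.random.default_rng(seed)  # single seeded generator

    def generate_major(self) -> list[str]:
        # Sample majors with the specified marginal probabilities.
        return list(self.rng.choice(["B.Tech", "MS", "PhD"],
                                    size=self.num_students,
                                    p=[0.70, 0.20, 0.10]))

    def generate_gpa(self, majors: list[str]) -> list[float]:
        # Draw a normal GPA per student conditioned on major, then clip.
        gpas = [self.rng.normal(*self.GPA_PARAMS[m]) for m in majors]
        return [float(g) for g in np.clip(gpas, 4.0, 10.0)]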

Q1.1 Dataset Analysis


(a) Visualizations [15 marks]
Create suitable visualizations for the following distributions. The visualizations should
convey meaningful information about the data.

• gender [1]

• major [1]

• program [1]

• GPA [1]

• program conditioned on major (e.g., a grouped bar chart) [1]

• GPA conditioned on major [1]

• GPA conditioned on program [1]

• GPA conditioned on program and major [2]

• gender, major, program and GPA of 100 randomly sampled students [3]

• Summary of the entire dataset (e.g., pairplots) [3]

You may implement the following methods:


def plot_gender_distribution(self) -> None: ...
def plot_major_distribution(self) -> None: ...
def plot_program_distribution(self) -> None: ...
def plot_gpa_distribution(self, bins: int = 20) -> None: ...
def plot_program_by_major(self) -> None: ...
def plot_gpa_by_major(self) -> None: ...
def plot_gpa_by_program(self) -> None: ...
def plot_gpa_by_program_and_major(self) -> None: ...
def plot_sampled_dataset(self) -> None: ...
def plot_entire_dataset_summary(self) -> None: ...
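As one example of what such a method might look like, here is a minimal sketch of the
grouped bar chart for program conditioned on major, written as a standalone function; it
assumes a DataFrame df with program and major columns and your email username:

import matplotlib.pyplot as plt
import pandas as pd

def plot_program_by_major(df: pd.DataFrame, username: str) -> None:
    """Grouped bar chart of program counts within each major (sketch)."""
    counts = df.groupby(["major", "program"]).size().unstack(fill_value=0)
    counts.plot(kind="bar")  # one group of bars per major
    plt.title("Program distribution conditioned on major")
    plt.xlabel("Major")
    plt.ylabel("Number of students")
    plt.legend(title="Program")
    plt.text(0.95, 0.95, username, ha='right', va='top',
             transform=plt.gca().transAxes, fontsize=10, color='gray', alpha=0.7)
    plt.show()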

(b) GPA Summary Statistics [1 mark]
Define a method to compute the mean and standard deviation of GPA:
def gpa_mean_std(self) -> tuple[float, float]: ...

Report the results and briefly comment on any observations.

(c) Program-Major Combinations [2 marks]


Define a method to count the number of students for each unique (program, major) pair.
Also write a method to visualize the counts with a heatmap.

def count_students_per_program_major_pair(self) -> pd.DataFrame: ...
def visualize_students_per_program_major_pair(self, counts_df: pd.DataFrame) -> None: ...

Report the counts and describe any patterns you observe.
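A minimal sketch of this pair of methods, written as standalone functions; the use of
seaborn and a DataFrame df with program and major columns are assumptions of this sketch:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def count_students_per_program_major_pair(df: pd.DataFrame) -> pd.DataFrame:
    """Counts per unique (program, major) pair as a program x major table (sketch)."""
    return df.groupby(["program", "major"]).size().unstack(fill_value=0)

def visualize_students_per_program_major_pair(counts_df: pd.DataFrame) -> None:
    """Annotated heatmap of the (program, major) counts (sketch)."""
    sns.heatmap(counts_df, annot=True, fmt="d", cmap="viridis")
    plt.title("Students per (program, major) pair")
    plt.xlabel("Major")
    plt.ylabel("Program")
    plt.show()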

Q1.2 Simple vs Stratified Sampling [5 marks]


• Sample 500 students uniformly at random and estimate the mean GPA; repeat 50 times
and report the mean and standard deviation of the estimates. [2]

• Repeat using stratified sampling by major. Compare results. [2]

• Which method has lower std deviation? Why? [1]

def get_gpa_mean_std_random(self, n: int = 500, repeats: int = 50) -> tuple[float, float]: ...
def get_gpa_mean_std_stratified(self, n: int = 500, repeats: int = 50) -> tuple[float, float]: ...
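One possible shape for the stratified variant, sketched as a standalone function; the
arguments df and rng (a seeded numpy Generator) and the proportional allocation across
majors are assumptions of this sketch:

import numpy as np
import pandas as pd

def get_gpa_mean_std_stratified(df: pd.DataFrame, rng: np.random.Generator,
                                n: int = 500, repeats: int = 50) -> tuple[float, float]:
    """Mean and std of the GPA sample means under stratified sampling by major (sketch)."""
    means = []
    for _ in range(repeats):
        # Each major stratum contributes in proportion to its share of the dataset.
        sample = df.groupby("major").sample(frac=n / len(df),
                                            random_state=int(rng.integers(2**32)))
        means.append(sample["GPA"].mean())
    return float(np.mean(means)), float(np.std(means))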

Q1.3 Gender-Balanced Cohort [5 marks]


• Sample 300 students with exactly equal representation across genders. Repeat 5 times.
Report gender counts. [1]

• Consider the following Sampling Strategy A: Randomly pick a value from a discrete
set of categories with equal probability (here, gender). Randomly pick a student from
that category. Sample 300 students using this sampling strategy. Repeat 5 times.
Report gender counts. [2]

• Repeat the above sampling process with the number of students set to 300, 600, 900,
1200, and 1500. Plot a histogram of the average maximum relative difference in gender
counts vs. the number of students sampled, averaged across 10 repeats. [2]

def get_gender_balanced_counts(self, n: int = 300, repeats: int = 5) -> list[dict[str, int]]: ...
def sample_gender_uniform_random(self, n: int = 300, repeats: int = 5) -> list[dict[str, int]]: ...
def plot_avg_max_gender_diff_vs_sample_size(self, sample_sizes: list[int], repeats: int = 10) -> None: ...
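A minimal sketch of Sampling Strategy A applied to gender, written as a standalone
function; df and rng (a seeded numpy Generator) are assumptions of this sketch:

import numpy as np
import pandas as pd

def sample_gender_uniform_random(df: pd.DataFrame, rng: np.random.Generator,
                                 n: int = 300) -> pd.DataFrame:
    """Strategy A (sketch): pick a gender uniformly at random, then a random
    student of that gender; repeat n times (here, with replacement)."""
    genders = df["gender"].unique()
    rows = []
    for _ in range(n):
        g = rng.choice(genders)                 # step 1: uniform over categories
        group = df[df["gender"] == g]
        rows.append(group.iloc[int(rng.integers(len(group)))])  # step 2: uniform student
    return pd.DataFrame(rows)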

Q1.4 GPA-Uniform Cohort [3 marks]


• Using Sampling Strategy A, select 100 students such that their GPA values are
approximately uniformly distributed across 10 bins. [1]

• Plot the GPA histogram of the sample and compare it to the original dataset's histogram. [1]

• Did you sample with or without replacement? Why? [1]

def sample_gpa_uniform(self, n: int = 100, bins: int = 10) -> pd.DataFrame: ...
def plot_gpa_histogram_comparison(self, sampled_df: pd.DataFrame) -> None: ...
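A minimal sketch of the binning idea behind sample_gpa_uniform, written as a standalone
function; equal-width bins via pd.cut, an equal per-bin quota, and the df/rng arguments
are assumptions of this sketch:

import numpy as np
import pandas as pd

def sample_gpa_uniform(df: pd.DataFrame, rng: np.random.Generator,
                       n: int = 100, bins: int = 10) -> pd.DataFrame:
    """Sample so GPA is roughly uniform across equal-width bins (sketch)."""
    per_bin = n // bins
    labels = pd.cut(df["GPA"], bins=bins)  # equal-width bins over the observed range
    pieces = []
    for _, group in df.groupby(labels, observed=True):
        # Sparse bins may hold fewer than per_bin students; take what exists.
        k = min(per_bin, len(group))
        pieces.append(group.sample(n=k, random_state=int(rng.integers(2**32))))
    return pd.concat(pieces)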

Q1.5 Program-Major Balanced Cohort [3 marks]


• Using Sampling Strategy A, select 60 students such that all valid (program, major)
combinations are represented approximately equally. [1]

• Show counts and heatmap. [1]

• Were any groups too small? How did you handle them? [1]

def sample_program_major_balanced(self, n: int) -> pd.DataFrame: ...
def show_program_major_counts_and_heatmap(self, sampled_df: pd.DataFrame) -> None: ...
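A minimal sketch of the balancing idea, written as a standalone function; sampling with
replacement inside groups smaller than their quota is one possible answer to the last
bullet, stated here as an assumption rather than the required choice:

import numpy as np
import pandas as pd

def sample_program_major_balanced(df: pd.DataFrame, rng: np.random.Generator,
                                  n: int) -> pd.DataFrame:
    """Roughly equal quota per valid (program, major) pair (sketch)."""
    groups = df.groupby(["program", "major"], observed=True)
    per_group = max(1, n // groups.ngroups)
    pieces = []
    for _, group in groups:
        # If a group is smaller than its quota, sample it with replacement.
        replace = len(group) < per_group
        pieces.append(group.sample(n=per_group, replace=replace,
                                   random_state=int(rng.integers(2**32))))
    return pd.concat(pieces)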

Q2.0 k-Nearest Neighbors [30 marks]


Use k-NN (from sklearn) to predict gender based on student features. First, implement the
following helper class for feature transformations.
class PerFeatureTransformer:
    def __init__(self):
        """Initializes memory for per-feature transformers."""
        ...

    def fit(self, df: pd.DataFrame, params: dict[str, str]) -> None:
        """Fits transformers for each feature based on the given type.

        Parameters:
            df: The dataframe containing features to be transformed.
            params: A dictionary mapping feature name to transformation type,
                e.g., {"GPA": "standard", "major": "ordinal", "program": "onehot"}.
        """
        ...

    def transform(self, df: pd.DataFrame) -> np.ndarray:
        """Applies the fitted transformers to the corresponding features
        and returns a NumPy array."""
        ...

    def fit_transform(self, df: pd.DataFrame, params: dict[str, str]) -> np.ndarray:
        """Fits and transforms all features in one step using the given
        transformation parameters."""
        ...
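One plausible realization of this class, sketched with standard sklearn preprocessing
transformers; the dispatch-by-name scheme and the choice of StandardScaler,
OrdinalEncoder, and OneHotEncoder are assumptions of this sketch (sparse_output requires
scikit-learn >= 1.2):

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

class PerFeatureTransformer:
    def __init__(self):
        """Initializes memory for per-feature transformers (sketch)."""
        self.transformers: dict = {}

    def fit(self, df: pd.DataFrame, params: dict[str, str]) -> None:
        # Map each requested transformation name to a fitted sklearn object.
        factory = {"standard": StandardScaler,
                   "ordinal": OrdinalEncoder,
                   "onehot": lambda: OneHotEncoder(sparse_output=False)}
        for feature, kind in params.items():
            self.transformers[feature] = factory[kind]().fit(df[[feature]])

    def transform(self, df: pd.DataFrame) -> np.ndarray:
        # Transform each feature with its fitted transformer, then concatenate.
        blocks = [t.transform(df[[f]]) for f, t in self.transformers.items()]
        return np.hstack(blocks)

    def fit_transform(self, df: pd.DataFrame, params: dict[str, str]) -> np.ndarray:
        self.fit(df, params)
        return self.transform(df)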

Now, implement the following class for predicting gender using KNN.
class KNNGenderPredictor:
    def __init__(self, student_df: pd.DataFrame, username: str):
        """Initializes the predictor with the full student dataset.
        Use the username for plots."""
        ...

    def train_val_test_split(self, test_size: float = 0.2, val_size: float = 0.2,
                             seed: int = 42) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
        ...

    def get_feature_matrix_and_labels(self, df: pd.DataFrame,
                                      features: list[str]) -> tuple[np.ndarray, np.ndarray]:
        """Extract selected features and gender labels from the DataFrame.
        Applies encoding to categorical variables and normalizes numeric
        features. Do not fit encoders or scalers on test data. Only
        transform using previously fitted ones."""
        ...

    def get_knn_accuracy_vs_k(self, k_values: list[int],
                              distance: str = "euclidean") -> list[float]:
        """Calculates accuracy scores for various k values on the validation set."""
        ...

    def plot_knn_accuracy_vs_k(self, k_values: list[int],
                               distance: str = "euclidean") -> None:
        """Plots accuracy scores against k values on the validation set."""
        ...

    def get_knn_f1_heatmap(self, k_values: list[int],
                           distances: list[str]) -> pd.DataFrame:
        """Returns a dataframe with the F1 score for each combination
        on the validation set."""
        ...

    def plot_knn_f1_heatmap(self, f1_scores_df: pd.DataFrame) -> None: ...

    def get_knn_f1_single_feature_table(self, k_values: list[int], features: list[str],
                                        distance: str = "euclidean") -> pd.DataFrame:
        """Creates a table of F1 scores on the test set using only a
        single feature for prediction."""
        ...
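For instance, the accuracy-vs-k computation can reduce to a short loop, sketched here as
a standalone function; KNeighborsClassifier and accuracy_score are standard sklearn APIs,
and the already-transformed arrays passed as arguments are assumptions of this sketch:

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy_vs_k(X_train: np.ndarray, y_train: np.ndarray,
                      X_val: np.ndarray, y_val: np.ndarray,
                      k_values: list[int], distance: str = "euclidean") -> list[float]:
    """Validation accuracy for each k under the given distance metric (sketch)."""
    scores = []
    for k in k_values:
        # sklearn accepts metric names such as "euclidean", "manhattan", "cosine".
        clf = KNeighborsClassifier(n_neighbors=k, metric=distance)
        clf.fit(X_train, y_train)
        scores.append(accuracy_score(y_val, clf.predict(X_val)))
    return scores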

Perform the following tasks.


• Split the dataset into train/val/test sets and apply the data transforms. [4]

• What value of k (odd values from 1 to 21) gave the highest accuracy on the validation
set with the Euclidean distance metric? Justify with a plot. [2]

• Repeat the above for other distance metrics such as Manhattan and cosine similarity. [4]

• Report the validation F1 score vs. k for all three distance metrics. [4]

• Plot a heatmap of F1 score over k × distance metric. [4]

• Which distance metric performs better? Why might that be? [2]

• Instead of using all student features, an alternative is to use a single feature for
prediction. Create an F1 score table where rows are the various values of k and columns
are the single features used. Report values on the test set for all the distance metrics. [6]

• Which single feature performed the best? How does it compare with the result using
all the features? Why? [4]

Q3.0 Linear Regression with Regularization [30 marks]
You will predict GPA using student features. Use a validation set to select the
hyperparameters.
Start with a function of the following form:
def run_poly_regression(X_train, y_train,
                        X_val, y_val,
                        X_test, y_test,
                        degree=1,
                        regularizer=None,
                        reg_strength=0.0):
    """
    Fit a polynomial regression model with optional regularization.

    Parameters:
        degree (int): Degree of the polynomial to fit
        regularizer (str or None): 'l1', 'l2', or None
        reg_strength (float): Regularization coefficient (alpha)

    Returns:
        dict with train, val, and test MSEs, and learned coefficients
    """

Perform the following tasks.

• For three setups (no regularization, L1 regularization, and L2 regularization), repeat
the steps below: [8×3=24]

– Fit polynomial regression models across degrees 1 to 6. [2]


– Plot polynomial degree vs MSE (on train and validation sets). Describe the trend
you observe as degree increases. [3]
– For each degree, use val MSE to choose the best regularization strength. [1]
– Plot regularization strength (log scale) vs val MSE for best degree. [2]

• Comment on performance improvement (if any) from regularization. Which overall
experimental setup (degree, regularizer) yielded the best test performance? [3]

• For the best setup using L1 regularization, which features had non-zero weights?
List the most important predictors for GPA. Repeat the same with L2 regularization.
Comment on the differences. [3]
