ML Assignment
Theory and Numericals
1. Analyze the limitations of traditional neural networks.
2. A classifier achieves the following confusion matrix on a test dataset:
                     Predicted Positive    Predicted Negative
   Actual Positive           40                    10
   Actual Negative           20                    30
Calculate precision, recall, F1 score, and accuracy.
Derive the mathematical formula for the F1 score and explain its relationship with
precision and recall.
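As a way to check the hand calculation, the four metrics can be computed directly from the confusion-matrix counts. The sketch below assumes the rows of the matrix are the actual classes and the columns are the predicted classes, so TP = 40, FN = 10, FP = 20, TN = 30. The function name classification_metrics is illustrative and not part of the assignment. F1 is computed as the harmonic mean of precision and recall, 2PR / (P + R), which simplifies algebraically to 2TP / (2TP + FP + FN).

```python
def classification_metrics(tp, fn, fp, tn):
    """Return precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts read from the matrix, assuming rows = actual and columns = predicted.
metrics = classification_metrics(tp=40, fn=10, fp=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch does not guard against division by zero, which would occur if the classifier predicted no positives or the data contained no positives.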
3. A dataset contains the following transactions:
o T1: {A, B, C}
o T2: {A, C}
o T3: {A, B}
o T4: {A, B, C, D}
o T5: {B, C, D}
Find all frequent itemsets using the Apriori algorithm with a minimum support of 0.6.
o Extend the above example to generate association rules with a minimum confidence of 0.8.
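The level-wise search can be sketched in a few lines of standard-library Python, which may help when verifying a hand-worked answer. This is a minimal illustration and not a reference implementation. The helper names (support, apriori, rules) and the small tolerance EPS used for threshold comparisons are assumptions introduced here.

```python
from itertools import combinations

transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "B"},
    {"A", "B", "C", "D"},
    {"B", "C", "D"},
]
MIN_SUPPORT = 0.6
MIN_CONFIDENCE = 0.8
EPS = 1e-12  # tolerance for floating-point comparisons at the threshold


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def apriori():
    """Return {frozenset: support} for all frequent itemsets, level by level."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        survivors = {c: support(c) for c in level if support(c) >= MIN_SUPPORT - EPS}
        frequent.update(survivors)
        k += 1
        # Join step: unions of surviving sets that have exactly k items.
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = [
            c for c in candidates
            if all(frozenset(s) in survivors for s in combinations(c, k - 1))
        ]
    return frequent


def rules(frequent):
    """Yield (antecedent, consequent, confidence) for rules meeting MIN_CONFIDENCE."""
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = sup / frequent[antecedent]
                if confidence >= MIN_CONFIDENCE - EPS:
                    yield antecedent, itemset - antecedent, confidence


frequent_itemsets = apriori()
strong_rules = list(rules(frequent_itemsets))
for itemset, sup in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(sup, 2))
print("rules:", strong_rules)
```

Confidence for a rule X -> Y is computed as support(X and Y together) / support(X). Changing MIN_SUPPORT or MIN_CONFIDENCE at the top shows how sensitive the result is to the two thresholds.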
4. Decision Trees
5. Open-ended questions
• Why is it not always ideal to achieve zero training error in a machine learning model?
Explain with examples.
• If adding more data does not improve the performance of a machine learning model, what
could be the reasons? Propose solutions.
• Can a classifier with 100% accuracy always be considered the best? Discuss scenarios
where this may not hold true.
• How does feature scaling impact the performance of algorithms like K-Nearest
Neighbors and Support Vector Machines? Provide insights.
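For the feature-scaling question, a small experiment can make the effect on distance-based methods concrete. The sketch below uses four hypothetical (age, income) records invented for illustration and compares the nearest neighbour of the first record before and after min-max scaling. The helper names min_max_scale and nearest are assumptions, not part of any library.

```python
import math


def min_max_scale(rows):
    """Rescale each column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # guard against constant columns
    return [tuple((v - lo) / s for v, lo, s in zip(row, lows, spans)) for row in rows]


def nearest(query_index, rows):
    """Index of the row closest to rows[query_index] by Euclidean distance."""
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[query_index], rows[i]))


# Hypothetical (age in years, income in dollars) records; row 0 is the query.
people = [(25, 50_000), (26, 51_000), (60, 50_200), (40, 90_000)]

nearest_raw = nearest(0, people)                        # income dominates the distance
nearest_scaled = nearest(0, min_max_scale(people))      # both features contribute
print("nearest before scaling:", people[nearest_raw])
print("nearest after scaling: ", people[nearest_scaled])
```

With raw values the income column, measured in tens of thousands, swamps the age column, so the neighbour chosen can differ from the one chosen once both features share a common range.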
6. K-Means Clustering Process:
Using the given data [2, 4, 10, 12, 3, 20, 30, 11, 25], perform K-Means clustering with K = 2.
Tasks:
1. Perform two iterations of the K-Means algorithm and report the cluster
assignments after each iteration.
2. Calculate the final centroids of the clusters.
3. Explain why the clusters remain stable or change during each iteration.
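The assignment and update steps can be traced with a short one-dimensional sketch. The problem statement does not give initial centroids, so the example below assumes the first two data points (2 and 4) as starting centroids; a different initialisation can lead to different assignments. The function name kmeans_1d is illustrative, and ties in distance are resolved in favour of the lower-indexed centroid.

```python
def kmeans_1d(points, centroids, iterations):
    """Run the K-Means assign/update loop on 1-D data.

    Returns a list with one (assignments, centroids) pair per iteration.
    """
    history = []
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid (ties go to the lower index).
        assignments = [
            min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            new_centroids.append(sum(members) / len(members) if members else old)
        centroids = new_centroids
        history.append((assignments, centroids))
    return history


data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
# Assumed initial centroids: the first two data points.
history = kmeans_1d(data, centroids=[2, 4], iterations=2)
for step, (assignments, centroids) in enumerate(history, start=1):
    print(f"iteration {step}: assignments={assignments} centroids={centroids}")
```

Running more iterations until the assignments stop changing shows when the algorithm has converged for a given initialisation.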
Coding Assignment
GitHub API Community Clustering
Write a Python script to:
1. Data Collection:
o Use the GitHub API to fetch user profile data for a list of users.
o Collect information about each user's repositories, programming languages, and
followers.
2. Data Processing:
o Create a dataset where each user is represented by the programming languages
they most frequently use.
o Encode the programming languages as features.
3. Clustering:
o Apply K-Means clustering to group users based on their programming language
preferences.
o Visualize the clusters using a 2D scatter plot (if dimensionality reduction is
needed, use PCA).
4. Community Insights:
o Identify the main programming languages in each cluster.
o Provide a brief analysis of how the clusters represent communities of users who
code in similar languages.
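One possible starting point for the data collection and processing steps is sketched below using only the standard library. It relies on the GitHub REST endpoints /users/{username} and /users/{username}/repos, where each repository object carries a primary "language" field that may be null. The function names, the User-Agent string, the optional token parameter, and the choice to represent each user by the share of repositories per language are all assumptions made for illustration. Only the first page of up to 100 repositories is read.

```python
import json
import urllib.request
from collections import Counter

API_ROOT = "https://api.github.com"


def get_json(url, token=None):
    """GET a GitHub API URL and decode the JSON body."""
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "community-clustering-sketch",  # hypothetical client name
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def fetch_user(username, token=None):
    """Return follower count and per-language repository counts for one user."""
    profile = get_json(f"{API_ROOT}/users/{username}", token)
    repos = get_json(f"{API_ROOT}/users/{username}/repos?per_page=100", token)
    # Count repositories by primary language, skipping repos with no language set.
    languages = Counter(r["language"] for r in repos if r.get("language"))
    return {"login": username, "followers": profile.get("followers", 0), "languages": languages}


def encode_languages(users):
    """Turn per-user language counts into row-normalised feature vectors."""
    vocabulary = sorted({lang for u in users for lang in u["languages"]})
    matrix = []
    for u in users:
        total = sum(u["languages"].values()) or 1  # avoid dividing by zero
        matrix.append([u["languages"].get(lang, 0) / total for lang in vocabulary])
    return vocabulary, matrix


# Offline example with hypothetical records shaped like fetch_user() output.
sample_users = [
    {"login": "user_a", "followers": 3, "languages": Counter({"Python": 3, "C": 1})},
    {"login": "user_b", "followers": 0, "languages": Counter({"Go": 2})},
]
vocabulary, matrix = encode_languages(sample_users)
print(vocabulary)
print(matrix)
```

In a real run, the sample records would be replaced by [fetch_user(name) for name in usernames]. Unauthenticated requests to the GitHub API are rate limited, so passing a personal access token as token is usually necessary for more than a handful of users.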
Deliverables:
• Python script (.py file) with clear comments and modular code.
• A README file explaining how to run the script and interpret the results.
• A visualization of the clusters and summary insights.
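For the clustering, visualization, and insight steps, a minimal sketch with scikit-learn might look like the following. The feature matrix X is hypothetical data standing in for the encoded language shares, and the language list, the choice of two clusters, the helper names top_languages and plot_clusters, and the output file name are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical language-share matrix: rows are users, columns are languages.
languages = ["C", "JavaScript", "Python", "TypeScript"]
X = np.array([
    [0.1, 0.0, 0.9, 0.0],
    [0.2, 0.0, 0.8, 0.0],
    [0.0, 0.6, 0.0, 0.4],
    [0.0, 0.5, 0.1, 0.4],
    [0.0, 0.7, 0.0, 0.3],
    [0.3, 0.0, 0.7, 0.0],
])

# Group users by language preference; K=2 is an assumed value for this toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Project the 4-D feature vectors to 2-D for plotting.
coords = PCA(n_components=2).fit_transform(X)


def top_languages(cluster, n=2):
    """Languages with the largest mean share among users in `cluster`."""
    mean_share = X[labels == cluster].mean(axis=0)
    return [languages[i] for i in np.argsort(mean_share)[::-1][:n]]


def plot_clusters(path="clusters.png"):
    """Save a 2-D scatter plot of the PCA coordinates, coloured by cluster."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(coords[:, 0], coords[:, 1], c=labels)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.set_title("Users clustered by language share")
    fig.savefig(path)


for cluster in sorted(set(labels.tolist())):
    print(f"cluster {cluster}: {top_languages(cluster)}")
```

Calling plot_clusters() writes the scatter plot to disk. On real data, the number of clusters would need to be chosen deliberately, for example by comparing inertia or silhouette scores across several values of K.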