AIML MODULE PROJECT
©Great Learning. Proprietary content. All Rights Reserved. Unauthorised use or distribution prohibited
Supervised Learning
TOTAL SCORE: 60
Part A - 30 Marks
• DOMAIN: Medical
• CONTEXT: Medical research university X is conducting in-depth research on patients with certain conditions. The university has an internal AI team.
Due to confidentiality, the patients’ details and conditions are masked by the client, who provides different datasets to the AI team for
developing an AIML model that can predict the condition of a patient from the received test results.
• DATA DESCRIPTION: The data consists of biomechanics features of the patients according to their current conditions. Each patient is
represented in the data set by six biomechanics attributes derived from the shape and orientation of the condition to their body part.
• PROJECT OBJECTIVE: To demonstrate the ability to fetch, process and leverage data to generate useful predictions by training Supervised
Learning algorithms.
• STEPS AND TASK [30 Marks]:
1. Data Understanding: [5 Marks]
A. Read all the 3 CSV files as DataFrame and store them into 3 separate variables. [1 Mark]
B. Print Shape and columns of all the 3 DataFrames. [1 Mark]
C. Compare Column names of all the 3 DataFrames and clearly write observations. [1 Mark]
D. Print DataTypes of all the 3 DataFrames. [1 Mark]
E. Observe and share variation in ‘Class’ feature of all the 3 DataFrames. [1 Mark]
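The Data Understanding steps above could be sketched as follows. This is a minimal illustration, not a reference solution: the three CSV files are replaced by small in-memory strings with invented values, the variable names (`csv_a`, `frames`, `same_columns`, `class_labels`) are hypothetical, and only the column names ‘P_incidence’, ‘S_slope’ and ‘Class’ come from this brief.

```python
import io

import pandas as pd

# Invented stand-ins for the three CSV files; real files would be read
# with pd.read_csv("<path>") instead of io.StringIO.
csv_a = "P_incidence,S_slope,Class\n38.5,28.7,type_s\n54.9,36.0,type_s\n"
csv_b = "P_incidence,S_slope,Class\n63.0,40.5,Type_S\n44.3,31.2,Type_S\n"
csv_c = "P_incidence,S_slope,Class\n74.4,50.1,Type_S\n89.7,56.0,tp_s\n"

frames = {
    name: pd.read_csv(io.StringIO(text))
    for name, text in [("df1", csv_a), ("df2", csv_b), ("df3", csv_c)]
}

for name, df in frames.items():
    print(name, "shape:", df.shape, "columns:", list(df.columns))
    print(df.dtypes)                   # data type of every column
    print(df["Class"].value_counts())  # spelling variations of the label

# Compare column names across the three DataFrames.
reference = list(frames["df1"].columns)
same_columns = all(list(df.columns) == reference for df in frames.values())

# Collect the distinct 'Class' spellings found in each DataFrame.
class_labels = {name: sorted(df["Class"].unique()) for name, df in frames.items()}
print("Same columns everywhere:", same_columns)
print(class_labels)
```

Printing `value_counts()` per DataFrame is one way to make spelling variations in ‘Class’ visible before unifying them.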
2. Data Preparation and Exploration: [5 Marks]
A. Unify all the variations in ‘Class’ feature for all the 3 DataFrames. [1 Mark]
For Example: ‘tp_s’, ‘Type_S’, ‘type_s’ should be converted to ‘type_s’
B. Combine all the 3 DataFrames to form a single DataFrame. [1 Mark]
Checkpoint: Expected Output shape = (310,7)
C. Print 5 random samples of this DataFrame. [1 Mark]
D. Print Feature-wise percentage of Null values. [1 Mark]
E. Check 5-point summary of the new DataFrame. [1 Mark]
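One possible way to approach the Data Preparation and Exploration steps is sketched below. It assumes two tiny invented DataFrames instead of the three real ones, so the result does not reach the (310,7) checkpoint; the mapping `class_map` covers only the ‘type_s’ example given above and would need entries for every other variation found in the real data.

```python
import pandas as pd

# Invented rows; only the column names come from the brief.
df1 = pd.DataFrame({
    "P_incidence": [38.5, 54.9, 61.2],
    "S_slope": [28.7, 36.0, 39.9],
    "Class": ["tp_s", "Type_S", "type_s"],
})
df2 = pd.DataFrame({
    "P_incidence": [63.0, None, 70.1],
    "S_slope": [40.5, 31.2, 45.3],
    "Class": ["type_s", "Type_S", "tp_s"],
})

# Map every observed variation onto one canonical label.
class_map = {"tp_s": "type_s", "Type_S": "type_s"}
for df in (df1, df2):
    df["Class"] = df["Class"].replace(class_map)

# Stack the DataFrames row-wise and rebuild a clean 0..n-1 index.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined.shape)

sample = combined.sample(n=5, random_state=0)  # 5 random rows
null_pct = combined.isnull().mean() * 100      # % of nulls per feature
summary = combined.describe()                  # min, quartiles, max, etc.

print(sample)
print(null_pct)
print(summary)
```

The rows labelled `min`, `25%`, `50%`, `75%` and `max` in the `describe()` output form the 5-point summary.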
3. Data Analysis: [10 Marks]
A. Visualize a heatmap to understand correlation between all features [2 Marks]
B. Share insights on correlation. [2 Marks]
i. Features having stronger correlation, with correlation value.
ii. Features having weaker correlation, with correlation value.
C. Visualize a pairplot with 3 classes distinguished by colors and share insights. [2 Marks]
D. Visualize a jointplot for ‘P_incidence’ and ‘S_slope’ and share insights. [2 Marks]
E. Visualize a boxplot to check distribution of the features and share insights. [2 Marks]
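For the correlation insights, the numbers behind a heatmap can be ranked directly. The sketch below is an assumption-laden illustration: the data is synthetic (‘S_slope’ is deliberately constructed to follow ‘P_incidence’), and `feature_x` is a hypothetical unrelated column. The resulting matrix `corr` is what a plotting call such as `seaborn.heatmap(corr, annot=True)` would render.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: S_slope is built from P_incidence, feature_x is noise.
p_incidence = rng.normal(60, 15, n)
s_slope = 0.8 * p_incidence + rng.normal(0, 5, n)
feature_x = rng.normal(0, 1, n)  # hypothetical column, not in the brief
df = pd.DataFrame(
    {"P_incidence": p_incidence, "S_slope": s_slope, "feature_x": feature_x}
)

corr = df.corr()  # Pearson correlation matrix

# Keep only the upper triangle so each pair appears once and the
# diagonal (always 1.0) is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank the pairs by absolute correlation, strongest first.
pairs = pairs.sort_values(key=abs, ascending=False)
strongest = pairs.index[0]
weakest = pairs.index[-1]

print(pairs)
print("Strongest pair:", strongest, round(pairs.iloc[0], 3))
print("Weakest pair:", weakest, round(pairs.iloc[-1], 3))
```

Ranking by absolute value treats strong negative correlations as strong, which is usually what the insight question is after.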
4. Model Building: [6 Marks]
A. Split data into X and Y. [1 Mark]
B. Split data into train and test with 80:20 proportion. [1 Mark]
C. Train a Supervised Learning Classification base model using KNN classifier. [2 Marks]
D. Print all the possible performance metrics for both train and test data. [2 Marks]
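The Model Building steps might look like the sketch below. It is only an illustration under stated assumptions: the data comes from scikit-learn's `make_classification` (310 rows, 6 features, 3 classes, chosen to mirror the shape described in this brief), not from the project files, and `n_neighbors=5` is simply the library default used as a base model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: X holds the features, y the class label.
X, y = make_classification(
    n_samples=310, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)

# 80:20 split; stratify keeps class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Base KNN model with default-style settings.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Report the same metrics for train and test to expose overfitting.
scores = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = knn.predict(X_part)
    scores[name] = accuracy_score(y_part, pred)
    print(f"--- {name} ---")
    print("Accuracy:", round(scores[name], 3))
    print(confusion_matrix(y_part, pred))
    print(classification_report(y_part, pred))
```

A large gap between train and test scores is one signal worth commenting on when sharing results.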
5. Performance Improvement: [4 Marks]
A. Experiment with various parameters to improve performance of the base model. [2 Marks]
(Optional: Experiment with various Hyperparameters - Research required)
B. Clearly showcase improvement in performance achieved. [1 Mark]
For Example:
i. Accuracy: +15% improvement.
ii. Precision: +10% improvement.
C. Clearly state which parameters contributed most to improve model performance. [1 Mark]
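One way to experiment with KNN parameters and report the change is sketched here. The choices are assumptions, not requirements of this brief: synthetic data, feature scaling with `StandardScaler`, a sweep over odd `n_neighbors` values and the two `weights` options, and 5-fold cross-validation on the training split to pick the winner.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=310, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Base model: unscaled features, default parameters.
base = KNeighborsClassifier().fit(X_train, y_train)
base_acc = base.score(X_test, y_test)

# Sweep parameters using cross-validation on the training data only,
# so the test split stays untouched until the final comparison.
cv_scores = {}
for k in range(1, 22, 2):
    for weights in ("uniform", "distance"):
        model = make_pipeline(
            StandardScaler(), KNeighborsClassifier(n_neighbors=k, weights=weights)
        )
        cv_scores[(k, weights)] = cross_val_score(model, X_train, y_train, cv=5).mean()

best_k, best_weights = max(cv_scores, key=cv_scores.get)
tuned = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=best_k, weights=best_weights)
).fit(X_train, y_train)
tuned_acc = tuned.score(X_test, y_test)

print(f"Best parameters: n_neighbors={best_k}, weights={best_weights}")
print(f"Accuracy: base {base_acc:.3f} -> tuned {tuned_acc:.3f} "
      f"({(tuned_acc - base_acc) * 100:+.1f} percentage points)")
```

Comparing `cv_scores` across `k` for a fixed `weights` value (and vice versa) is a simple way to argue which parameter contributed most.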
Part B - 30 Marks
• DOMAIN: Banking, Marketing
• CONTEXT: Bank X is undergoing a massive digital transformation across all its departments. The bank has a growing customer base, where the majority are
liability customers (depositors) rather than borrowers (asset customers). The bank is interested in expanding its borrower base rapidly to bring in more
business via loan interest. A campaign that the bank ran last quarter showed an average single-digit conversion rate. With digital transformation
being the core strength of the business strategy, the marketing department wants to devise effective campaigns with better target marketing to
increase the conversion ratio to double digits with the same budget as the last campaign.
• DATA DICTIONARY:
1. ID: Customer ID
2. Age: Customer’s approximate age.
3. CustomerSince: Customer of the bank since. [unit is masked]
4. HighestSpend: Customer’s highest spend so far in one transaction. [unit is masked]
5. ZipCode: Customer’s zip code.
6. HiddenScore: A score associated with the customer, which is masked by the bank as an IP.
7. MonthlyAverageSpend: Customer’s monthly average spend so far. [unit is masked]
8. Level: A level associated with the customer, which is masked by the bank as an IP.
9. Mortgage: Customer’s mortgage. [unit is masked]
10. Security: Customer’s security asset with the bank. [unit is masked]
11. FixedDepositAccount: Customer’s fixed deposit account with the bank. [unit is masked]
12. InternetBanking: if the customer uses internet banking.
13. CreditCard: if the customer uses bank’s credit card.
14. LoanOnCard: if the customer has a loan on credit card.
• PROJECT OBJECTIVE: Build a Machine Learning model to perform focused marketing by predicting the potential customers who will convert
using the historical dataset.
• STEPS AND TASK [30 Marks]:
1. Data Understanding and Preparation: [5 Marks]
A. Read both the Datasets ‘Data1’ and ‘Data 2’ as DataFrame and store them into two separate variables. [1 Mark]
B. Print shape, Column Names and DataTypes of both the DataFrames. [1 Mark]
C. Merge both the DataFrames on ‘ID’ feature to form a single DataFrame. [2 Marks]
D. Change Datatype of below features to ‘Object’. [1 Mark]
‘CreditCard’, ‘InternetBanking’, ‘FixedDepositAccount’, ‘Security’, ‘Level’, ‘HiddenScore’.
[Reason behind performing this operation: Values in these features are binary i.e. 1/0. But DataType is ‘int’/‘float’ which is not expected.]
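The merge and datatype conversion above could be sketched as follows. The rows are invented, and how the 14 columns are divided between the two datasets is an assumption made purely for illustration; only the column names come from the data dictionary.

```python
import pandas as pd

# Invented rows; the column split between the two frames is assumed.
data1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Age": [25, 45, 39],
    "HiddenScore": [4, 3, 1],
    "Level": [1, 1, 2],
})
data2 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Security": [1, 0, 0],
    "FixedDepositAccount": [0, 0, 1],
    "InternetBanking": [0, 1, 1],
    "CreditCard": [0, 0, 1],
    "LoanOnCard": [0.0, 1.0, 0.0],
})

for name, df in [("data1", data1), ("data2", data2)]:
    print(name, df.shape, list(df.columns))
    print(df.dtypes)

# Join rows that share the same customer ID.
merged = data1.merge(data2, on="ID", how="inner")

# Treat coded features as categories rather than numbers.
to_object = [
    "CreditCard", "InternetBanking", "FixedDepositAccount",
    "Security", "Level", "HiddenScore",
]
merged = merged.astype({col: "object" for col in to_object})
print(merged.dtypes)
```

An inner join keeps only IDs present in both datasets; comparing the merged row count with the inputs is a quick check that no customers were silently dropped.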
2. Data Exploration and Analysis: [5 Marks]
A. Visualize distribution of Target variable ‘LoanOnCard’ and clearly share insights. [2 Marks]
B. Check the percentage of missing values and impute if required. [1 Mark]
C. Check for unexpected values in each categorical variable and impute with best suitable value. [2 Marks]
[Unexpected values means that if all values in a feature are 0/1, then ‘?’, ‘a’, 1.5 are unexpected values which need treatment.]
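A minimal sketch of treating unexpected values in a 0/1 feature is shown below. It assumes an invented ‘InternetBanking’ column containing the kinds of stray values mentioned above, and it imputes with the most frequent valid value (the mode), which is one reasonable choice among several rather than a prescribed method.

```python
import pandas as pd

# Invented column mixing valid 0/1 codes with unexpected entries.
df = pd.DataFrame({"InternetBanking": [1, 0, "?", 1, "a", 1.5, 1, None]})

# Percentage of values that are missing outright.
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Inspect every distinct value to spot the unexpected ones.
print(df["InternetBanking"].value_counts(dropna=False))

allowed = {0, 1}

# Non-numeric entries ('?', 'a') become NaN ...
numeric = pd.to_numeric(df["InternetBanking"], errors="coerce")
# ... and numeric values outside the allowed set (1.5) become NaN too.
cleaned = numeric.where(numeric.isin(allowed))

# Impute every NaN with the most frequent valid value.
mode_value = cleaned.mode().iloc[0]
df["InternetBanking"] = cleaned.fillna(mode_value).astype(int)
print(df["InternetBanking"].tolist())
```

Running `value_counts(dropna=False)` on each categorical column before and after treatment makes the effect of the imputation easy to show.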
3. Data Preparation and model building: [10 Marks]
A. Split data into X and Y. [1 Mark]
[Recommended to drop ID & ZipCode. LoanOnCard is target Variable]
B. Split data into train and test. Keep 25% data reserved for testing. [1 Mark]
C. Train a Supervised Learning Classification base model - Logistic Regression. [2 Marks]
D. Print evaluation metrics for the model and clearly share insights. [1 Mark]
E. Balance the data using the right balancing technique. [2 Marks]
i. Check distribution of the target variable
ii. Say output is class A : 20% and class B : 80%
iii. Here you need to balance the target variable as 50:50.
iv. Try appropriate method to achieve the same.
F. Again train the same previous model on balanced data. [1 Mark]
G. Print evaluation metrics and clearly share differences observed. [2 Marks]
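The base-versus-balanced comparison could be sketched as below. Several things here are assumptions: the data is synthetic with roughly a 90:10 class split, balancing is done by random oversampling of the minority class with `sklearn.utils.resample`, and only the training split is balanced so the test split keeps its original distribution. Other techniques (for example SMOTE from the separate imbalanced-learn package, or undersampling) may suit the real data better.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data: class 1 is the minority (about 10%).
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base model trained on the imbalanced training split.
base = LogisticRegression(max_iter=1000).fit(X_train, y_train)
base_recall = recall_score(y_test, base.predict(X_test))

# Oversample the minority class (with replacement) up to the majority size.
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=0,
)
X_bal = np.vstack([X_train[~minority], X_min_up])
y_bal = np.concatenate([y_train[~minority], y_min_up])
print("Balanced class counts:", np.bincount(y_bal))

# Same model, retrained on the 50:50 training data.
balanced = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
balanced_recall = recall_score(y_test, balanced.predict(X_test))

print(classification_report(y_test, base.predict(X_test)))
print(classification_report(y_test, balanced.predict(X_test)))
print(f"Minority recall: base {base_recall:.3f}, balanced {balanced_recall:.3f}")
```

Because accuracy can look high on imbalanced data even when few converters are found, recall and precision on the minority class are usually the metrics to compare.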
4. Performance Improvement: [10 Marks]
A. Train a base model each for SVM, KNN. [4 Marks]
B. Tune parameters for each of the models wherever required and finalize a model. [3 Marks]
(Optional: Experiment with various Hyperparameters - Research required)
C. Print evaluation metrics for final model. [1 Mark]
D. Share improvement achieved from base model to final model. [2 Marks]
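A possible shape for the SVM and KNN comparison is sketched below, assuming synthetic data, feature scaling inside a `Pipeline`, F1 on the positive class as the comparison metric, and small illustrative parameter grids for `GridSearchCV`. The grids and the metric are assumptions to adapt, not values taken from this brief.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Each candidate: (base estimator, illustrative grid of parameters to try).
candidates = {
    "SVM": (SVC(), {"model__C": [0.1, 1, 10], "model__gamma": ["scale", 0.1]}),
    "KNN": (
        KNeighborsClassifier(),
        {"model__n_neighbors": [3, 5, 9, 15], "model__weights": ["uniform", "distance"]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])

    # Base model: default parameters.
    base_f1 = f1_score(y_test, pipe.fit(X_train, y_train).predict(X_test))

    # Tuned model: best grid point by 5-fold cross-validated F1.
    search = GridSearchCV(pipe, grid, scoring="f1", cv=5).fit(X_train, y_train)
    tuned_f1 = f1_score(y_test, search.predict(X_test))

    results[name] = {
        "base_f1": base_f1,
        "tuned_f1": tuned_f1,
        "best_params": search.best_params_,
    }

for name, res in results.items():
    print(f"{name}: base F1 {res['base_f1']:.3f} -> tuned F1 {res['tuned_f1']:.3f}, "
          f"best params {res['best_params']}")
```

Keeping the scaler inside the pipeline means it is refit on each cross-validation fold, which avoids leaking information from validation rows into training.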