Index
1. Introduction
2. Data Overview
2.1 Dataset Description
2.2 Data Exploration
3. Data Preprocessing
3.1 Data Cleaning
3.2 Feature Engineering
3.3 Train-Test Split
4. Model Development
4.1 Logistic Regression
4.1.1 Model Training
4.1.2 Model Evaluation
4.2 Linear Discriminant Analysis (LDA)
4.2.1 Model Training
4.2.2 Model Evaluation
4.3 Decision Tree
4.3.1 Model Training
4.3.2 Model Evaluation
5. Model Performance Comparison
5.1 Training Set Performance
5.2 Testing Set Performance
6. Feature Importance Analysis
7. Business Recommendations
8. Conclusion
9. References
Define the problem and perform Exploratory Data Analysis
Problem Definition: The primary objective is to analyze and build a machine learning model to help identify which leads are more likely to convert to paid customers for ExtraaLearn. This involves:
- Analyzing the dataset to understand the features and their relevance to lead conversion.
- Building a predictive model to identify leads with a higher probability of conversion.
- Determining the factors driving the lead conversion process.
- Creating a profile of leads likely to convert based on the insights gained from the model.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Setting the precision of floating-point numbers to 5 decimal places
pd.set_option("display.float_format", lambda x: "%.5f" % x)
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,
precision_recall_curve,
roc_curve,
make_scorer,
)
import warnings
warnings.filterwarnings("ignore")
Shape of the dataset: (4612, 15)
| Column | Data Type |
|------------------------|-------------------|
| ID | Object |
| age | Int64 |
| current_occupation | Object |
| first_interaction | Object |
| profile_completed | Object |
| website_visits | Int64 |
| time_spent_on_website | Int64 |
| page_views_per_visit | Float64 |
| last_activity | Object |
| print_media_type1 | Object |
| print_media_type2 | Object |
| digital_media | Object |
| educational_channels | Object |
| referral | Object |
| status | Int64 |
Statistical summary of numerical columns:
| Statistic | age | website_visits | time_spent_on_website | page_views_per_visit | status |
|-----------|-----------|----------------|-----------------------|----------------------|-----------|
| count | 4612.00000 | 4612.00000 | 4612.00000 | 4612.00000 | 4612.00000 |
| mean | 46.20121 | 3.56678 | 724.01127 | 3.02613 | 0.29857 |
| std | 13.16145 | 2.82913 | 743.82868 | 1.96812 | 0.45768 |
| min | 18.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| 25% | 36.00000 | 2.00000 | 148.75000 | 2.07775 | 0.00000 |
| 50% | 51.00000 | 3.00000 | 376.00000 | 2.79200 | 0.00000 |
| 75% | 57.00000 | 5.00000 | 1336.75000 | 3.75625 | 1.00000 |
| max | 63.00000 | 30.00000 | 2537.00000 | 18.43400 | 1.00000 |
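A minimal sketch of how this overview can be reproduced; the file name and DataFrame variable are assumptions, not taken from the report:

```python
import pandas as pd

# Load the ExtraaLearn lead data (file name is hypothetical)
data = pd.read_csv("ExtraaLearn.csv")

print("Shape of the dataset:", data.shape)   # expected: (4612, 15)
print(data.dtypes)                           # data type of each column
print(data.describe())                       # summary of the numerical columns
```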
Univariate analysis
Number of leads who haven't visited the website: 174
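This count can be read straight off the `website_visits` column; a one-line sketch, assuming the same `data` DataFrame as above:

```python
# Leads that never visited the website have website_visits equal to 0
print("Leads with no website visits:", (data["website_visits"] == 0).sum())
```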
Multivariate analysis
| current_occupation | status = 0 | status = 1 | All |
|--------------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| Professional | 1687 | 929 | 2616 |
| Unemployed | 1058 | 383 | 1441 |
| Student | 490 | 65 | 555 |

| first_interaction | status = 0 | status = 1 | All |
|-------------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| Website | 1383 | 1159 | 2542 |
| Mobile App | 1852 | 218 | 2070 |

| profile_completed | status = 0 | status = 1 | All |
|-------------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| High | 1318 | 946 | 2264 |
| Medium | 1818 | 423 | 2241 |
| Low | 99 | 8 | 107 |

| last_activity | status = 0 | status = 1 | All |
|---------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| Email Activity | 1587 | 691 | 2278 |
| Website Activity | 677 | 423 | 1100 |
| Phone Activity | 971 | 263 | 1234 |

| print_media_type1 | status = 0 | status = 1 | All |
|-------------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| No | 2897 | 1218 | 4115 |
| Yes | 338 | 159 | 497 |

| print_media_type2 | status = 0 | status = 1 | All |
|-------------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| No | 3077 | 1302 | 4379 |
| Yes | 158 | 75 | 233 |

| digital_media | status = 0 | status = 1 | All |
|---------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| No | 2876 | 1209 | 4085 |
| Yes | 359 | 168 | 527 |

| educational_channels | status = 0 | status = 1 | All |
|----------------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| No | 2727 | 1180 | 3907 |
| Yes | 508 | 197 | 705 |

| referral | status = 0 | status = 1 | All |
|----------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| No | 3205 | 1314 | 4519 |
| Yes | 30 | 63 | 93 |
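The tables above are conversion counts of `status` against each categorical feature; a sketch of how they can be generated with `pd.crosstab`, assuming the same `data` DataFrame (the loop below is an illustration, not the notebook's exact code):

```python
categorical_cols = [
    "current_occupation", "first_interaction", "profile_completed",
    "last_activity", "print_media_type1", "print_media_type2",
    "digital_media", "educational_channels", "referral",
]

for col in categorical_cols:
    # margins=True adds the "All" row and column shown above
    print(pd.crosstab(data[col], data["status"], margins=True), "\n")
```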
# Data Preparation for Modeling
Shape of Training set : (3228, 4627)
Shape of Test set : (1384, 4627)
Percentage of classes in the training and test sets:

| status | Training set | Test set |
|--------|--------------|----------|
| 0 | 0.70415 | 0.69509 |
| 1 | 0.29585 | 0.30491 |
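A sketch of a preparation step that would produce shapes like those above, assuming the `data` DataFrame from earlier; the exact encoding and split settings used in the report are not shown, so `get_dummies`, the 70/30 split, and `random_state` are assumptions (one-hot encoding every object column, including `ID`, would account for the 4,627 features, but that is an inference):

```python
# Separate target and predictors, then one-hot encode categorical columns
X = pd.get_dummies(data.drop(columns=["status"]), drop_first=True)
y = data["status"]

# 70/30 split; random_state is an assumption for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

print("Shape of training set:", X_train.shape)
print("Shape of test set:", X_test.shape)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```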
# Model evaluation criterion
| Accuracy | Recall | Precision | F1 |
|----------|--------|-----------|---------|
| 0.80130 | 0.60190 | 0.70360 | 0.64879 |

Confusion matrix (LogisticRegression on the test set, 1384 observations):
[[855 107]
 [168 254]]
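The metrics table and confusion matrix above come from helper functions whose bodies are not included in the report; a sketch of what such helpers could look like, reusing the metric functions already imported (the function names here are illustrative, not the notebook's):

```python
def model_performance_classification(model, predictors, target):
    """Return accuracy, recall, precision and F1 for a fitted classifier."""
    pred = model.predict(predictors)
    return pd.DataFrame(
        {
            "Accuracy": accuracy_score(target, pred),
            "Recall": recall_score(target, pred),
            "Precision": precision_score(target, pred),
            "F1": f1_score(target, pred),
        },
        index=[0],
    )


def confusion_matrix_model(model, predictors, target):
    """Print the raw confusion matrix for a fitted classifier."""
    print(confusion_matrix(target, model.predict(predictors)))
```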
# Building Logistic Regression Model
Model performance on test set:

| Accuracy | Recall | Precision | F1 |
|----------|--------|-----------|---------|
| 0.80130 | 0.60190 | 0.70360 | 0.64879 |
Train ROC-AUC score is : 0.8772828307723493
Test ROC-AUC score is : 0.8565821107290302
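A sketch of the baseline fit and the ROC-AUC computation, assuming the split and helper above; the solver settings and `random_state` are assumptions:

```python
lg = LogisticRegression(random_state=1, max_iter=1000)
lg.fit(X_train, y_train)

# Metrics on the held-out test set
print(model_performance_classification(lg, X_test, y_test))

# ROC-AUC uses the predicted probability of the positive class
print("Train ROC-AUC:", roc_auc_score(y_train, lg.predict_proba(X_train)[:, 1]))
print("Test ROC-AUC:", roc_auc_score(y_test, lg.predict_proba(X_test)[:, 1]))
```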
# Using GridSearch for Hyperparameter Tuning of the Logistic Regression Model
Train ROC-AUC score is : 0.9989528795811519
Test ROC-AUC score is : 0.38762673537555054
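A sketch of what the tuning step could look like; the parameter grid, scorer, and cross-validation settings are assumptions, since the report only shows the resulting ROC-AUC scores:

```python
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=1),
    param_grid,
    scoring=make_scorer(recall_score),  # scorer choice is an assumption
    cv=5,
)
grid.fit(X_train, y_train)

lg_tuned = grid.best_estimator_
print("Best parameters:", grid.best_params_)
```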
# Building Linear Discriminant Analysis (LDA) Model
Accuracy on Training Set: 0.718091697645601
Confusion Matrix for Training Set:
[[2005 268]
[ 642 313]]
Accuracy on Test Set: 0.7044797687861272
Confusion Matrix for Test Set:
[[832 130]
[279 143]]
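A sketch of the LDA fit behind the numbers above, assuming the same train/test split; the default `svd` solver is an assumption:

```python
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# .score() returns accuracy for classifiers
print("Accuracy on Training Set:", lda.score(X_train, y_train))
print(confusion_matrix(y_train, lda.predict(X_train)))
print("Accuracy on Test Set:", lda.score(X_test, y_test))
print(confusion_matrix(y_test, lda.predict(X_test)))
```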
# Building Decision Tree Model
# Text representation of the fitted tree, showing how features drive the splits
|--- time_spent_on_website <= 415.50
| |--- age <= 26.50
| | |--- page_views_per_visit <= 0.04
| | | |--- weights: [2.70, 2.10] class: 0
| | |--- page_views_per_visit > 0.04
| | | |--- page_views_per_visit <= 3.34
| | | | |--- weights: [37.20, 0.70] class: 0
| | | |--- page_views_per_visit > 3.34
| | | | |--- time_spent_on_website <= 138.50
| | | | | |--- weights: [10.50, 0.00] class: 0
| | | | |--- time_spent_on_website > 138.50
| | | | | |--- weights: [17.10, 6.30] class: 0
| |--- age > 26.50
| | |--- page_views_per_visit <= 3.71
| | | |--- time_spent_on_website <= 175.50
| | | | |--- time_spent_on_website <= 169.50
| | | | | |--- weights: [138.90, 77.00] class: 0
| | | | |--- time_spent_on_website > 169.50
| | | | | |--- weights: [0.90, 2.80] class: 1
| | | |--- time_spent_on_website > 175.50
| | | | |--- page_views_per_visit <= 3.68
| | | | | |--- weights: [144.90, 58.10] class: 0
| | | | |--- page_views_per_visit > 3.68
| | | | | |--- weights: [1.20, 2.80] class: 1
| | |--- page_views_per_visit > 3.71
| | | |--- page_views_per_visit <= 3.84
| | | | |--- age <= 58.50
| | | | | |--- weights: [18.90, 0.70] class: 0
| | | | |--- age > 58.50
| | | | | |--- weights: [2.10, 1.40] class: 0
| | | |--- page_views_per_visit > 3.84
| | | | |--- page_views_per_visit <= 3.85
| | | | | |--- weights: [0.00, 1.40] class: 1
| | | | |--- page_views_per_visit > 3.85
| | | | | |--- weights: [84.30, 27.30] class: 0
|--- time_spent_on_website > 415.50
| |--- age <= 25.50
| | |--- website_visits <= 3.50
| | | |--- page_views_per_visit <= 5.39
| | | | |--- time_spent_on_website <= 2223.50
| | | | | |--- weights: [15.00, 16.80] class: 1
| | | | |--- time_spent_on_website > 2223.50
| | | | | |--- weights: [2.70, 0.00] class: 0
| | | |--- page_views_per_visit > 5.39
| | | | |--- weights: [3.60, 0.00] class: 0
| | |--- website_visits > 3.50
| | | |--- time_spent_on_website <= 1933.50
| | | | |--- weights: [14.10, 2.80] class: 0
| | | |--- time_spent_on_website > 1933.50
| | | | |--- time_spent_on_website <= 2039.50
| | | | | |--- weights: [0.30, 1.40] class: 1
| | | | |--- time_spent_on_website > 2039.50
| | | | | |--- weights: [3.60, 1.40] class: 0
| |--- age > 25.50
| | |--- time_spent_on_website <= 2204.00
| | | |--- weights: [169.20, 403.90] class: 1
| | |--- time_spent_on_website > 2204.00
| | | |--- weights: [14.70, 61.60] class: 1
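A sketch of how a tree like the one printed above can be fitted and exported; `class_weight`, `max_depth`, and `random_state` are assumptions (the fractional sample weights in the printout suggest balanced class weights, but the report does not state the settings):

```python
dt = DecisionTreeClassifier(class_weight="balanced", max_depth=5, random_state=1)
dt.fit(X_train, y_train)

# export_text produces the indented rule listing shown above
print(tree.export_text(dt, feature_names=list(X_train.columns), show_weights=True))
```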
# Feature Importance
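The importance scores themselves are not reproduced in this extract; a sketch of how they can be obtained from the fitted tree (assumes the `dt` model above):

```python
importances = pd.Series(dt.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```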
Training performance comparison:
| Metric | Logistic Regression | LDA | Decision Tree |
|-----------|---------------------|---------|---------------|
| Accuracy | 0.71747 | 0.71809 | 0.99318 |
| Recall | 0.28168 | 0.32775 | 0.97696 |
| Precision | 0.54343 | 0.53873 | 1.00000 |
| F1 | 0.37103 | 0.40755 | 0.98835 |
Testing performance comparison:
| Metric | Logistic Regression | LDA | Decision Tree |
|-----------|---------------------|---------|---------------|
| Accuracy | 0.70014 | 0.70448 | 0.64451 |
| Recall | 0.27725 | 0.33886 | 0.45498 |
| Precision | 0.51542 | 0.52381 | 0.42291 |
| F1 | 0.36055 | 0.41151 | 0.43836 |
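A sketch of how the two comparison tables can be assembled, assuming the fitted models and the metrics helper defined earlier (variable names are illustrative):

```python
models = {"Logistic Regression": lg, "LDA": lda, "Decision Tree": dt}

def comparison_table(predictors, target):
    # One column of metrics per model, metrics as rows
    table = pd.concat(
        [model_performance_classification(m, predictors, target).T for m in models.values()],
        axis=1,
    )
    table.columns = list(models)
    return table

print("Training performance comparison:\n", comparison_table(X_train, y_train))
print("Testing performance comparison:\n", comparison_table(X_test, y_test))
```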
Use Logistic Regression or LDA for Predictive Modeling: Both Logistic
Regression and LDA demonstrate stable performance across training and
testing sets. They offer a good balance between accuracy, recall, precision,
and F1-score. Therefore, they can be reliable choices for predictive modeling
in your business context.
Consider the Decision Tree Model for Further Investigation: Although the
Decision Tree model shows high accuracy on the training set, its performance
on the testing set is lower compared to Logistic Regression and LDA. This
suggests potential overfitting. Further investigation into the decision tree
model's structure, feature importance, and potential pruning techniques may
help improve its generalization performance.
Evaluate Feature Importance: Analyze the feature importance provided by
each model to understand which variables contribute most to the prediction.
This can provide valuable insights into customer behavior and preferences.
For example, features related to website visits, time spent on the website, and
page views per visit appear to be influential in predicting customer status.
Refine Marketing Strategies: Tailor marketing strategies based on the insights
gained from predictive modeling. For instance, focus marketing efforts on
customers who exhibit behaviors indicative of higher conversion rates, such
as frequent website visits, longer time spent on the website, and higher page
views per visit.
Regular Model Monitoring and Updates: Continuously monitor model
performance and update the models as new data becomes available.
Customer behavior and preferences may evolve over time, requiring
adjustments to the predictive models to maintain their effectiveness.
Aim for Balanced Performance Metrics: Choose the trade-off between accuracy, recall, precision, and F1-score that fits the specific business objectives and constraints. For example, if false positives (predicting a customer will convert when they won't) are costly, prioritize precision. If false negatives (failing to identify potential converters) are more concerning, prioritize recall.
By incorporating these recommendations into your business strategy, you can
leverage predictive modeling techniques to better understand customer
behavior, optimize marketing efforts, and ultimately improve conversion rates.