Index
1. Introduction
2. Data Overview
2.1 Dataset Description
2.2 Data Exploration
3. Data Preprocessing
3.1 Data Cleaning
3.2 Feature Engineering
3.3 Train-Test Split
4. Model Development
4.1 Logistic Regression
4.1.1 Model Training
4.1.2 Model Evaluation
4.2 Linear Discriminant Analysis (LDA)
4.2.1 Model Training
4.2.2 Model Evaluation
4.3 Decision Tree
4.3.1 Model Training
4.3.2 Model Evaluation
5. Model Performance Comparison
5.1 Training Set Performance
5.2 Testing Set Performance
6. Feature Importance Analysis
7. Business Recommendations
8. Conclusion
9. References
Define the problem and perform Exploratory Data Analysis
Problem Definition: The primary objective is to analyze and build a machine learning model to help identify which leads are more likely to convert to paid customers for ExtraaLearn. This involves:
- Analyzing the dataset to understand the features and their relevance to lead conversion.
- Building a predictive model to identify leads with a higher probability of conversion.
- Determining the factors driving the lead conversion process.
- Creating a profile of leads likely to convert based on the insights gained from the model.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Setting the precision of floating-point numbers to 5 decimal places
pd.set_option("display.float_format", lambda x: "%.5f" % x)
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,
precision_recall_curve,
roc_curve,
make_scorer,
)
import warnings
warnings.filterwarnings("ignore")
Shape of the dataset: (4612, 15)
| Column | Data Type |
|------------------------|-------------------|
| ID | Object |
| age | Int64 |
| current_occupation | Object |
| first_interaction | Object |
| profile_completed | Object |
| website_visits | Int64 |
| time_spent_on_website | Int64 |
| page_views_per_visit | Float64 |
| last_activity | Object |
| print_media_type1 | Object |
| print_media_type2 | Object |
| digital_media | Object |
| educational_channels | Object |
| referral | Object |
| status | Int64 |
Statistical summary of numerical columns:
| Statistic | age | website_visits | time_spent_on_website | page_views_per_visit | status |
|-----------|-----------|----------------|-----------------------|----------------------|-----------|
| count | 4612.00000 | 4612.00000 | 4612.00000 | 4612.00000 | 4612.00000 |
| mean | 46.20121 | 3.56678 | 724.01127 | 3.02613 | 0.29857 |
| std | 13.16145 | 2.82913 | 743.82868 | 1.96812 | 0.45768 |
| min | 18.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| 25% | 36.00000 | 2.00000 | 148.75000 | 2.07775 | 0.00000 |
| 50% | 51.00000 | 3.00000 | 376.00000 | 2.79200 | 0.00000 |
| 75% | 57.00000 | 5.00000 | 1336.75000 | 3.75625 | 1.00000 |
| max | 63.00000 | 30.00000 | 2537.00000 | 18.43400 | 1.00000 |
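A minimal sketch of how this overview can be reproduced; the file name and DataFrame variable are assumptions, not taken from the report:

```python
import pandas as pd

# Load the ExtraaLearn lead data (file name is hypothetical)
data = pd.read_csv("ExtraaLearn.csv")

print("Shape of the dataset:", data.shape)   # expected: (4612, 15)
print(data.dtypes)                           # data type of each column
print(data.describe())                       # summary of the numerical columns
```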
Univariate analysis
Number of leads who haven't visited the website: 174
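This count can be read straight off the `website_visits` column; a one-line sketch, assuming the same `data` DataFrame as above:

```python
# Leads that never visited the website have website_visits equal to 0
print("Leads with no website visits:", (data["website_visits"] == 0).sum())
```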
Multivariate analysis
| current_occupation | status = 0 | status = 1 | All |
|--------------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| Professional | 1687 | 929 | 2616 |
| Unemployed | 1058 | 383 | 1441 |
| Student | 490 | 65 | 555 |

| first_interaction | status = 0 | status = 1 | All |
|-------------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| Website | 1383 | 1159 | 2542 |
| Mobile App | 1852 | 218 | 2070 |

| profile_completed | status = 0 | status = 1 | All |
|-------------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| High | 1318 | 946 | 2264 |
| Medium | 1818 | 423 | 2241 |
| Low | 99 | 8 | 107 |

| last_activity | status = 0 | status = 1 | All |
|---------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| Email Activity | 1587 | 691 | 2278 |
| Website Activity | 677 | 423 | 1100 |
| Phone Activity | 971 | 263 | 1234 |

| print_media_type1 | status = 0 | status = 1 | All |
|-------------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| No | 2897 | 1218 | 4115 |
| Yes | 338 | 159 | 497 |

| print_media_type2 | status = 0 | status = 1 | All |
|-------------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| No | 3077 | 1302 | 4379 |
| Yes | 158 | 75 | 233 |

| digital_media | status = 0 | status = 1 | All |
|---------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| No | 2876 | 1209 | 4085 |
| Yes | 359 | 168 | 527 |

| educational_channels | status = 0 | status = 1 | All |
|----------------------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| No | 2727 | 1180 | 3907 |
| Yes | 508 | 197 | 705 |

| referral | status = 0 | status = 1 | All |
|----------|-----------|-----------|------|
| All | 3235 | 1377 | 4612 |
| No | 3205 | 1314 | 4519 |
| Yes | 30 | 63 | 93 |
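The tables above are conversion counts of `status` against each categorical feature; a sketch of how they can be generated with `pd.crosstab`, assuming the same `data` DataFrame (the loop below is an illustration, not the notebook's exact code):

```python
categorical_cols = [
    "current_occupation", "first_interaction", "profile_completed",
    "last_activity", "print_media_type1", "print_media_type2",
    "digital_media", "educational_channels", "referral",
]

for col in categorical_cols:
    # margins=True adds the "All" row and column shown above
    print(pd.crosstab(data[col], data["status"], margins=True), "\n")
```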
# Data Preparation for Modeling
Shape of Training set : (3228, 4627)
Shape of Test set : (1384, 4627)
Percentage of classes in the training and test sets:

| status | Training set | Test set |
|--------|--------------|----------|
| 0 | 0.70415 | 0.69509 |
| 1 | 0.29585 | 0.30491 |
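A sketch of a preparation step that would produce shapes like those above, assuming the `data` DataFrame from earlier; the exact encoding and split settings used in the report are not shown, so `get_dummies`, the 70/30 split, and `random_state` are assumptions (one-hot encoding every object column, including `ID`, would account for the 4,627 features, but that is an inference):

```python
# Separate target and predictors, then one-hot encode categorical columns
X = pd.get_dummies(data.drop(columns=["status"]), drop_first=True)
y = data["status"]

# 70/30 split; random_state is an assumption for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

print("Shape of training set:", X_train.shape)
print("Shape of test set:", X_test.shape)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```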
# Model evaluation criterion
| Accuracy | Recall | Precision | F1 |
|----------|--------|-----------|---------|
| 0.80130 | 0.60190 | 0.70360 | 0.64879 |

Confusion matrix (LogisticRegression on the test set, 1384 observations):
[[855 107]
 [168 254]]
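The metrics table and confusion matrix above come from helper functions whose bodies are not included in the report; a sketch of what such helpers could look like, reusing the metric functions already imported (the function names here are illustrative, not the notebook's):

```python
def model_performance_classification(model, predictors, target):
    """Return accuracy, recall, precision and F1 for a fitted classifier."""
    pred = model.predict(predictors)
    return pd.DataFrame(
        {
            "Accuracy": accuracy_score(target, pred),
            "Recall": recall_score(target, pred),
            "Precision": precision_score(target, pred),
            "F1": f1_score(target, pred),
        },
        index=[0],
    )


def confusion_matrix_model(model, predictors, target):
    """Print the raw confusion matrix for a fitted classifier."""
    print(confusion_matrix(target, model.predict(predictors)))
```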
# Building Logistic Regression Model
Model performance on test set:

| Accuracy | Recall | Precision | F1 |
|----------|--------|-----------|---------|
| 0.80130 | 0.60190 | 0.70360 | 0.64879 |
Train ROC-AUC score is : 0.8772828307723493
Test ROC-AUC score is : 0.8565821107290302
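A sketch of the baseline fit and the ROC-AUC computation, assuming the split and helper above; the solver settings and `random_state` are assumptions:

```python
lg = LogisticRegression(random_state=1, max_iter=1000)
lg.fit(X_train, y_train)

# Metrics on the held-out test set
print(model_performance_classification(lg, X_test, y_test))

# ROC-AUC uses the predicted probability of the positive class
print("Train ROC-AUC:", roc_auc_score(y_train, lg.predict_proba(X_train)[:, 1]))
print("Test ROC-AUC:", roc_auc_score(y_test, lg.predict_proba(X_test)[:, 1]))
```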
# Using GridSearch for Hyperparameter Tuning of the Logistic Regression Model
Train ROC-AUC score is : 0.9989528795811519
Test ROC-AUC score is : 0.38762673537555054
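A sketch of what the tuning step could look like; the parameter grid, scorer, and cross-validation settings are assumptions, since the report only shows the resulting ROC-AUC scores:

```python
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=1),
    param_grid,
    scoring=make_scorer(recall_score),  # scorer choice is an assumption
    cv=5,
)
grid.fit(X_train, y_train)

lg_tuned = grid.best_estimator_
print("Best parameters:", grid.best_params_)
```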
# Building Linear Discriminant Analysis (LDA) Model
Accuracy on Training Set: 0.718091697645601
Confusion Matrix for Training Set:
[[2005 268]
[ 642 313]]
Accuracy on Test Set: 0.7044797687861272
Confusion Matrix for Test Set:
[[832 130]
[279 143]]
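A sketch of the LDA fit behind the numbers above, assuming the same train/test split; the default `svd` solver is an assumption:

```python
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# .score() returns accuracy for classifiers
print("Accuracy on Training Set:", lda.score(X_train, y_train))
print(confusion_matrix(y_train, lda.predict(X_train)))
print("Accuracy on Test Set:", lda.score(X_test, y_test))
print(confusion_matrix(y_test, lda.predict(X_test)))
```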
# Building Decision Tree Model
# Text representation of the fitted tree, showing how features drive the splits
|--- time_spent_on_website <= 415.50
| |--- age <= 26.50
| | |--- page_views_per_visit <= 0.04
| | | |--- weights: [2.70, 2.10] class: 0
| | |--- page_views_per_visit > 0.04
| | | |--- page_views_per_visit <= 3.34
| | | | |--- weights: [37.20, 0.70] class: 0
| | | |--- page_views_per_visit > 3.34
| | | | |--- time_spent_on_website <= 138.50
| | | | | |--- weights: [10.50, 0.00] class: 0
| | | | |--- time_spent_on_website > 138.50
| | | | | |--- weights: [17.10, 6.30] class: 0
| |--- age > 26.50
| | |--- page_views_per_visit <= 3.71
| | | |--- time_spent_on_website <= 175.50
| | | | |--- time_spent_on_website <= 169.50
| | | | | |--- weights: [138.90, 77.00] class: 0
| | | | |--- time_spent_on_website > 169.50
| | | | | |--- weights: [0.90, 2.80] class: 1
| | | |--- time_spent_on_website > 175.50
| | | | |--- page_views_per_visit <= 3.68
| | | | | |--- weights: [144.90, 58.10] class: 0
| | | | |--- page_views_per_visit > 3.68
| | | | | |--- weights: [1.20, 2.80] class: 1
| | |--- page_views_per_visit > 3.71
| | | |--- page_views_per_visit <= 3.84
| | | | |--- age <= 58.50
| | | | | |--- weights: [18.90, 0.70] class: 0
| | | | |--- age > 58.50
| | | | | |--- weights: [2.10, 1.40] class: 0
| | | |--- page_views_per_visit > 3.84
| | | | |--- page_views_per_visit <= 3.85
| | | | | |--- weights: [0.00, 1.40] class: 1
| | | | |--- page_views_per_visit > 3.85
| | | | | |--- weights: [84.30, 27.30] class: 0
|--- time_spent_on_website > 415.50
| |--- age <= 25.50
| | |--- website_visits <= 3.50
| | | |--- page_views_per_visit <= 5.39
| | | | |--- time_spent_on_website <= 2223.50
| | | | | |--- weights: [15.00, 16.80] class: 1
| | | | |--- time_spent_on_website > 2223.50
| | | | | |--- weights: [2.70, 0.00] class: 0
| | | |--- page_views_per_visit > 5.39
| | | | |--- weights: [3.60, 0.00] class: 0
| | |--- website_visits > 3.50
| | | |--- time_spent_on_website <= 1933.50
| | | | |--- weights: [14.10, 2.80] class: 0
| | | |--- time_spent_on_website > 1933.50
| | | | |--- time_spent_on_website <= 2039.50
| | | | | |--- weights: [0.30, 1.40] class: 1
| | | | |--- time_spent_on_website > 2039.50
| | | | | |--- weights: [3.60, 1.40] class: 0
| |--- age > 25.50
| | |--- time_spent_on_website <= 2204.00
| | | |--- weights: [169.20, 403.90] class: 1
| | |--- time_spent_on_website > 2204.00
| | | |--- weights: [14.70, 61.60] class: 1
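A sketch of how a tree like the one printed above can be fitted and exported; `class_weight`, `max_depth`, and `random_state` are assumptions (the fractional sample weights in the printout suggest balanced class weights, but the report does not state the settings):

```python
dt = DecisionTreeClassifier(class_weight="balanced", max_depth=5, random_state=1)
dt.fit(X_train, y_train)

# export_text produces the indented rule listing shown above
print(tree.export_text(dt, feature_names=list(X_train.columns), show_weights=True))
```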
# Feature Importance
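The importance scores themselves are not reproduced in this extract; a sketch of how they can be obtained from the fitted tree (assumes the `dt` model above):

```python
importances = pd.Series(dt.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```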
Training performance comparison:
| Metric | Logistic Regression | LDA | Decision Tree |
|-----------|---------------------|---------|---------------|
| Accuracy | 0.71747 | 0.71809 | 0.99318 |
| Recall | 0.28168 | 0.32775 | 0.97696 |
| Precision | 0.54343 | 0.53873 | 1.00000 |
| F1 | 0.37103 | 0.40755 | 0.98835 |
Testing performance comparison:
| Metric | Logistic Regression | LDA | Decision Tree |
|-----------|---------------------|---------|---------------|
| Accuracy | 0.70014 | 0.70448 | 0.64451 |
| Recall | 0.27725 | 0.33886 | 0.45498 |
| Precision | 0.51542 | 0.52381 | 0.42291 |
| F1 | 0.36055 | 0.41151 | 0.43836 |
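A sketch of how the two comparison tables can be assembled, assuming the fitted models and the metrics helper defined earlier (variable names are illustrative):

```python
models = {"Logistic Regression": lg, "LDA": lda, "Decision Tree": dt}

def comparison_table(predictors, target):
    # One column of metrics per model, metrics as rows
    table = pd.concat(
        [model_performance_classification(m, predictors, target).T for m in models.values()],
        axis=1,
    )
    table.columns = list(models)
    return table

print("Training performance comparison:\n", comparison_table(X_train, y_train))
print("Testing performance comparison:\n", comparison_table(X_test, y_test))
```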
Use Logistic Regression or LDA for Predictive Modeling: Both Logistic
Regression and LDA demonstrate stable performance across training and
testing sets. They offer a good balance between accuracy, recall, precision,
and F1-score. Therefore, they can be reliable choices for predictive modeling
in your business context.
Consider the Decision Tree Model for Further Investigation: Although the
Decision Tree model shows high accuracy on the training set, its performance
on the testing set is lower compared to Logistic Regression and LDA. This
suggests potential overfitting. Further investigation into the decision tree
model's structure, feature importance, and potential pruning techniques may
help improve its generalization performance.
Evaluate Feature Importance: Analyze the feature importance provided by
each model to understand which variables contribute most to the prediction.
This can provide valuable insights into customer behavior and preferences.
For example, features related to website visits, time spent on the website, and
page views per visit appear to be influential in predicting customer status.
Refine Marketing Strategies: Tailor marketing strategies based on the insights
gained from predictive modeling. For instance, focus marketing efforts on
customers who exhibit behaviors indicative of higher conversion rates, such
as frequent website visits, longer time spent on the website, and higher page
views per visit.
Regular Model Monitoring and Updates: Continuously monitor model
performance and update the models as new data becomes available.
Customer behavior and preferences may evolve over time, requiring
adjustments to the predictive models to maintain their effectiveness.
Aim for Balanced Performance Metrics: Choose the trade-off between accuracy, recall, precision, and F1-score that fits the specific business objectives and constraints. For example, if false positives (predicting a customer will convert when they won't) are costly, prioritize precision. If false negatives (failing to identify potential converters) are more concerning, prioritize recall.
By incorporating these recommendations into your business strategy, you can
leverage predictive modeling techniques to better understand customer
behavior, optimize marketing efforts, and ultimately improve conversion rates.