

Assignment 03 - Report

This document outlines a data analysis project focused on improving marketing campaign strategies for a telecommunications company using statistical modeling. The project identified customer segments likely to engage, achieving 86% accuracy in predicting conversions. Key insights include the importance of call duration and education level in subscription rates, along with recommendations for refining marketing strategies based on economic trends and customer behavior.

Uploaded by

Learn Easy
Copyright
© All Rights Reserved

Statistical Thinking for Data Science (36103)

Assessment-3 (Data analysis project for marketing campaigns)
Table of Contents
Executive Summary
Introduction
  Problem Statement
  Project Aim & Objectives
Methodology
  Methodology Overview
  Essential Insights from EDA
  Data Cleaning & Preprocessing
  Statistical Models: Parametric & Non-Parametric Models
  Estimation Method- Bayesian Estimation
  Evaluating & Comparing Results
Insights gathered
Conclusion
References
Executive Summary
Building on our previous data exploration, this project aims to develop data science models that
provide answers to key business questions, such as identifying customer segments that are likely to
respond and crafting effective marketing strategies. By applying advanced statistical modeling, we’ve
been able to pinpoint customer groups that are highly receptive, extracting valuable insights to guide
our approach. Using both parametric and non-parametric methods, we ensure a thorough and balanced
analysis. So far, we've achieved impressive results, with an 86% accuracy in predicting conversions.
To keep our models effective, we’re continually refining them and enriching our data, which we see
as essential for long-term success.

Introduction
Problem Statement
Our current focus is on using data science models to answer important business questions around
customer responsiveness to marketing campaigns and making strategic decisions. Specifically, we aim
to identify which customers are most likely to engage, helping shape targeted and effective marketing
strategies:
• Create predictive models that help anticipate the success of marketing campaigns at the
individual customer level, enabling more precise and targeted marketing efforts.
• Uncover at least three actionable insights from the data to guide strategic decisions, supporting
more effective marketing strategies and optimizing campaign performance.

Project Aim & Objectives


Aim
Our goal is to harness data science models to gain meaningful insights into customer responsiveness,
helping a telecommunications company refine and optimize its marketing strategies for better
engagement and results.
Objectives
• Predictive Modeling: Build statistical models to predict how individual customers will
respond to marketing campaigns, helping target the right audience with greater precision.
• Insight Extraction: Identify and draw out at least three actionable insights from the data,
enabling the company to make more informed decisions about campaign strategies.
• Strategic Recommendations: Offer strategic recommendations aimed at boosting campaign
effectiveness and maximizing ROI.

Methodology
Methodology Overview
This section details how we use data science models and statistical techniques to uncover valuable
insights from the analyzed marketing campaign data.
The project flow is visually summarized in the chart below.
Figure-1: Flow Process for Statistical Modelling

Data Quality Issue


The dataset is of high quality, but there’s a class imbalance that needs to be addressed to achieve
accurate modeling results. We’ll apply appropriate techniques to manage this imbalance and improve
the model's performance.

Figure-2: Significant Class Imbalance


Essential Insights from EDA
Job Type vs. Campaign Response:

Figure-3: Job Type vs Campaign Response

Key Insights drawn:

• Looking at the response rates across different job types, certain patterns emerge. The
largest groups—administrative, blue-collar, and technician jobs—contain a high number
of non-subscribers.
• However, students and retirees show a notably higher subscription rate compared to other
groups.
• This suggests that job type does have an impact on campaign success, and roles such as
student and retired could be key target groups for future marketing efforts, as they seem
more receptive to the subscription offer.
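The comparison of response rates by job type described above can be sketched in a few lines of pandas. This is a minimal illustration on hypothetical toy records, not the project's data; the column names "job" and "y" are assumed to follow the public bank marketing dataset listed in the References.

```python
import pandas as pd

# Hypothetical toy records; "job" and "y" are assumed column names
df = pd.DataFrame({
    "job": ["admin.", "admin.", "admin.", "admin.",
            "blue-collar", "blue-collar", "blue-collar",
            "student", "student",
            "retired", "retired", "retired"],
    "y":   ["no", "no", "no", "yes",
            "no", "no", "no",
            "yes", "no",
            "yes", "no", "yes"],
})

# Group size and share of subscribers per job type, highest rate first
summary = (
    df.assign(subscribed=df["y"].eq("yes"))
      .groupby("job")["subscribed"]
      .agg(customers="size", subscription_rate="mean")
      .sort_values("subscription_rate", ascending=False)
)
print(summary)
```

Reporting the group size next to the rate matters here: a large group such as administrative staff can contain many non-subscribers in absolute terms while a small group such as students shows a higher rate.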
Job Type vs. Duration

Figure-4: Job Type of Consumers who took Subscription

Key Insights drawn:

• The duration of the call seems to correlate strongly with subscription rates across job types.
• Customers who ended up subscribing generally had longer calls, indicating that extended
conversations may be more effective in convincing customers to sign up. For instance, job
categories like management and retired show higher subscription rates when calls were
longer.
• This insight suggests that investing more time in calls could lead to higher subscriptions,
as prolonged engagement seems to increase interest and likelihood of subscribing.

Duration of Call vs Education Status of Consumers:

Figure-5: Job Type of Consumers who took Subscription


Key Insights drawn:

• Consumers with "University Degree" and "High School" education levels are the most
likely to take up the subscription.
• On the other hand, the subscription rate is lowest among those who are "Illiterate."
• There's a clear trend that as education level goes up, so does the likelihood of subscribing.
In other words, people with higher levels of education are generally more inclined to take
up the subscription.

Age Distribution of Consumers:

Figure-6: Age Distribution of Consumers

Key Insights drawn:

• The age distribution between customers who subscribed and those who didn’t shows some
minor differences.
• Customers who subscribed tend to have a slightly higher median age compared to those
who didn’t.
• However, most customers fall in a similar age range, generally between 32 and 47, whether
they subscribed or not.
• This suggests that while age may play a small role, it doesn’t seem to be a significant factor
in determining who subscribes.

Data Cleaning & Preprocessing


All listed processes below were conducted for data preparation and cleaning to ensure dataset
readiness for statistical models.
| Data Cleaning / Preprocessing Step | Steps Taken |
|---|---|
| Dropping Irrelevant Columns & Unknown Values | Dropped unknown values present in the columns "housing", "loan", "education", "job", and "marital". Dropped the "default" column due to maximum unknown values. Converted "999" in "pdays" to 0 for consistency and clarity. |
| Checking for Missing & Duplicate Values | Checked for missing values: none found. Checked for duplicates: found 12 duplicate rows and removed them. |
| Encoding Categorical Variables | Implemented one-hot encoding on categorical variables for model compatibility and improved predictive accuracy. |
| Feature Engineering | Utilized Standard Scaler to normalize numerical features, ensuring consistent scales for improved model performance. |

Table-1: Data Cleaning & Preprocessing
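The steps in Table-1 could look roughly like the following pandas and scikit-learn sketch. The rows are hypothetical, the column names are assumed to match the public bank marketing dataset, and interpreting "dropped unknown values" as dropping whole rows is also an assumption.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows; column names are assumptions based on the public dataset
raw = pd.DataFrame({
    "age": [30, 45, 45, 52, 38],
    "job": ["admin.", "technician", "technician", "unknown", "student"],
    "marital": ["married", "single", "single", "married", "single"],
    "education": ["high.school", "university.degree", "university.degree",
                  "basic.4y", "university.degree"],
    "default": ["no", "unknown", "unknown", "no", "no"],
    "housing": ["yes", "no", "no", "yes", "no"],
    "loan": ["no", "no", "no", "no", "yes"],
    "pdays": [999, 3, 3, 999, 6],
    "y": ["no", "yes", "yes", "no", "yes"],
})

# 1. Drop rows holding "unknown" in the listed columns (assumed interpretation)
unknown_cols = ["housing", "loan", "education", "job", "marital"]
clean = raw[~raw[unknown_cols].eq("unknown").any(axis=1)]

# 2. Drop the "default" column and recode the 999 placeholder in "pdays" to 0
clean = clean.drop(columns=["default"])
clean = clean.assign(pdays=clean["pdays"].replace(999, 0))

# 3. Confirm there are no missing values, then remove duplicate rows
assert clean.isna().sum().sum() == 0
clean = clean.drop_duplicates().reset_index(drop=True)

# 4. One-hot encode the categorical features; encode the target as 0/1
features = pd.get_dummies(clean.drop(columns=["y"]), dtype=int)
target = clean["y"].eq("yes").astype(int)

# 5. Standardize the numerical features (zero mean, unit variance)
numeric_cols = ["age", "pdays"]
features = features.astype({col: float for col in numeric_cols})
features[numeric_cols] = StandardScaler().fit_transform(features[numeric_cols])
print(features.shape, target.tolist())
```

In a full workflow the scaler would normally be fitted on the training split only and then applied to the test split, so that test-set statistics do not leak into training.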

Addressing the Class Imbalance Issue


The dataset shows a considerable imbalance, with only 8.1% of leads actually making deposits,
while 91.9% did not. It's important to address this class imbalance to avoid the model
becoming biased towards the majority class, which could lead to inaccurate predictions for the
minority class. To tackle this issue, we used SMOTE (Synthetic Minority Over-sampling
Technique) to help balance the dataset.

Figure-7: Target Variable Distribution (Before & After SMOTE)
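To make the idea behind SMOTE concrete, here is a simplified NumPy sketch of its core step: each synthetic minority point is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is an illustration only, not the function used in the project (the report does not name the library); the function name, data, and parameters are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by interpolating toward near neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances among the minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base point and one of its k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new point at a random position on the connecting segment
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Synthetic data with roughly the 92/8 imbalance described above
rng = np.random.default_rng(42)
X_majority = rng.normal(0.0, 1.0, size=(92, 2))
X_minority = rng.normal(3.0, 0.5, size=(8, 2))

synthetic = smote_like_oversample(X_minority, n_new=len(X_majority) - len(X_minority))
X_minority_balanced = np.vstack([X_minority, synthetic])
print(len(X_majority), len(X_minority_balanced))  # 92 92
```

Because new points are interpolated rather than copied, the minority class is enlarged without exact duplicates, which is the main difference from plain random oversampling.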

Statistical Models: Parametric & Non-Parametric Models


For the parametric model, we used logistic regression. For the non-parametric approach, we used a
Support Vector Machine (SVM), owing to its ability to discern patterns in classification tasks.

| Model | Hyperparameters Applicable | Rationale |
|---|---|---|
| Logistic Regression (Parametric Model) | C, max_iter, penalty, solver | Logistic regression's probabilistic approach enables it to assign class probabilities, allowing for adaptable decision thresholds in handling class imbalance. Techniques like class weights or resampling can further enhance its performance by addressing skewed class distributions. |
| Support Vector Machine (Non-Parametric Model) | C, gamma, kernel | SVMs offer robust handling of non-linear data through kernel functions. Additionally, they address class-imbalanced datasets by incorporating class weights or employing cost-sensitive learning techniques to manage misclassification costs. |

Table-2: Statistical Model Deployed
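As a rough illustration of Table-2, the sketch below sets up both models in scikit-learn with the listed hyperparameters and shows how a probability threshold can be adjusted for logistic regression. The data is synthetic, the hyperparameter values are placeholder starting points rather than the tuned ones, and the 0.3 threshold is purely hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the campaign dataset (assumption)
X, y = make_classification(n_samples=500, n_features=6,
                           weights=[0.9, 0.1], random_state=0)

# Parametric model: penalty is left at scikit-learn's default (l2);
# class_weight="balanced" is one of the imbalance options mentioned in Table-2
log_reg = LogisticRegression(C=1.0, solver="lbfgs", max_iter=1000,
                             class_weight="balanced")

# Non-parametric model: the RBF kernel handles non-linear decision boundaries
svm = SVC(C=1.0, gamma="scale", kernel="rbf", class_weight="balanced")

log_reg.fit(X, y)
svm.fit(X, y)

# Logistic regression outputs probabilities, so the cut-off can be moved
proba = log_reg.predict_proba(X)[:, 1]
flag_default = int((proba >= 0.5).sum())   # customers flagged at the usual cut-off
flag_lenient = int((proba >= 0.3).sum())   # hypothetical lower cut-off flags more
print(flag_default, flag_lenient)
```

Lowering the threshold trades precision for recall, which is the kind of adjustment the table's rationale column refers to.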

Steps Taken Prior to Deployment of the Models


To evaluate how well our models perform on new data, we used the following approach:

• First, we split the dataset into training and testing sets, and applied the SMOTE technique
only to the training set to handle any class imbalance.

• Then, for each model, we ran 5-fold cross-validation to get a reliable estimate
of its accuracy. The results are summarized in the table below.
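The two steps above can be sketched with scikit-learn as follows. The data is synthetic, and plain random oversampling via sklearn.utils.resample is used here as a stand-in for SMOTE so that the example needs no extra library; it is not the resampling method the project used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the campaign dataset (assumption)
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.92, 0.08], random_state=0)

# Step 1: stratified split, so both parts keep the original class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Oversample the minority class in the training split only (stand-in for SMOTE);
# the test split is left untouched so it reflects the real class balance
minority = X_train[y_train == 1]
n_extra = int((y_train == 0).sum() - (y_train == 1).sum())
extra = resample(minority, replace=True, n_samples=n_extra, random_state=0)
X_bal = np.vstack([X_train, extra])
y_bal = np.concatenate([y_train, np.ones(len(extra), dtype=int)])

# Step 2: 5-fold cross-validated mean accuracy for each model
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(kernel="rbf"),
}
cv_means = {
    name: cross_val_score(model, X_bal, y_bal, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
print(cv_means)
```

One caveat with this simple layout: cross-validating on data that has already been oversampled lets copies (or, with SMOTE, close relatives) of the same minority sample land in both the training and validation folds, which can inflate the scores. A stricter variant resamples inside each fold.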

| Model Used | Cross Validation Score (Mean Accuracy) |
|---|---|
| Logistic Regression (Parametric Model) | 0.88307 |
| Support Vector Machine (Non-Parametric Model) | 0.90972 |

Table-3: Mean Accuracy of Models (CV)

We can also visualize the performance of these models through the Box plot below:

Figure-6: Mean Accuracy of Models (CV)- Box Plot


In our cross-validation analysis, Logistic Regression demonstrated a comparatively lower
performance, achieving an accuracy score of 0.88, while SVM emerged as the top-performing model
with the highest accuracy score.

Estimation Method- Bayesian Estimation


Bayesian Estimation, our chosen method for this project, involves estimating unknown parameters
by treating them as random variables with probability distributions. In our context, we applied this
approach, in the form of Bayesian optimization, to search for the hyperparameters of our models,
including the non-parametric SVM, and so improve their performance in classification tasks.
The optimal hyperparameters obtained after running Bayesian optimization are presented in the
table below.
| Model Deployed | Optimal Hyperparameters obtained after running Bayesian Optimization |
|---|---|
| Logistic Regression (Parametric Model) | C: 5.4341; penalty: l2 |
| Support Vector Machine (Non-Parametric Model) | C: 100.0; gamma: 0.1; kernel: rbf |

Table-4: Optimal Hyperparameters for Selected Models
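The report does not state which tool performed the Bayesian optimization, so the following is only a minimal sketch of the general technique: a Gaussian process models the cross-validated score as a function of one hyperparameter (log10 of C for logistic regression), and an expected-improvement rule picks the next value to try. The data, search range, scoring metric, and iteration counts are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data standing in for the campaign dataset (assumption)
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

def objective(log_c):
    """Mean 5-fold ROC AUC of logistic regression with C = 10 ** log_c."""
    model = LogisticRegression(C=10.0 ** log_c, max_iter=1000)
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

rng = np.random.default_rng(0)
candidates = np.linspace(-3, 2, 101).reshape(-1, 1)  # assumed search range for log10(C)

# Start from a few random evaluations
tried = [float(v) for v in rng.uniform(-3, 2, size=3)]
scores = [objective(v) for v in tried]

for _ in range(7):
    # Surrogate model: a belief about the score at every untried value
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True, random_state=0)
    gp.fit(np.array(tried).reshape(-1, 1), np.array(scores))
    mean, std = gp.predict(candidates, return_std=True)

    # Expected improvement over the best score seen so far
    best = max(scores)
    z = (mean - best) / np.maximum(std, 1e-9)
    ei = (mean - best) * norm.cdf(z) + std * norm.pdf(z)

    # Evaluate the most promising candidate and update the history
    next_x = float(candidates[np.argmax(ei), 0])
    tried.append(next_x)
    scores.append(objective(next_x))

best_log_c = tried[int(np.argmax(scores))]
print(f"best C found: {10.0 ** best_log_c:.4f}")
```

The appeal of this approach over a grid search is that each new evaluation is chosen using everything learned so far, which matters when a single evaluation (for example, cross-validating an SVM) is expensive.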

Evaluating & Comparing Results


Due to the significant imbalance in our target class (with only 8.1% representing the minority class),
accuracy may not be an appropriate metric. Models can achieve high accuracy by mostly predicting
the majority class, thereby not reflecting true performance.
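A short worked example shows why accuracy alone is misleading at this level of imbalance. It uses a hypothetical set of 1,000 leads with the 8.1% positive rate stated above and a trivial "model" that predicts the majority class for everyone.

```python
# Hypothetical 1,000 leads with the 8.1% minority rate quoted in the text
n_positive = 81
n_negative = 919
y_true = [1] * n_positive + [0] * n_negative

# A trivial "model" that always predicts the majority class (no deposit)
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_positive

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.919
print(f"recall   = {recall:.3f}")    # recall   = 0.000
```

The baseline reaches 91.9% accuracy while finding none of the customers who would subscribe, which is why recall, precision, F1, and ROC are reported alongside accuracy.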
The results of the deployed models are summarized in the table below:

| Evaluation Metric | Reason for Choosing | Logistic Regression | Support Vector Machine |
|---|---|---|---|
| Recall | High recall ensures that most potential customers are identified, minimizing the risk of missing out on valuable marketing opportunities. | 0.89 | 0.87 |
| Accuracy | Accuracy provides a general measure of the model's overall correctness in identifying both potential customers and uninterested individuals. | 0.85 | 0.86 |
| Precision | High precision ensures that most of the identified potential customers are genuinely interested, minimizing wasted marketing resources on uninterested individuals. | 0.35 | 0.36 |
| F1 Score | The F1 score ensures a balance between correctly identifying potential customers and minimizing false positives in marketing campaigns. | 0.50 | 0.50 |
| ROC | The ROC curve helps in evaluating the trade-off between true positive and false positive rates, optimizing customer targeting in marketing campaigns. | 0.93 | 0.86 |

Table-5: Results of Models deployed

• Recall: Both models have high recall values, but Logistic Regression is slightly better at
0.89 compared to SVM’s 0.87, meaning it identifies a few more potential customers.
• Accuracy: SVM has a marginally higher accuracy at 0.86 compared to Logistic
Regression’s 0.85, indicating it’s slightly more reliable overall in correctly identifying both
potential and uninterested customers.
• Precision: Both models have low precision, with SVM at 0.36 and Logistic Regression at
0.35, indicating similar performance and room for improvement in targeting genuinely
interested customers.
• F1-Score: Both models score equally on the F1 score at 0.50, showing a balanced trade-off
between recall and precision but indicating moderate performance.
• ROC: Logistic Regression has a notably higher ROC score of 0.93 compared to SVM’s
0.86, making it better at managing the trade-off between true positive and false positive rates
for effective customer targeting.
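To show how the metrics discussed above relate to one another, the sketch below computes them from a confusion matrix. The counts are hypothetical: they were chosen for 1,000 leads at an 8.1% positive rate so that the results land close to the Logistic Regression column, and they are not the project's actual confusion matrix.

```python
# Hypothetical confusion-matrix counts for 1,000 leads (81 actual subscribers)
tp = 72   # subscribers correctly flagged
fn = 9    # subscribers missed
fp = 134  # non-subscribers wrongly flagged
tn = 785  # non-subscribers correctly left out

recall = tp / (tp + fn)                      # share of real subscribers found
precision = tp / (tp + fp)                   # share of flagged leads who subscribe
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"recall    = {recall:.2f}")     # recall    = 0.89
print(f"precision = {precision:.2f}")  # precision = 0.35
print(f"f1        = {f1:.2f}")         # f1        = 0.50
print(f"accuracy  = {accuracy:.3f}")   # accuracy  = 0.857
```

With so few positives, even a modest false-positive rate among the 919 non-subscribers produces more false alarms than true hits, which is why high recall and accuracy can coexist with precision near 0.35.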

Figure-7: Comparison of Models


Best Model for Analysis: SVM emerges as the best model owing to its higher cross-validated mean
accuracy (0.91 vs 0.88) and its marginally higher test accuracy and precision, although Logistic
Regression leads on recall and ROC and both models tie on F1-score.
The model's performance is presented through visual representations of its confusion matrix and
ROC curve. Overall, the model seems to be performing well with a high number of True Positives
and True Negatives. This indicates that the model is good at correctly classifying both positive and
negative cases.

Figure-8: Confusion Matrix & ROC Curve (Logistic Regression)

Insights gathered
Upon the analysis of the data using logistic regression and ROC curve analysis, the following
insights have been gathered:
| Insights | Insights Drawn |
|---|---|
| Call Duration and Deposit | There appears to be a correlation between call duration and deposit activity, with shorter calls often associated with leads who haven't made deposits. This suggests that brief calls may not be as effective in converting these leads, so the company might benefit from spending more time engaging with non-depositing leads. |
| Leverage Call Duration | Longer conversations with leads who have a high school or university education tend to be more successful, but it's still important to keep calls within a reasonable length to avoid overextending resources. |
| Align Campaigns with Economic Trends | Aligning campaign objectives with national economic conditions could also be valuable. For example, adjusting strategies based on economic indicators like interest rates can help resonate with broader spending behaviors and inflation trends. |
| Strategic Timing | While deposits tend to peak from May to July, exploring campaigns after December post-financial year-end may capture bonuses and other incentives. Additionally, analyzing the higher conversion rates seen from October to December could offer insights for optimizing campaign timing. |

These insights suggest opportunities for the company to optimize marketing campaigns by refining
communication strategies, targeting demographics effectively, and allocating resources strategically
based on lead behavior.

Conclusion
In conclusion, this project has made great strides in using data science to improve customer
engagement and fine-tune marketing strategies for the telecommunications company. By digging
into essential business questions and applying advanced statistical methods, we’ve been able to
identify customer segments that are more likely to respond positively to our campaigns. Our best
model reached 86% accuracy in predicting conversions.
However, our analysis did point out some challenges, particularly regarding class imbalance and
missing data, especially in January and February. While our sample size of 41,000 is fairly robust,
we believe that incorporating newer data and clearer historical information from past campaigns
could really take our model performance to the next level.
Throughout our work, we’ve relied on a variety of metrics—like accuracy, F1 score, precision,
recall, and ROC curve analysis—to evaluate how well our models are doing. Looking ahead, we
recommend focusing on hyperparameter tuning, enhancing feature engineering, and exploring
ensemble methods to further boost our models’ effectiveness.
Moreover, we suggest creating detailed customer profiles and developing tailored campaigns. This
approach will provide a solid foundation for informed marketing strategies. The insights we’ve
gained from logistic regression will be particularly valuable in understanding stock market trends
and consumer behavior. By continually refining our models and enriching our data, we’re dedicated
to making smart marketing decisions that truly resonate with our audience and maximize our return
on investment.
References
Satpathy, S. (2023, November 17). SMOTE for Imbalanced classification with Python. Analytics
Vidhya. https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-usingsmote-techniques/

Wang, W. (2022, March 22). Bayesian optimization concept explained in Layman terms. Medium.
https://towardsdatascience.com/bayesian-optimization-concept-explained-inlayman-terms-1d2bcdeaf12f

Bank marketing campaigns dataset | Opening deposit. (n.d.). Kaggle: Your Machine Learning and
Data Science Community. https://www.kaggle.com/datasets/volodymyrgavrysh/bankmarketing-campaigns-dataset
