Reference Report 2
CHURN
Content of Report
1. Introduction To Business Problem
2. EDA and Business Implications
3. Data Cleaning & Pre-Processing
4. Model Building & Improve Model Performance
5. Model Comparison
6. Model Validation
7. Final Interpretation & Recommendations
8. Appendix
1. Introduction to the Business Problem
Problem Statement: -
An e-commerce company is facing intense competition in the current market, and it has become a challenge to retain existing customers in this competitive situation. Hence, the DTH company wants to develop a model through which it can predict churn at the account level and provide segmented offers to the potential churners. For this company, account churn is a major concern because one account can have multiple customers; by losing one account, the company may be losing more than one customer.
We have been assigned to develop a churn prediction model for this company and provide business recommendations on the campaign. The model or campaign has to be unique, and the offers it suggests have to be sharp. The offers suggested should create a win-win situation for the company as well as the customers, so that the company does not take a hit on revenue and is still able to retain its customers.
Need of the study/project
This study/project is essential for the business to plan for the future in terms of product design, sales, and rolling out different offers for different segments of clients. The outcome of this project will give a clear understanding of where the firm stands now and what capacity it holds in terms of taking risk. It will also indicate the future prospects of the organization, how it can improve and plan better, and how it can retain customers in the longer run.
Understanding business/social opportunity
This is a case study of an e-commerce company in which customers are assigned a unique account ID, and a single account ID can hold many customers (like a family plan) across gender and marital status. Customers get flexibility in the mode of payment they want to opt for. Customers are further segmented across the various types of plans they opt for as per their usage, which also depends on the device they use (computer or mobile); moreover, they earn cashback on bill payments.
The overall business runs on customer loyalty and stickiness, which in turn comes from providing quality and value-added services. Running various promotional and festival offers may also help the organization acquire new customers and retain the old ones.
We can conclude that a customer retained is a regular income for the organization, a customer added is a new income for the organization, and a customer lost has a negative impact, since a single account ID holds multiple customers, i.e., closure of one account ID means losing multiple customers.
It is a great opportunity for the company, as almost every individual or family needs a DTH connection, which in turn also leads to increased competition. The question that arises is how a company can differentiate itself from its competitors, and which parameters play a vital role in earning customer loyalty and making customers stay. These factors will decide the best player in the market.
2. EDA and Business Implications
Data Report
Table 3: - Describing Dataset
Except for the variables “AccountID”, “Churn”, “rev_growth_yoy” and “coupon_used_for_payment”, all other variables have null values present.
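A minimal sketch of how this null-value summary can be produced with pandas is shown below; the file name used to load the dataset is illustrative, not the actual source file.

import pandas as pd

# Load the churn dataset (file name is illustrative)
df = pd.read_excel("churn_data.xlsx")

# Count missing values per column to identify the variables that need treatment
print(df.isnull().sum().sort_values(ascending=False))

# Share of missing values per column, in percent
print((df.isnull().mean() * 100).round(2))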
Exploratory data analysis
Univariate analysis (distribution and spread for every continuous attribute,
distribution of data in categories for categorical ones)
Univariate Analysis: -
The variables show outliers in the data, which need to be treated in further steps.
None of the variables shows a normal distribution; all are skewed in nature.
Fig 2: - Count plot of categorical variables
Fig 3: - Strip plot across variables
The average service score given by customers for the service provided is around “3”, which shows an area of improvement.
Most of the customers are in the “Super+” segment, and the least number of customers are in the “Regular” segment.
Most of the customers availing services are “Married”.
Most of the customers prefer “Mobile” as the device to avail services.
Bi-variate Analysis: -
Pair plot across all categorical data and its impact on the target variable.
The pair plot shown above indicates that the independent variables are weak or poor predictors of the target variable, as the density of the independent variables overlaps with the density of the target variable.
Fig 5: - Contribution of categorical variables towards churn
City_tier “1” has shown maximum churning when compared with “2” and “3”.
Customers with preferred mode of payment as “debit card” and “credit card” are more prone to churn.
Customers with gender “Male” show a higher churn ratio as compared to “Female”.
Customers in the “Regular Plus” segment show more churn.
Single customers are more likely to churn when compared with divorced and married customers.
Customers using the service over mobile show more churn.
Correlation among variables: -
We have performed correlation between variables after treating bad data and missing values. We have also converted the categorical variables into integer data types to check the correlation, as categorical data types would not appear in the pictures below.
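A minimal sketch of how such a correlation matrix and heatmap can be produced with pandas and seaborn, assuming the cleaned DataFrame is named df:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric (integer-encoded) variables
corr = df.corr(numeric_only=True)

# Heatmap to inspect which variables move together with Churn
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation among variables")
plt.show()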
Inferences from correlation: -
Variable “Tenure” shows high correlation with Churn.
Variable “Marital Status” shows high correlation with Churn.
Variable “complain_ly” shows high correlation with Churn.
Removal of unwanted variables: - After an in-depth understanding of the data, we conclude that removal of variables is not required at this stage of the project. We could remove the variable “AccountID”, which denotes a unique ID assigned to each customer; however, removing it would leave 8 duplicate rows. All the remaining variables look important based on the univariate and bi-variate analysis.
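A minimal sketch of this duplicate check after dropping the identifier column, assuming the DataFrame is named df:

# Drop the identifier column and count rows that become exact duplicates
without_id = df.drop(columns=["AccountID"])
print("Duplicate rows after dropping AccountID:", without_id.duplicated().sum())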
3. Data Cleaning and Pre-processing
Outlier treatment: -
This dataset is a mix of continuous as well as categorical variables. It does not make sense to perform outlier treatment on categorical variables, as each category denotes a type of customer. So, we perform outlier treatment only on the variables that are continuous in nature.
We have 8 continuous variables in the dataset, namely “Tenure”, “CC_Contacted_LY”, “Account_user_count”, “cashback”, “rev_per_month”, “Day_Since_CC_connect”, “coupon_used_for_payment” and “rev_growth_yoy”.
We have used an upper limit and a lower limit to cap the outliers. Below is the pictorial representation of the variables before and after outlier treatment.
Fig 7: - Variables before and after outlier treatment
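The report does not spell out the exact limit formula; a minimal sketch of a common IQR-based capping approach over the eight continuous variables is given below, as one way to implement the upper- and lower-limit treatment described above (it assumes these columns are already numeric):

# Cap each continuous variable at the IQR-based lower and upper limits
continuous_cols = [
    "Tenure", "CC_Contacted_LY", "Account_user_count", "cashback",
    "rev_per_month", "Day_Since_CC_connect", "coupon_used_for_payment",
    "rev_growth_yoy",
]
for col in continuous_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)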
We look at the unique observations in the variable and see that we have “#” and “nan” present in the data, where “#” is an anomaly and “nan” represents a null value.
We look at the unique observations in the variable and see the presence of null values as shown below.
Fig 10: - before treatment
We replace “nan” with the calculated mode of the variable, and now we do not see any presence of null values.
We converted the data type to integer, because the IDE had recognized it as an object data type due to the presence of bad data.
We replace “nan” with the calculated median of the variable, and now we do not see any presence of null values.
We converted the data type to integer, because the IDE had recognized it as an object data type due to the presence of bad data.
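A minimal sketch of this replace-impute-convert pattern is shown below; the column names and the choice of mode versus median are illustrative placeholders for the variables treated in the report:

import numpy as np
import pandas as pd

# Turn the anomaly token into NaN, impute with the median, then cast to integer
df["Tenure"] = pd.to_numeric(df["Tenure"].replace("#", np.nan), errors="coerce")
df["Tenure"] = df["Tenure"].fillna(df["Tenure"].median()).astype(int)

# For discrete variables the report imputes with the mode instead
df["City_Tier"] = pd.to_numeric(df["City_Tier"], errors="coerce")
df["City_Tier"] = df["City_Tier"].fillna(df["City_Tier"].mode()[0]).astype(int)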
Fig 15: - after treatment
We look at the unique observations in the variable and see the presence of null values as shown below.
Treating Variable “account_segment”
We look at the unique observations in the variable and see the presence of null values as well as different denotations for the same type of observation, as shown below.
We replace “nan” with the calculated mode of the variable, and now we do not see any presence of bad data or null values.
We then convert the variable to an integer data type as it will be used for further model building.
We look at the unique observations in the variable and see the presence of null values as shown below.
We replace “nan” with the calculated mode of the variable and also label the observations, where 1 = Single, 2 = Divorced and 3 = Married; now we do not see any presence of bad data or null values.
We then convert the variable to an integer data type as it will be used for further model building.
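A minimal sketch of this mode imputation and label encoding for the marital status variable; the column name “Marital_Status” is an assumption:

# Impute missing marital status with the mode, then map the labels to integers
df["Marital_Status"] = df["Marital_Status"].fillna(df["Marital_Status"].mode()[0])
df["Marital_Status"] = df["Marital_Status"].map(
    {"Single": 1, "Divorced": 2, "Married": 3}
).astype(int)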
We replace “+” with “nan” and then replace “nan” with the calculated median of the variable; now we do not see any presence of bad data or null values.
We then convert the variable to an integer data type as it will be used for further model building.
We replace “nan” with the calculated mode of the variable, and now we do not see any presence of null values.
We then convert the variable to an integer data type as it will be used for further model building.
We replace “$” with “nan” and then replace “nan” with the calculated median of the variable; now we do not see any presence of bad data or null values.
We then convert the variable to an integer data type as it will be used for further model building.
We replace “$”, “*” and “#” with “nan” and then replace “nan” with the calculated median of the variable; now we do not see any presence of bad data or null values.
We then convert the variable to an integer data type as it will be used for further model building.
We replace “$” with “nan” and then replace “nan” with the calculated median of the variable; now we do not see any presence of bad data or null values.
We then convert the variable to an integer data type as it will be used for further model building.
We replace “$” with “nan” and then replace “nan” with the calculated median of the variable; now we do not see any presence of bad data or null values.
We then convert the variable to an integer data type as it will be used for further model building.
We replace “&&&&” with “nan” and then replace “nan” with the calculated mode of the variable. We also label the observations, where 1 = Mobile and 2 = Computer; now we do not see any presence of bad data or null values.
We then convert the variable to an integer data type as it will be used for further model building.
Variable transformation: -
We see that different variables have different dimensions. For example, the variable “Cashback” denotes currency, whereas “CC_Agent_Score” denotes a rating provided by the customers. Due to this, they differ in their statistical ranges as well.
Scaling is therefore required for this dataset; it will normalize the data so that the standard deviation of each variable comes close to “0”.
We use MinMaxScaler to perform normalization of the data.
Standard Deviation Before and After Normalization: -
We see that the standard deviations of the variables are now close to “0”.
We also converted the variables to the int data type, which will help in the further model building process.
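A minimal sketch of this normalization step with scikit-learn's MinMaxScaler; excluding “AccountID” and the target “Churn” from scaling is an assumption about how the feature set was defined:

from sklearn.preprocessing import MinMaxScaler

# Scale all predictor columns into the [0, 1] range
feature_cols = [c for c in df.columns if c not in ("AccountID", "Churn")]
scaler = MinMaxScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])

# Inspect the spread after scaling by printing the standard deviations
print(df[feature_cols].std().round(3))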
4. Model Building
From the above visual and non-visual analysis, we can conclude that this is a classification problem, in which the target variable needs to be classified into “Yes” or “No”. As data analysts, we have the algorithms below available to build the desired mechanism to predict whether a given customer will churn or not: -
Splitting Data into Train and Test Dataset: -
Following accepted market practice, we have divided the data into train and test datasets in a 70:30 ratio, building the various models on the training dataset and testing for accuracy on the testing dataset.
Below is the shape of Train and Test dataset: -
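A minimal sketch of the 70:30 split is shown here; the random_state and the stratification on the target are assumptions, as the report does not state them:

from sklearn.model_selection import train_test_split

# 70:30 split of predictors and target
X = df.drop(columns=["Churn"])
y = df["Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)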
Fig 45: - Before and after SMOTE
The increase in density of the orange dots indicates the increase in data points.
The approach is to observe model performance across the various algorithms with their default parameters, and then to check model performance after tuning different hyper-parameters. We will also observe model performance on the balanced dataset to check whether it outperforms the models built on the imbalanced dataset.
After building various models and analysing various parameters, we conclude that “KNN” with default values outperforms all the other models built. We have concluded this on the basis of Accuracy, F1 score, Recall, Precision and AUC score.
After splitting the data into training and testing datasets, we fitted a KNN model on the training dataset and performed predictions on the training and testing datasets using the same model. We built the first model with default hyperparameters, with the default value of n_neighbors as “5”.
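A minimal sketch of this baseline KNN model with scikit-learn defaults, reusing the X_train/X_test split from the sketch above:

from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# KNN with default hyperparameters (n_neighbors=5)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

print("Train accuracy:", accuracy_score(y_train, knn.predict(X_train)))
print("Test accuracy:", accuracy_score(y_test, knn.predict(X_test)))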
Below is the classification report obtained from this model: -
Below are the AUC scores and ROC curves obtained from this model: -
We can observe that the cross-validation scores are almost the same for all the folds, which indicates that the model is stable across folds.
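A minimal sketch of the cross-validation check; a 10-fold split is used here, matching the 10-fold scores reported later for the SMOTE model:

from sklearn.model_selection import cross_val_score

# 10-fold cross-validation on the training data to check stability across folds
cv_scores = cross_val_score(knn, X_train, y_train, cv=10)
print(cv_scores.round(3), "mean:", cv_scores.mean().round(3))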
It is very important to have the right value of n_neighbors to fetch the best accuracy from the model. We can decide on the best value for n_neighbors based on MSE (mean squared error) scores: the value with the least MSE indicates the least error and will give the best optimized n_neighbors value.
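A minimal sketch of this search; for 0/1 predictions the mean squared error equals the misclassification rate, and the range of k values tried here is an assumption:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Error on the test set for a range of k values; the smallest error wins
mse_scores = []
k_values = range(1, 21)
for k in k_values:
    pred = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).predict(X_test)
    mse_scores.append(np.mean(pred != y_test))

best_k = k_values[int(np.argmin(mse_scores))]
print("Best n_neighbors:", best_k)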
Below are the MSE scores: -
Below is the graphical version of the MSE scores across numerous values of n_neighbors.
From the plotted graph we can see that n_neighbors with value “5” gives the least MSE score, so we can proceed and build the KNN model with an n_neighbors value of “5”, which is also the default. Hence, building a separate model with the tuned number of neighbours is not required, as it is the same as the default value of n_neighbors.
After building the model with its default values as shown above, we try to find the best hyperparameters to check whether we can improve on the accuracy achieved by the model built with default hyperparameter values. From GridSearchCV we found that the best parameters are “ball_tree” as the algorithm, “manhattan” as the metric, “5” as n_neighbors and “distance” as the weights.
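A minimal sketch of this GridSearchCV step; the exact grid searched is not shown in the report, so the candidate values below are assumptions chosen around the reported best parameters:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "algorithm": ["auto", "ball_tree", "kd_tree"],
    "metric": ["euclidean", "manhattan"],
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_)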
Below are the accuracy scores obtained from this model using GridSearchCV: -
Below are the confusion matrices obtained from this model using GridSearchCV: -
Below is the classification report obtained from this model using GridSearchCV: -
Below are the AUC scores and ROC curves obtained from this model using GridSearchCV: -
Fig 56: - ROC Curve and AUC Score from KNN with Hyperparameter Tuning
Fig 57: - Cross Validation Scores from KNN with Hyperparameter Tuning
We can observe that the cross-validation scores are almost the same for all the folds, which indicates that the model is stable across folds.
Building model using SMOTE: -
From the above descriptive analysis, we can conclude that the original data provided is imbalanced in nature; by using the SMOTE technique we can balance the data and check whether the model performs better when the data is balanced. We have applied the SMOTE technique to oversample the minority class and obtain a balanced dataset.
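A minimal sketch of the oversampling step with imbalanced-learn's SMOTE; applying it only to the training split (so the test set stays untouched) and the random_state are assumptions:

from imblearn.over_sampling import SMOTE

# Oversample the minority (churn) class in the training data only
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
print(y_train.value_counts(), y_train_sm.value_counts(), sep="\n")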
Below are the AUC scores and ROC curves obtained from balanced dataset: -
Fig 61: ROC Curve and AUC Scores From KNN with SMOTE
Below are the 10-fold cross validation scores: -
We can observe that the cross-validation scores are almost the same for all the folds, which indicates that the model is stable across folds.
5. Model Comparison
Overall comparison of the models built, across parameters: -
6. Model Validation
When it comes to model validation for a classification problem statement, we cannot rely on accuracy alone; we need to look at various other parameters such as F1 score, Recall, Precision, ROC curve and AUC score, along with the confusion matrix. The details of these parameters are described below: -
Confusion Matrix:
A confusion matrix usually causes a lot of confusion, even in those who use it regularly. The terms used in defining a confusion matrix are TP, TN, FP, and FN.
True Positive (TP): - The actual value is positive and the prediction was also classified as positive.
False Positive (FP): - The actual value was negative but was falsely classified as positive.
True Negative (TN): - The actual value was negative and was also classified as negative, which is the correct classification.
False Negative (FN): - The actual value was positive but was falsely classified as negative.
Recall (Sensitivity) = TP / (TP + FN): out of all the real positive cases, how many were identified as positive.
Specificity = TN / (TN + FP): out of all the real negative cases, how many were identified as negative.
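A minimal sketch of how these validation metrics can be computed, using the baseline KNN model and test split from the earlier sketches as an illustration:

from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

y_pred = knn.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

recall = tp / (tp + fn)        # sensitivity
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
print(f"Recall={recall:.3f}  Specificity={specificity:.3f}  Precision={precision:.3f}")

print(classification_report(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1]))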
7. Final Interpretation / Recommendations
Insights From Analysis: -
From the tabular representation of all the scores for the training and testing datasets across the various models, we can conclude that the KNN model with default values of the hyper-parameters is best optimized for the given dataset (highlighted in BOLD).
There is a marginal difference in accuracy between Logistic Regression and LDA, but comparatively LDA performed a little better than Logistic Regression.
The models with bagging and boosting are also well optimized, but the difference in accuracy between the training and testing datasets is a little on the higher side as compared to KNN.
The other models, namely Naïve Bayes, LDA and SVM, worked well on the training dataset, but the accuracy came down when evaluated on the testing dataset, which indicates overfitting in those models.
All the models built on the balanced dataset showed overfitting.
We also understand that the accuracy and other measuring parameters of a model can be improved by trying various other combinations of hyper-parameters. Model building is an iterative process, and model performance on both the training and testing datasets can be improved further.
Recommendations: -
Four Stages of Churn Management
Prevent customers from attrition by analyzing various churn signals and triggers.
Focus on saving customers from leaving the firm through several campaigns.
Other Recommendations: -
The business can introduce a referral drive for existing customers to acquire new customers.
The business can tie up with other lifestyle vendors to provide vouchers to new as well as existing loyal customers.
The business can internally bifurcate its customers based on spending pattern into deal seekers, tariff optimizers etc. and can have a different acquisition strategy for each set of customers.
A specialized customer service team for top-notch customers, to avoid waiting time and give a better customer experience and interaction.
Understanding customer profiles and sending a small token of gift on special days.
Thanking customers with hand-written notes on invoices will create a goodwill factor.
The business needs to make sure that all complaints and queries raised are resolved on time.
The business can promote the use of its own e-wallet as a payment option by giving a certain discount on the bill.
The business needs to come up with subsidized offers for customers who are single, as they show a high tendency to churn.
The business needs to introduce an all-in-one family plan with extra services; it will make accessibility easier for customers.
The business can promote payment via standing instruction on the bank account or UPI, which can be hassle-free and safe for customers.
Fig 64: Graphical representation of customer division based on spending and loyalty
Customers under “High Loyalty & Low Spending” can be offered a bundled family floater plan to increase their spending.
The 4th quadrant can be the major area of observation for the business, where customers are “low on loyalty but high on spending”; they can be retained by increasing the service level index and with proper follow-up on running offers and subscriptions.
8. Appendix
Code Snapshots: -
Hyperparameters: -
----------------------------------------END OF REPORT-------------------------------------------