
CAPSTONE PROJECT - CUSTOMER CHURN

Final Report - Prepared By: Abhay Ankit

Content of Report

1. Introduction to Business Problem
2. EDA and Business Implications
3. Data Cleaning & Pre-Processing
4. Model Building & Improving Model Performance
5. Model Comparison
6. Model Validation
7. Final Interpretation & Recommendations
8. Appendix

1. Introduction to the Business Problem

Problem Statement: -
An e-commerce company is facing a lot of competition in the current market, and retaining existing customers has become a challenge. Hence, the DTH company wants to develop a model through which it can predict churn at the account level and provide segmented offers to potential churners. For this company, account churn is a major concern because one account can have multiple customers; by losing one account, the company may lose more than one customer.
We have been assigned to develop a churn prediction model for this company and provide business recommendations on the retention campaign. The model and campaign have to be unique and sharply targeted when offers are suggested. The suggested offers should create a win-win situation for the company as well as the customers, so that the company does not take a hit on revenue and is still able to retain its customers.
Need of the study/project
This study/project is essential for the business to plan for the future in terms of product design, sales, and rolling out different offers for different segments of clients. The outcome of this project will give a clear understanding of where the firm stands now and what capacity it holds in terms of taking risk. It will also indicate the future prospects of the organization, how it can improve and plan better, and how it can retain customers in the longer run.
Understanding business/social opportunity
This is a case study of an e-commerce company in which customers are assigned a unique account ID, and a single account ID can hold many customers (like a family plan) across gender and marital status. Customers get flexibility in terms of the mode of payment they want to opt for. Customers are further segmented across the various types of plans they choose as per their usage, which also depends on the device they use (computer or mobile); moreover, they earn cashback on bill payments.
The overall business runs on customer loyalty and stickiness, which in turn come from providing quality, value-added services. Running various promotional and festival offers may also help the organization acquire new customers and retain existing ones.
We can conclude that a customer retained is regular income for the organization, a customer added is new income, and a customer lost has a negative impact, because a single account ID holds multiple customers; the closure of one account ID means losing multiple customers.
This is a great opportunity for the company, as almost every individual or family needs a DTH connection, which in turn also increases competition. The question is how a company can differentiate itself from other competitors, and which parameters play a vital role in earning customer loyalty and making customers stay. These factors will decide the best player in the market.
2. EDA and Business Implications
Data Report

Dataset of problem: - Customer Churn Data


Data Dictionary: -
 AccountID -- unique account identifier
 Churn -- account churn flag (target variable)
 Tenure -- tenure of the account
 City_Tier -- tier of the primary customer's city
 CC_Contacted_L12m -- how many times the customers of the account have contacted customer care in the last 12 months
 Payment -- preferred payment mode of the customers in the account
 Gender -- gender of the primary customer of the account
 Service_Score -- satisfaction score given by customers of the account on the service provided by the company
 Account_user_count -- number of customers tagged with this account
 account_segment -- account segmentation on the basis of spend
 CC_Agent_Score -- satisfaction score given by customers of the account on the customer care service provided by the company
 Marital_Status -- marital status of the primary customer of the account
 rev_per_month -- monthly average revenue generated by the account in the last 12 months
 Complain_l12m -- whether any complaint has been raised by the account in the last 12 months
 rev_growth_yoy -- revenue growth percentage of the account (last 12 months vs. months 24 to 13)
 coupon_used_l12m -- how many times customers have used coupons to make payments in the last 12 months
 Day_Since_CC_connect -- number of days since a customer of the account last contacted customer care
 cashback_l12m -- monthly average cashback generated by the account in the last 12 months
 Login_device -- preferred login device of the customers in the account
Data Ingestion: -
Loaded the required packages, set the working directory and loaded the data file.
The dataset has 11,260 observations and 19 variables (18 independent and 1 dependent or target variable).

Table 1 – glimpse of data-frame head with top 5 rows
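For reference, a minimal sketch of the ingestion step in Python; the file name and format are assumptions, not taken from the original notebook:

import pandas as pd

# Assumed file name; replace with the actual path of the churn data file.
df = pd.read_csv("Customer_Churn_Data.csv")

print(df.shape)    # expected: (11260, 19)
print(df.head())   # glimpse of the top 5 rows (Table 1)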


Understanding how data was collected in terms of time, frequency and methodology
 Data has been collected for 11,260 randomly sampled unique account IDs, across gender and marital status.
 Looking at the variables "CC_Contacted_L12m", "rev_per_month", "Complain_l12m", "rev_growth_yoy", "coupon_used_l12m", "Day_Since_CC_connect" and "cashback_l12m", we can conclude that the data has been collected for the last 12 months.
 The data has 19 variables: 18 independent and 1 dependent or target variable, which shows whether the customer churned or not.
 The data combines the services customers are using, their payment options, and basic individual details.
 The data is a mix of categorical and continuous variables.
Visual inspection of data (rows, columns, descriptive details)
 Data has 11,260 rows and 19 variables.

Table 2:- Dataset Information

Fig 1:- Shape of dataset


 Describing data: - This shows the variation in the various statistical measures across variables, indicating that each variable is unique and has its own distribution.

Table 3: - Describing Dataset
 Except for the variables "AccountID", "Churn", "rev_growth_yoy" and "coupon_used_for_payment", all other variables have null values present.

Table 4: - Showing Null Values in Dataset

 The data has no duplicate observations.


 With the above understanding of the data, renaming of any of the variables is not required.
 We can now move to the EDA part, where we will understand the data a little better while treating bad data, null values and outliers.

Exploratory data analysis
Univariate analysis (distribution and spread for every continuous attribute,
distribution of data in categories for categorical ones)
Univariate Analysis: -
 The variables show outliers in the data, which need to be treated in further steps.

Table 5: - Showing Outliers in data

 None of the variables is normally distributed; all are skewed in nature.

Table 6:- Showing skewness and kurtosis

Fig 2: - Count plot of categorical variables

Fig 3: - Strip plot across variables

Inferences from count plot: -


 Maximum customers are from city tier "1", which indicates the high population density in this city type.
 The maximum number of customers prefer debit and credit cards as their preferred mode of payment.
 The proportion of male customers is higher compared to female customers.

 The average service score given by customers for the service provided is around "3", which shows room for improvement.
 Most of the customers are in the "Super+" segment and the fewest customers are in the "Regular" segment.
 Most of the customers availing services are "Married".
 Most of the customers prefer "Mobile" as the device to avail services.

Bi-variate Analysis: -
 Pair plot across all categorical data and its impact towards the target variable.

Fig 4: - Pair plot across categorical variables

 The pair-plot shown above indicates that the independent variables are weak or poor predictors of the target variable, as the density of each independent variable overlaps across the two classes of the target variable.

Fig 5: - Contribution of categorical variables towards churn

 City tier "1" has shown maximum churn when compared with tiers "2" and "3".
 Customers with preferred mode of payment as "debit card" or "credit card" are more prone to churn.
 Customers with gender "Male" show a higher churn ratio compared to female customers.
 Customers in the "Regular Plus" segment show more churn.
 Single customers are more likely to churn when compared with divorced and married customers.
 Customers using the service over mobile show more churn.
Correlation among variables: -
We have performed correlation between variables after treating bad data and missing values. We also converted the variables to integer data types before checking correlation, as categorical data types would not show up in the figure below.
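A minimal sketch of how the correlation matrix can be produced, assuming the cleaned and integer-encoded data-frame df from the previous steps:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation is computed only on numeric columns, which is why the
# categorical variables were label-encoded to integers beforehand.
corr = df.corr(numeric_only=True)

plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation among variables")
plt.show()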

Fig 6: - Correlation among variables

Inferences from correlation: -
 Variable "Tenure" shows high correlation with Churn.
 Variable "Marital_Status" shows high correlation with Churn.
 Variable "Complain_ly" shows high correlation with Churn.
Removal of unwanted variables: - After an in-depth understanding of the data, we conclude that removal of variables is not required at this stage of the project. The variable "AccountID", which denotes a unique ID assigned to each account, could be removed; however, removing it introduces 8 duplicate rows. All remaining variables look important based on the univariate and bi-variate analysis.
3. Data Cleaning and Pre-processing

Outlier treatment: -
This dataset is a mix of continuous and categorical variables. It does not make sense to perform outlier treatment on categorical variables, as each category denotes a type of customer. So, we perform outlier treatment only on the variables that are continuous in nature.
 We have 8 continuous variables in the dataset, namely "Tenure", "CC_Contacted_LY", "Account_user_count", "cashback", "rev_per_month", "Day_Since_CC_connect", "coupon_used_for_payment" and "rev_growth_yoy".
 We have used upper and lower limits to cap outliers. Below is the pictorial representation of the variables before and after outlier treatment, followed by a sketch of the capping step.

Fig 7: - Before and after outlier treatment
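A hedged sketch of the capping step, assuming the usual 1.5 x IQR whiskers as the upper and lower limits and that the listed columns are already numeric (bad-data tokens replaced):

continuous_cols = ["Tenure", "CC_Contacted_LY", "Account_user_count", "cashback",
                   "rev_per_month", "Day_Since_CC_connect",
                   "coupon_used_for_payment", "rev_growth_yoy"]

def cap_outliers(series):
    # Clip values lying outside the 1.5*IQR whiskers back to the whisker limits.
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

for col in continuous_cols:
    df[col] = cap_outliers(df[col])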

Missing Value treatment and variable transformation: -


 Out of 19 variables, we have data anomalies present in 17 variables and null values in 15 variables.
 We use the "median" to impute null values where the variable is continuous in nature, because the median is less sensitive to outliers than the mean.
 We use the "mode" to impute null values where the variable is categorical in nature.
 We have treated null values variable by variable, as each variable is unique in nature; a generic sketch of this pattern is given below.
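A minimal sketch of the imputation pattern applied variable by variable in the sub-sections below; the anomaly tokens are the ones found in this dataset, while the column groupings shown are illustrative, not exhaustive:

import numpy as np
import pandas as pd

# Tokens such as "#", "@", "+", "$", "*" and "&&&&" are bad data: convert them to NaN first.
df = df.replace(["#", "@", "+", "$", "*", "&&&&"], np.nan)

# Continuous variables: impute with the median, which is robust to outliers.
for col in ["Tenure", "CC_Contacted_LY", "Account_user_count", "rev_per_month"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
    df[col] = df[col].fillna(df[col].median()).astype(int)

# Categorical variables: impute with the mode (most frequent category).
for col in ["Payment", "Gender", "account_segment", "Marital_Status", "Login_device"]:
    df[col] = df[col].fillna(df[col].mode()[0])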
Treating Variable “Tenure”

 We look at the unique observations in the variable and see that "#" and "nan" are present in the data, where "#" is an anomaly and "nan" represents a null value.

Fig 8: - before treatment


 Replaced "#" with "nan" and then replaced "nan" with the calculated median of the variable; we no longer see any bad data or null values.
 Converted the data type to integer, because the IDE had recognized it as an object data type due to the presence of bad data.

Fig 9: - after treatment

Treating Variable “City_Tier”

 We look at the unique observations in the variable and see the presence of null values, as shown below.

Fig 10: - before treatment

 We replace "nan" with the calculated mode of the variable; we no longer see any null values.
 Converted the data type to integer for further model building, as the presence of null values had caused it to be read as a non-integer type.

Fig 11: - after treatment

Treating Variable “CC_Contacted_LY”


 We look at the unique observations in the variable and see the presence of null values, as shown below.

Fig 12: - before treatment

 We replace "nan" with the calculated median of the variable; we no longer see any null values.
 Converted the data type to integer for further model building.

Fig 13: - after treatment

Treating Variable “Payment”


 We look at the unique observations in the variable and see the presence of null values, as shown below.

Fig 14: - before treatment


 We replace "nan" with the calculated mode of the variable; we no longer see any null values.
 Also performed label encoding of the observations, where 1 = Debit card, 2 = UPI, 3 = Credit card, 4 = Cash on delivery and 5 = E-wallet, and then converted the variable to integer data type for further model building.

Fig 15: - after treatment
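A sketch of the manual label encoding described above; the integer codes follow the bullet, while the exact spelling of the category strings in the raw file is an assumption. The same pattern applies to the other categorical variables treated below.

payment_map = {"Debit Card": 1, "UPI": 2, "Credit Card": 3,
               "Cash on Delivery": 4, "E wallet": 5}

# Map the text categories to the integer codes used in the report,
# then cast to int so the column is ready for model building.
df["Payment"] = df["Payment"].map(payment_map).astype(int)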

Treating Variable “Gender”


 We look at the unique observations in the variable and see the presence of null values and multiple abbreviations of the same category, as shown below.

Fig 16: - before treatment


 We replace "nan" with the calculated mode of the variable; we no longer see any null values.
 Also performed label encoding of the observations, where 1 = Female and 2 = Male, and then converted the variable to integer data type for further model building.

Fig 17: - after treatment

Treating Variable “Service_Score”

 We look at the unique observations in the variable and see the presence of null values, as shown below.

Fig 18: - before treatment


 We replace "nan" with the calculated mode of the variable; we no longer see any null values.
 Then converted the variable to integer data type for further model building.

Fig 19: - after treatment

Treating Variable “Account_user_count”


 We look at the unique observations in the variable and see the presence of null values as well as "@" as bad data, as shown below.

Fig 20: - before treatment


 Replaced "@" with "nan" and then replaced "nan" with the calculated median of the variable; we no longer see any bad data or null values.
 Then converted the variable to integer data type for further model building.

Fig 21: - after treatment

Treating Variable “account_segment”
 We look at the unique observations in the variable and see the presence of null values as well as different notations for the same category, as shown below.

Fig 22: - before treatment


 Replaced "nan" with the calculated mode of the variable and also labelled the account segments, where 1 = Super, 2 = Regular Plus, 3 = Regular, 4 = HNI and 5 = Super Plus; we no longer see any bad data or null values.
 Then converted the variable to integer data type for further model building.

Fig 23: - after treatment

Treating Variable “CC_Agent_Score”


 We look at the unique observations in the variable and see the presence of null values, as shown below.

Fig 24: - before treatment

 Replaced "nan" with the calculated mode of the variable; we no longer see any null values.
 Then converted the variable to integer data type for further model building.

Fig 25: - after treatment

Treating Variable “Marital_Status”

 We look at the unique observations in the variable and see the presence of null values, as shown below.

Fig 26: - before treatment

 Replaced "nan" with the calculated mode of the variable and also labelled the observations, where 1 = Single, 2 = Divorced and 3 = Married; we no longer see any bad data or null values.
 Then converted the variable to integer data type for further model building.

Fig 27: - after treatment


Treating Variable “rev_per_month”
 We look at the unique observations in the variable and see the presence of null values as well as "+", which denotes bad data, as shown below.

Fig 28: - before treatment

 Replaced "+" with "nan" and then replaced "nan" with the calculated median of the variable; we no longer see any bad data or null values.
 Then converted the variable to integer data type for further model building.

Fig 29: - after treatment

Treating Variable “Complain_ly”


 We look at the unique observations in the variable and see the presence of null values, as shown below.

Fig 30: - before treatment

 Replaced "nan" with the calculated mode of the variable; we no longer see any null values.
 Then converted the variable to integer data type for further model building.

Fig 31: - after treatment

Treating Variable “rev_growth_yoy”


 We look at the unique observations in the variable and see the presence of "$", which denotes bad data, as shown below.

Fig 32: - before treatment

 Replaced "$" with "nan" and then replaced "nan" with the calculated median of the variable; we no longer see any bad data or null values.
 Then converted the variable to integer data type for further model building.

Fig 33: - after treatment

Treating Variable “coupon_used_for_payment”


 We look at the unique observations in the variable and see the presence of "$", "*" and "#", which denote bad data, as shown below.

Fig 34: - before treatment

 Replaced "$", "*" and "#" with "nan" and then replaced "nan" with the calculated median of the variable; we no longer see any bad data or null values.
 Then converted the variable to integer data type for further model building.

Fig 35: - after treatment

Treating Variable “Day_Since_CC_connect”


 We look at the unique observations in the variable and see the presence of "$", which denotes bad data, as well as null values, as shown below.

Fig 36: - before treatment

 Replaced "$" with "nan" and then replaced "nan" with the calculated median of the variable; we no longer see any bad data or null values.
 Then converted the variable to integer data type for further model building.

Fig 37: - after treatment

Treating Variable “cashback”


 We look at the unique observations in the variable and see the presence of "$", which denotes bad data, as well as null values, as shown below.

Fig 38: - before treatment

 Replaced "$" with "nan" and then replaced "nan" with the calculated median of the variable; we no longer see any bad data or null values.
 Then converted the variable to integer data type for further model building.

Fig 39: - after treatment

Treating Variable “Login_device”


 We look at the unique observations in the variable and see the presence of "&&&&", which denotes bad data, as well as null values, as shown below.

Fig 40: - before treatment

 Replaced "&&&&" with "nan" and then replaced "nan" with the calculated mode of the variable; also labelled the observations, where 1 = Mobile and 2 = Computer; we no longer see any bad data or null values.
 Then converted the variable to integer data type for further model building.

Fig 41: - after treatment

Count of null values before and after treatment



Fig 42: - Before and after null value treatment


 We see no null values across the variables, which indicates that the data is now clean and we can move on to data transformation if required.

Variable transformation: -
 We see that different variables have different dimensions. For example, the variable "cashback" denotes currency, whereas "CC_Agent_Score" denotes a rating provided by the customers. As a result, they also differ in their statistical measures.
 Scaling is therefore required for this dataset; it will normalize the data so that the standard deviations come close to "0".
 Using the MinMax scaler to perform the normalization of the data (see the sketch below).
Standard Deviation Before and After Normalization: -

Fig 43: - Before and after Normalization

 We see that the standard deviations of the variables are now close to "0".
 Also converted the variables to int data type, which will help in the further model-building process.
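A minimal sketch of the normalization step using scikit-learn's MinMaxScaler, assuming the cleaned data-frame df and that the target column "Churn" is kept aside from scaling:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X = df.drop(columns=["Churn"])
y = df["Churn"]

scaler = MinMaxScaler()                 # rescales each feature to the [0, 1] range
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

print(X_scaled.std())                   # standard deviations after scaling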

Addition of new variables: -


At the current stage we do not see the need to create any new variables. New variables may be required at a later stage of model building and can be created accordingly.

4. Model Building
From the above visual and non-visual analysis, we can conclude that this is a classification problem, where the target variable needs to be classified into "Yes" or "No".
As data analysts, we have the below algorithms available to build the desired mechanism to predict whether a given customer will churn or not: -

 Logistic Regression -- Logistic Regression is a “Supervised machine learning”


algorithm that can be used to model the probability of a certain class or event. It is
used when the data is linearly separable and the outcome is binary or dichotomous
in nature. That means Logistic regression is usually used for Binary classification
problems.
 Linear Discriminant Analysis (LDA) -- Linear Discriminant Analysis, or LDA for short,
is a predictive modelling algorithm for multi-class classification. It can also be used as
a dimensionality reduction technique, providing a projection of a training dataset
that best separates the examples by their assigned class.
 KNN -- KNN works by finding the distances between a query and all the examples in
the data, selecting the specified number examples (K) closest to the query, then votes
for the most frequent label (in the case of classification) or averages the labels (in the
case of regression).
 Naïve Bayes -- Naive Bayes is a kind of classifier which uses the Bayes Theorem. It
predicts membership probabilities for each class such as the probability that given
record or data point belongs to a particular class. The class with the highest
probability is considered as the most likely class.
 Bagging (Random Forest) -- Bagging, also known as bootstrap aggregation, is the
ensemble learning method that is commonly used to reduce variance within a noisy
dataset. In bagging, a random sample of data in a training set is selected with
replacement—meaning that the individual data points can be chosen more than
once.
 Ada-Boosting -- AdaBoost (adaptive boosting) builds an ensemble sequentially: each new weak learner gives more weight to the observations that the previous learners misclassified. This process can reduce both bias and variance in machine learning, which helps to create more effective results.
 Gradient Boosting -- Gradient boosting is a type of machine learning boosting. It relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error. If a small change in the prediction for a case causes no change in error, then the next target outcome for that case is zero.
 Support Vector Machine (SVM) -- SVM is a supervised machine learning algorithm
which can be used for classification or regression problems. It uses a technique
called the kernel trick to transform your data and then based on these
transformations it finds an optimal boundary between the possible outputs.

Splitting Data into Train and Test Dataset: -
Following accepted market practice, we have divided the data into train and test datasets in a 70:30 ratio, building the various models on the training dataset and testing for accuracy on the test dataset.
Below is the shape of Train and Test dataset: -

Fig 44: - Shape of training and test dataset
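A hedged sketch of the 70:30 split; the random_state is an assumed seed, and stratifying on the target is optional but keeps the churn ratio similar in both splits:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y,
    test_size=0.30,      # 70:30 split as per accepted market practice
    random_state=42,     # assumed seed for reproducibility
    stratify=y)          # optional: preserve the churn ratio in both sets

print(X_train.shape, X_test.shape)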

Treating the imbalanced nature of the data


 The dataset provided is imbalanced in nature. The categorical counts of our target variable "Churn" show high variation: the count of "0" is 9364 and the count of "1" is 1896.

Table 7: - Imbalanced dataset


 This imbalance can be addressed using the SMOTE technique, which generates additional synthetic data points to balance the data.
 We need to apply SMOTE only to the train dataset, not to the test dataset. The data has been divided into train and test datasets in a 70:30 ratio as an accepted market practice (this can be changed later if instructed).


Table 8: - Before and after SMOTE

Fig 45: - Before and after SMOTE
 The increase in density of the orange dots indicates the increase in data points.
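A sketch of the oversampling step, assuming the imbalanced-learn package; SMOTE is fitted only on the training data so that the test set stays untouched:

from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)                 # assumed seed
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

print("Before SMOTE:", Counter(y_train))       # imbalanced class counts
print("After SMOTE :", Counter(y_train_sm))    # roughly balanced classes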

The approach is to observe model performance across the various algorithms with their default parameters, and then to check model performance after tuning different hyper-parameters. We will also observe model performance on the balanced dataset to check whether it outperforms the imbalanced dataset.
After building the various models and analysing the various parameters, we conclude that "KNN" with default values outperforms all the other models built. We have concluded this on the basis of Accuracy, F1 score, Recall, Precision and AUC score.

BUILDING KNN MODEL: -


Building model with default hyperparameters: -

After splitting the data into training and testing datasets, we fitted a KNN model on the training dataset and performed predictions on both the training and testing datasets using the same model. The first model was built with default hyperparameters, with the default value of n_neighbors as "5".
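A minimal sketch of this step (default hyperparameters, so n_neighbors = 5), assuming the train/test splits created above:

from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()          # defaults: n_neighbors=5, minkowski metric
knn.fit(X_train, y_train)

print("Train accuracy:", accuracy_score(y_train, knn.predict(X_train)))
print("Test accuracy :", accuracy_score(y_test, knn.predict(X_test)))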

Below are the accuracy scores obtained from this model: -

Fig 46: - Accuracy Scores From KNN

Below are the confusion matrices obtained from this model: -

Fig 47: - Confusion Matrix From KNN

Below are the classification Report obtained from this model: -

Fig 48: - Classification Report From KNN

Below are the AUC scores and ROC curves obtained from this model: -

Fig 49: - ROC Curve and AUC Scores From KNN

Below are the 10-fold cross validation scores: -

Fig 50: - Cross Validation Scores From KNN

We can observe that the cross-validation scores are almost the same across all folds, which indicates that the model is stable.
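A sketch of the 10-fold cross-validation check (the scoring metric is assumed to be accuracy):

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(knn, X_train, y_train, cv=10, scoring="accuracy")
print(cv_scores)                                   # similar scores across folds indicate stability
print("Mean:", cv_scores.mean(), "Std:", cv_scores.std())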

Effort to improve model performance.

Finding the right value of n_neighbors: -

It is very important to use the right value of n_neighbors to get the best accuracy from the model. We can decide on the best value for n_neighbors based on MSE (mean squared error) scores: the value with the lowest MSE indicates the least error and gives the best-optimized n_neighbors value.

Below are the MSE scores: -

Fig 51: - MSE Scores

Below is the graphical version of the MSE scores across numerous values of n_neighbors.

Fig 52: - Graphical Version Of MSE Score

From the above plotted graph we can see that n_neighbors with value "5" gives the lowest MSE score, so we proceed and build the KNN model with n_neighbors = "5", which is also the default value. Hence, building a separate model with the tuned number of neighbors is not required, as it is the same as the default value of n_neighbors.
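A hedged sketch of the search over n_neighbors, scoring each value by the misclassification rate (the "MSE" of 0/1 predictions) on the test set; the range of k values tried is an assumption:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

k_values = list(range(1, 21))                    # assumed search range
mse = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    mse.append(np.mean(model.predict(X_test) != y_test))   # misclassification rate

plt.plot(k_values, mse, marker="o")
plt.xlabel("n_neighbors")
plt.ylabel("MSE (misclassification rate)")
plt.show()

print("Best k:", k_values[int(np.argmin(mse))])  # expected to be 5 here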

Building model using GridSearchCV and getting the best hyperparameters: -

After building the model with its default values as shown above, we try to find the best hyper-parameters to check whether we can outperform the accuracy achieved by the model built with default hyperparameter values. From GridSearchCV we found that the best parameters are "ball_tree" as the algorithm, "manhattan" as the metric, "5" as n_neighbors and "distance" as the weights.
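A sketch of the grid search; the parameter grid is an assumption built around the best values reported above:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {"algorithm": ["ball_tree", "kd_tree", "brute"],
              "metric": ["manhattan", "euclidean"],
              "n_neighbors": [3, 5, 7, 9],
              "weights": ["uniform", "distance"]}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)              # reported best: ball_tree, manhattan, 5, distance
best_knn = grid.best_estimator_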

Below are the accuracy scores obtained from this model using GridSearchCV: -

Fig 53: - Accuracy From KNN with Hyperparameter Tuning

Below are the confusion matrices obtained from this model using GridSearchCV: -

Fig 54: - Confusion Matrix From KNN with Hyperparameter Tuning

Below is the classification report obtained from this model using GridSearchCV: -

Fig 55: - Classification Report From KNN with Hyperparameter Tuning

Below are the AUC scores and ROC curves obtained from this model using GridSearchCV: -

Fig 56: - ROC Curve and AUC Score From KNN with Hyperparameter Tuning

Below are the 10-fold cross validation scores: -

Fig 57: Cross Validation Scores From KNN with Hyperparameter Tuning

We can observe that the cross-validation scores are almost the same across all folds, which indicates that the model is stable.

Building model using SMOTE: -

From the above descriptive analysis we can conclude that the original data provided is imbalanced in nature; by using the SMOTE technique we can balance the data and check whether the model outperforms when the data is balanced. We have applied the SMOTE technique to oversample the minority class and obtain a balanced dataset.

Below are the accuracy scores obtained from balanced dataset: -

Fig 58: Accuracy Score From KNN with SMOTE

Below are the confusion matrices obtained from balanced dataset: -

Fig 59: Confusion Matrix From KNN with SMOTE

Below are the classification reports obtained from balanced dataset: -

Fig 60: Classification Reports From KNN with SMOTE

Below are the AUC scores and ROC curves obtained from balanced dataset: -

Fig 61: ROC Curve and AUC Scores From KNN with SMOTE

Below are the 10-fold cross validation scores: -

Fig 62: Cross Validation Scores From KNN with SMOTE

We can observe that the cross-validation scores are almost the same across all folds, which indicates that the model is stable.

Inference/Conclusion from KNN Model: -


From the above we can conclude that the data is neither "overfit" nor "underfit" in nature. We can also infer that the model built using GridSearchCV is a well-optimized model for prediction. However, we can see noticeable variations in accuracy score, F1 score, recall, precision, ROC curves and AUC scores when compared with the KNN model built with default values and with the model built on the balanced dataset. The model built on the balanced dataset using the SMOTE technique works well on the training dataset; however, there is a significant drop in accuracy on the testing dataset.

Inferences on final model: -


 From the tabular representation of all the scores for the training and testing datasets across the various models, we can conclude that the KNN model with default hyper-parameter values is the best optimized for the given dataset (highlighted in BOLD).
 There is a marginal difference in accuracy between Logistic Regression and LDA, but comparatively LDA performed a little better than Logistic Regression.
 The models with bagging and boosting are also well optimized, but the difference in accuracy between the training and testing datasets is slightly higher compared to KNN.
 Other models, namely Naïve Bayes, LDA and SVM, worked well on the training dataset but their accuracy came down on the testing dataset, which indicates overfitting in those models.
 All models built on the balanced dataset showed overfitting.
 We also understand that the accuracy and other performance measures of a model can be improved by trying various other combinations of hyper-parameters; model building is an iterative process, and performance on both the training and testing datasets can be improved further.
A similar approach to the one shown above has been used to create the other models, and below is the model comparison across parameters. We have showcased the results of KNN because we concluded that it outperforms all the other models.

5. Model Comparison
Overall model-building comparison across parameters: -

Table 9: Comparison Across Various Models

Indicators/symbols for above tabular data: -


 CV: - indicates scores for the model built with the best parameters obtained from GridSearchCV, with the model name as prefix.
 SM: - indicates scores for the model built on the balanced dataset, with the model name as prefix.
 KNN-5: - indicates the KNN model built with n_neighbors as "5".

Implication of final model on Business: -


 Using the model built above, the business can plan various strategies to make customers stick with them.
 They can roll out different offers and discounts as a family floater.
 They can give regular discount coupons if customers pay via the company's e-wallet platform.
 Discount vouchers of other vendors, or on the next bill, can be provided based on a minimum bill criterion.
 This model gives the business an idea of where it stands currently and what best it can do to improve.

6. Model Validation
When it comes to model validation for a classification problem, we cannot rely on accuracy alone; we need to look at various other parameters such as F1 score, Recall, Precision, the ROC curve and the AUC score, along with the confusion matrix. These parameters are described below: -

Fig 63: - A look at the confusion matrix

Confusion Matrix:
The confusion matrix usually causes a lot of confusion, even for those who use it regularly. The terms used in defining a confusion matrix are TP, TN, FP and FN.
True Positive (TP): - The actual value is positive and the prediction is also classified as positive.
False Positive (FP): - The actual value is negative but is falsely classified as positive.
True Negative (TN): - The actual value is negative and is also classified as negative, which is the right thing to do.
False Negative (FN): - The actual value is positive but is falsely classified as negative.

Various Components Of Classification Report: -


Accuracy: This term tells us how many right classifications were made out of all the
classifications.
Accuracy = (TP + TN) / (TP + FP +TN + FN)
Precision: Out of all that were marked as positive, how many are actually truly positive.
Precision = TP / (TP + FP)
Recall or Sensitivity: Out of all the actual real positive cases, how many were identified
as positive.

Recall = TP/ (TP + FN)
Specificity: Out of all the real negative cases, how many were identified as negative.

Specificity = TN/ (TN + FP)


F1-Score: As we saw above, sometimes we need to give weightage to FP and sometimes
to FN. F1 score is a weighted average of Precision and Recall, which means there is equal
importance given to FP and FN. This is a very useful metric compared to “Accuracy”. The
problem with using accuracy is that if we have a highly imbalanced dataset for training
(for example, a training dataset with 95% positive class and 5% negative class), the model
will end up learning how to predict the positive class properly and will not learn how to
identify the negative class. But the model will still have very high accuracy in the test
dataset too as it will know how to identify the positives really well.
F1 score = 2* (Precision * Recall) / (Precision + Recall)
Area Under Curve (AUC) and ROC Curve: AUC, or Area Under Curve, is used in conjunction with the ROC Curve, which is the Receiver Operating Characteristics Curve. AUC is the area under the ROC Curve. So, let's first understand the ROC Curve.
A ROC Curve is drawn by plotting TPR or True Positive Rate or Recall or Sensitivity (which
we saw above) in the y-axis against FPR or False Positive Rate in the x-axis. FPR = 1-
Specificity (which we saw above).
TPR = TP/ (TP + FN)
FPR = 1 – TN/ (TN+FP) = FP/ (TN + FP)
When we want to select the best model, we want a model that is closest to the perfect model, in other words a model with AUC close to 1. When we say a model has a high AUC score, it means the model's ability to separate the classes is very high (high separability). This is a very important metric that should be checked while selecting a classification model.
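As an illustration, all of these metrics can be computed directly with scikit-learn for any of the fitted models; the KNN model from above is assumed here:

from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, confusion_matrix, classification_report)

y_pred = knn.predict(X_test)
y_prob = knn.predict_proba(X_test)[:, 1]          # predicted probability of the churn class

print(confusion_matrix(y_test, y_pred))           # [[TN, FP], [FN, TP]] counts
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))
print(classification_report(y_test, y_pred))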

7. Final Interpretation / Recommendations
Insights From Analysis: -

 The business has visibility mostly in tier-1 cities.
 Most customers rated "3" for the services provided by the business.
 Most customers rated "3" for their interactions with customer care representatives.
 Transactions via UPI and e-wallet are very low.
 Maximum churn is from the account segment "Regular Plus".
 Customers with marital status "Single" contribute the most towards churn.
 Complaints raised in the last 12 months do not show any impact on churn.
 Tenure and cashback are directly proportional to each other.
 Computer usage is highest in tier-1 cities, followed by tier-3 and tier-2 cities.
Insights From Model Building: -

 From the tabular representation of all the scores for the training and testing datasets across the various models, we can conclude that the KNN model with default hyper-parameter values is the best optimized for the given dataset (highlighted in BOLD).
 There is a marginal difference in accuracy between Logistic Regression and LDA, but comparatively LDA performed a little better than Logistic Regression.
 The models with bagging and boosting are also well optimized, but the difference in accuracy between the training and testing datasets is slightly higher compared to KNN.
 Other models, namely Naïve Bayes, LDA and SVM, worked well on the training dataset but their accuracy came down on the testing dataset, which indicates overfitting in those models.
 All models built on the balanced dataset showed overfitting.
 We also understand that the accuracy and other performance measures of a model can be improved by trying various other combinations of hyper-parameters; model building is an iterative process, and performance on both the training and testing datasets can be improved further.

Recommendations: -
 Four Stages of Churn Management:
 Introduce pre-defined customer segmentation according to customers' needs and usage.
 Acquire customers through different strategies.
 Delight customers in order to increase customer loyalty as compared to competitors.
 Prevent customers from attrition by analyzing various churn signals and triggers.
 Focus on saving customers from leaving the firm through several campaigns.

Other Recommendations: -

 Business can introduce a referral drive for existing customers to acquire new customers.
 Business can partner with other lifestyle vendors to provide vouchers to new as well as existing loyal customers.
 Business can internally bifurcate its customers based on spending pattern into deal seekers, tariff optimizers, etc., and can have a different acquisition strategy for each set of customers.
 Offering free cloud storage to loyal customers.
 Customized email responses to priority customers, based on segmentation, for better customer interaction.

 A specialized customer service team for top-notch customers, to avoid waiting time and provide a better customer experience and interaction.
 Understanding customer profiles and sending a small token gift on special days.
 Thanking customers with handwritten notes on invoices will create a goodwill factor.
 Following up on customer issues and taking regular feedback on the same.
 Conducting satisfaction surveys to understand changes in customer behavior.
 Business needs to make sure that all complaints and queries raised are resolved on time.
 Business can promote the use of its own e-wallet as a payment option by giving a certain discount on the bill.
 Business needs to come up with subsidized offers for customers who are single, as they show a high tendency to churn.
 Business needs to introduce an all-in-one family plan with extra services; it will make accessibility easier for customers.
 Business needs to increase its visibility in tier-2 cities for better customer acquisition.
 Business can promote payment via standing instructions on a bank account or UPI, which can be hassle-free and safe for customers.

Customer Segment Approach: -

Fig 64: Graphical representation of customer division based on spending and loyalty

 Customers can be divided into 4 sets as shown in the figure.
 At times the business needs to take the harsh decision of letting go of customers who are "low on loyalty and low on spending".
 Customers in the "High Loyalty & High Spending" set can be retained by delighting them with various offers.

 Customers in the "High Loyalty & Low Spending" set can be offered a bundled family floater plan to increase their spending.
 The fourth quadrant can be the major area of observation for the business: customers who are "low on loyalty but high on spending" can be retained by increasing the service level index and with proper follow-up on running offers and subscriptions.

Appendix
Codes Snapshot: -

Hyperparameters: -

----------------------------------------END OF REPORT-------------------------------------------

