PREDICTIVE MODELLING ASSIGNMENT

-PREDICTION OF CUSTOMER CHURN

By:
Pranav Viswanathan

1. Initial Discovery
1.1. Initial analysis
Customer churn is a major concern for telecom companies. It is an important
factor in determining what the future holds for the company: a business that
has grown exponentially for a long period can quickly be left behind once its
customers start leaving. Studying churn helps to identify the major reasons a
customer leaves a particular provider.
One of the hardest problems in this type of situation is recognising when and
why churn occurred; a customer who was once regular and loyal is now a
customer of a competitor.
Analysing churn over time helps to study the pattern of customers leaving,
their reasons for leaving, and how to balance the loss with new customers.

1.2. Business problem


In this project, we simulate one such case of customer churn, working with
data on postpaid customers who are on a contract. The data contains
information about customer usage behaviour, contract details and payment
details, and it also indicates which customers cancelled their service. Based
on this past data, we need to build a model that can predict whether a
customer will cancel their service in the future.

1.2.1. Data in hand


Variables         Description
Churn             1 if customer cancelled the service, 0 if not
AccountWeeks      Number of weeks the customer has had an active account
ContractRenewal   1 if customer recently renewed the contract, 0 if not
DataPlan          1 if the customer has a data plan, 0 if not
DataUsage         Gigabytes of monthly data usage
CustServCalls     Number of calls into customer service
DayMins           Average daytime minutes per month
DayCalls          Average number of daytime calls
MonthlyCharge     Average monthly bill
OverageFee        Largest overage fee in the last 12 months
RoamMins          Average number of roaming minutes

Table 1: Data variables and description

From the above table we can see that there are a total of 11 variables, of
which 10 are predictors and Churn is the dependent variable.

1.3. Initial Hypothesis


NULL Hypothesis (H0): No predictor is available to predict churn.
ALTERNATE Hypothesis (HA): There is at least one independent variable
that can predict churn.

2. Basic Data Preparation


2.1. Setting up the Environment, Importing Libraries and the Dataset:

This step ensures that:

 The environment is set up with a working path.
 The libraries needed for certain functions are imported.
 The dataset is loaded for exploration.

2.1.A Setting up Environment


Before exploring the given dataset, we first set up an environment, more
precisely the location from which the data is read and to which results are
saved. This is done with the help of setwd(), which sets the working
directory.

getwd() is a function that returns the location that has been set.

FIG 1: Setting up environment

FIG 2: Output-Setting up environment

Fig 1 and Fig 2 above represent the step and output of setting up an
environment for working in RStudio.
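
For reference, a minimal sketch of the commands the two figures show, following
the appendix (the directory path is the one used there and will differ on other
machines):

setwd('C:\\Users\\Viswanathan\\Desktop\\pgp-babi')   # set the working directory
getwd()                                              # confirm the directory that is now set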

2.1.B Importing Libraries


This step imports the libraries that are essential for using certain functions
for data processing.
Libraries have built-in functions for various manipulation and statistical
processes.

FIG 3: Importing Libraries

Details on certain important Libraries:


corrplot: used for plotting the correlation matrix of the dataset.
ROCR: used for finding the AUC of models.
e1071: for Naïve Bayes.
car: for regression diagnostics such as VIF.
ggplot2: for plotting different charts.
ineq: for performance metrics.

FIG 4: output of Loaded Libraries

2.1.C Importing Dataset


The dataset is then imported by means of read.csv(), which reads the data
into the R environment.
FIG 5 :Reading dataset

FIG 6 : output of Reading dataset
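
A minimal sketch of the import step shown in Fig 5 and Fig 6, following the
appendix (the file name Cellphone.csv is the one used there):

Data <- read.csv('Cellphone.csv', header = TRUE, sep = ',')   # read the churn dataset
head(Data)   # first 6 rows
tail(Data)   # last 6 rows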

3. Exploratory Data Analysis – Step-by-step approach:

In statistics, exploratory data analysis (EDA) is an approach to analyzing
data sets to summarize their main characteristics, often with visual methods.

The objectives of EDA are to:

 Suggest hypotheses about the causes of observed phenomena.
 Assess assumptions on which statistical inference will be based.
 Support the selection of appropriate statistical tools and techniques.
 Provide a basis for further data collection through surveys or experiments.

3.1 Data preparation
3.1.1 Initial data exploration
3.1.1 A. Top and Bottom Rows
Here the data is read and its characteristics, such as the structure, the
summary and the top and bottom 6 rows, are viewed for further analysis.

FIG 7 : Top and Bottom rows dataset

FIG 8 : Output- Top and Bottom rows dataset

From Fig 7 and Fig 8 we are able to see the top and bottom 6 rows of the
dataset. But this alone does not give proper insight into the dataset, so we
explore further to get its characteristics.

3.1.1 B. Structure of Dataset

str() is a compact way to display the structure of an R object. This allows
you to use it as a diagnostic function and an alternative to summary().
str() outputs one line of information for each basic component and is well
suited to displaying the contents of lists.

FIG 9: Code- Structure of dataset

FIG 10: Output- Structure of dataset

From Fig 10 above we see that there are 11 variables with 3333 observations.
We can also see that Churn, ContractRenewal and DataPlan are categorical
variables stored as numeric values with only a few repeated levels:
Churn with 0s and 1s,
ContractRenewal with 0s and 1s,
and DataPlan with 0s and 1s.
So it is better to convert them to factors.

3.1.1 C. Variable Conversion

Here three variables, namely Churn, ContractRenewal and DataPlan, are
categorical variables stored as numeric values with repeated levels, so we
convert them to factors.

FIG 11: Variable conversion

To check whether they have been converted, we once again use str() to see
the structure of the dataset.

FIG 12:Output: Variable conversion
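
A minimal sketch of the conversion shown in Fig 11 and the check in Fig 12,
following the appendix code:

Data$Churn           <- as.factor(Data$Churn)             # 0/1 -> factor
Data$ContractRenewal <- as.factor(Data$ContractRenewal)   # 0/1 -> factor
Data$DataPlan        <- as.factor(Data$DataPlan)          # 0/1 -> factor
str(Data)                                                 # confirm the new variable types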

3.1.1 D. Statistical parameters of the Dataset

summary() is a generic function used to produce result summaries of data
objects and of various model fitting functions. The function invokes
particular methods which depend on the class of the first argument.
FIG 13:Output: Summary 1

For further important statistical parameters, such as skewness and range,
we use describe().

FIG 13:Output: Summary 2
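
The two summary calls the figures refer to, as used in the appendix (describe()
comes from the psych/Hmisc packages loaded earlier):

summary(Data)    # minimum, quartiles, mean and maximum per variable
describe(Data)   # adds skewness, kurtosis, range and standard error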

3.1.2 Univariate Analysis:


Histogram and boxplot distributions of the dataset:

In univariate analysis we look at each variable on its own, plotting it
against its frequency or count.
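
A minimal sketch of the kind of univariate plot used here, one histogram and one
boxplot per variable; the full set of calls is in the appendix:

par(mfrow = c(1, 2))                                   # histogram and boxplot side by side
hist(Data$MonthlyCharge, xlab = 'MonthlyCharge',
     main = 'Histogram of MonthlyCharge')
boxplot(Data$MonthlyCharge, horizontal = TRUE,
        main = 'Boxplot of MonthlyCharge', xlab = 'MonthlyCharge')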

FIG 14 :Code : Univariate 1

FIG 15 : Output : Univariate 1

INSIGHTS:
 From the histograms we see that all the variables are approximately
normally distributed except Churn, ContractRenewal, DataPlan and
CustServCalls, which take only a few discrete values.
 We see that MonthlyCharge has a great number of outliers.
 We investigate further by looking at separate plots of the above
variables.

NUMERICAL DATA

FIG 16 :Code : Univariate 2

FIG 17 :Output : Univariate 2

INSIGHTS:
 AccountWeeks is approximately normally distributed with a slight right skew.
 DataUsage has a distribution that reflects how much data each customer uses.
 The other variables are distributed approximately normally around their means.
3.1.3 Bi-variate Analysis:
INDEPENDENT: AccountWeeks, CustServCalls, ContractRenewal, DataUsage,
DataPlan, DayMins, DayCalls, MonthlyCharge, OverageFee, RoamMins.
DEPENDENT: Churn
Bi-variate analysis plots each independent variable against the dependent
variable.
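
A minimal sketch of the bivariate plots the next figures show, following the
appendix (densities of continuous predictors split by Churn, and a dodged bar
chart for the customer-service call counts):

library(ggplot2)
library(gridExtra)
p1 <- ggplot(Data, aes(AccountWeeks, fill = Churn)) + geom_density(alpha = 0.4)
p2 <- ggplot(Data, aes(MonthlyCharge, fill = Churn)) + geom_density(alpha = 0.4)
p3 <- ggplot(Data, aes(CustServCalls, fill = Churn)) + geom_bar(position = "dodge")
grid.arrange(p1, p2, p3)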

NUMERICAL VARIABLE:

FIG 18 :Code : Bivariate 1

FIG 19 :Output : Bivariate 1

INSIGHTS:
 From the histograms, almost all continuous predictors such as
AccountWeeks, DayMins, OverageFee and RoamMins have normal distributions.
 MonthlyCharge has a distribution skewed slightly to the left, which can
be ignored.
 Customers who churn and those who don't have mostly similar distributions
for AccountWeeks, with a mean of 102.6 (~103) weeks for Churn(1) and
100.7 (~101) weeks for Not Churn(0).
 On average, customers who churn use more day minutes (207 mins) than
those who don't (175 mins).
 On the other hand, churning customers' data usage (0.54 GB on average)
is lower than that of non-churning customers (0.86 GB).
 Churning customers call customer service more, falling in the bracket of
5-10 calls versus the bracket of 0-5 calls.
 Monthly charges are also higher for churn customers compared to
non-churn customers.

CATEGORICAL VARIABLE:

FIG 20 Code : Bivariate 2

Here the plot is between Churn and the other categorical variables: contract
renewal, data plan and customer service calls.
This is a categorical-vs-categorical comparison and helps to find the
relationship between these variables for further analysis.

FIG 21 Output : Bivariate 2

3.2 OUTLIERS, MISSING VALUES and their TREATMENT:

OUTLIERS:
Outliers are data points that lie far outside the interquartile range. An
outlier may be due to variability in the measurement or it may indicate
experimental error; the latter are sometimes excluded from the dataset. An
outlier can cause serious problems in statistical analyses.
Outliers can have many anomalous causes. A physical apparatus for taking
measurements may have suffered a transient malfunction. There may have
been an error in data transmission or transcription. Outliers arise due to
changes in system behaviour, fraudulent behaviour, human error, instrument
error or simply through natural deviations in populations.
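
A minimal sketch of the outlier check used here, following the appendix (the
boxplot statistics flag points lying beyond 1.5 times the interquartile range):

outlier_usage <- boxplot(Data$DataUsage, plot = FALSE)$out   # values flagged as outliers
print(outlier_usage)
which(Data$DataUsage %in% outlier_usage)                     # row positions of those values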

FIG 22 Code : OUTLIER 1

FIG 23 Output : OUTLIER 1

INSIGHTS:
 From the above plot we can see that DataUsage has a large number of
outliers, followed by MonthlyCharge and DayMins.
 For further analysis, we visualize DataUsage, MonthlyCharge and DayMins
separately to see the positions and values of their outliers.
 DataUsage has many outliers for the churners (class 1).
 DayMins and MonthlyCharge have many outliers in the non-churner
category (class 0).

DATAUSAGE:

FIG 22 Code : OUTLIER 2

FIG 23 Output : OUTLIER 2

DAYMINS:

FIG 24 Code : OUTLIER 3

FIG 25 Output : OUTLIER 3

MONTHLY CHARGE:

FIG 26 Code : OUTLIER 4

FIG 27 Output : OUTLIER 4

INSIGHTS:
 For Churn = 0, MonthlyCharge has a lot of outliers compared to DayMins.
 For Churn = 1, DataUsage has a lot of outliers.
 The best way to treat these outliers is to apply scaling and normalizing
methods, so that the models created will be less prone to error and will
not overfit.

MISSING VALUES:
Missing data, or missing values, occur when no data value is stored for the
variable in an observation.
Missing data are a common occurrence and can have a significant effect
on the conclusions that can be drawn from the data.
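
A minimal sketch of the missing-value check shown in the figures, following the
appendix:

any(is.na(Data))       # TRUE if any value at all is missing
colSums(is.na(Data))   # count of missing values per column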

FIG 28 Code : Missing values

FIG 29 Output : Missing values

INSIGHTS:
From the above figure we see that the data is free from missing values;
hence there is no need for any treatment on this front.

3.3 MULTICOLLINEARITY and its TREATMENT:
Multicollinearity occurs when the independent variables of a regression
model are correlated. If the degree of collinearity between the independent
variables is high, it becomes difficult to estimate the relationship between
each independent variable and the dependent variable, and the overall
precision of the estimated coefficients suffers.
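
A minimal sketch of the correlation check the next figures show, following the
appendix (only the numeric columns enter the correlation matrix):

library(dplyr)
library(corrplot)
cell_numeric <- Data %>% select_if(is.numeric)   # drop the factor columns
corr_matrix  <- round(cor(cell_numeric), 2)      # pairwise correlations
corrplot(corr_matrix)                            # visualise the correlation matrix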

FIG 30 Code : Corrplot

FIG 31 Output : Corrplot

INSIGHTS:
The data suggests there is a very strong correlation between MonthlyCharge
and DataUsage, which is to be expected. So we can retain one of the two
variables and drop the other after evaluation.

TREATMENT:

FIG 32 Code : Multicollinearity treatment

Since MonthlyCharge and DataUsage are very strongly correlated, DataUsage is
dropped from the dataset and MonthlyCharge is retained.

3.4 Summary from EDA


 From the initial analysis we come to know that there are 11 variables
with 3333 observations.
 Out of the 11 variables, 10 are predictors (AccountWeeks, CustServCalls,
ContractRenewal, DataUsage, DataPlan, DayMins, DayCalls, MonthlyCharge,
OverageFee, RoamMins) and one is the dependent variable (Churn).
 From the structure function (str()) we came to know the datatype of each
variable.
 Churn, ContractRenewal and DataPlan were categorical variables with
numeric values and were converted to factors.
 Outliers: DataUsage has many outliers for the churners (class 1), while
DayMins and MonthlyCharge have many outliers in the non-churner
category (class 0).
 The dataset had zero missing values.
 Univariate analysis:
 It helped to see the distribution of each individual variable.
 Most variables were distributed approximately normally about their means.
 Bi-variate analysis:
 From the histograms, almost all continuous predictors such as
AccountWeeks, DayMins, OverageFee and RoamMins have normal distributions.
 MonthlyCharge has a distribution skewed slightly to the left, which
can be ignored.
 Customers who churn and those who don't have mostly similar
distributions for AccountWeeks, with a mean of 102.6 (~103) weeks for
Churn(1) and 100.7 (~101) weeks for Not Churn(0).

4. LOGISTIC REGRESSION
4.1 LOGISTIC REGRESSION MODEL:

DATA PREPARATION:
 The dataset is split in a 70:30 ratio for training and testing the model.

FIG 33 Code : Splitting the dataset

 Checking the dimensions and structure of the dataset.

FIG 34 Code : Dimension of the dataset

FIG 35 Output : Split and Dimension of the dataset
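
A minimal sketch of the 70:30 split shown in the figures, following the appendix
(createDataPartition() keeps the Churn class proportions the same in both parts):

library(caret)
set.seed(332)                                                    # seed used in the appendix
split      <- createDataPartition(Data$Churn, p = 0.7, list = FALSE)
train_Data <- Data[split, ]    # 70% of rows for model building
test_Data  <- Data[-split, ]   # remaining 30% for validation
dim(train_Data); dim(test_Data)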

APPLYING LOGISTIC REGRESSION

FIG 36 Code : LR 1

FIG 37 Output : LR 1

From the above figure we can see that ContractRenewal, DataPlan,
CustServCalls, OverageFee and RoamMins are significant.
Moreover, we need to compute the VIF (Variance Inflation Factor) to further
decide which variables are important to the model.
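
A minimal sketch of the model fit and the follow-up checks discussed here,
following the appendix:

Model1 <- glm(Churn ~ ., data = train_Data, family = 'binomial')   # logistic regression
summary(Model1)                 # coefficient estimates and significance codes
library(car)
vif(Model1)                     # variance inflation factor of each predictor
anova(Model1, test = 'Chisq')   # chi-squared (analysis of deviance) test of the predictors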

VARIANCE INFLATION FACTOR (VIF):

FIG 38 : VIF

From the above we can see that DataPlan, DayMins, MonthlyCharge and
OverageFee have VIF values above 2.
Hence we carry out a chi-squared test for further analysis.

FIG 39 : Chisq 1

FIG 40 : Chisq 2

INSIGHTS:
 From Fig 37 we can see that ContractRenewal, DataPlan, CustServCalls,
OverageFee and RoamMins are significant.
 From Fig 38 we can see that DataPlan, DayMins, MonthlyCharge and
OverageFee have VIF values above 2, hence the chi-squared test for
further analysis.
 After carrying out the chi-squared test to identify the significant
predictors, we see that OverageFee and DayCalls can be left out of the model.

4.2 LOGISTIC REGRESSION INTERPRETATION:


From the model created (Fig 37) and the chi-squared test (Fig 40) we see that:
 ContractRenewal, DataPlan, CustServCalls, OverageFee, RoamMins,
MonthlyCharge and DayMins are significant.
 ContractRenewal and DataPlan have a negative influence on Churn.

Let's find out the odds ratio and probability of each variable impacting
customer churn.
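
A minimal sketch of the odds and probability transformations, assuming the
fitted object Model1 from the appendix; note that an odds value o corresponds
to a probability of o / (1 + o):

odds <- exp(coef(Model1))    # odds ratio per unit change in each predictor
prob <- odds / (1 + odds)    # implied probability on the 0-1 scale
round(cbind(odds, prob), 4)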

FIG 41 : Odds Ratio

FIG 42 : Probability

Variable          Odds Ratio    Probability
AccountWeeks      1.00180114    2.00360228
ContractRenewal   0.13653122    0.27306244
DataPlan          0.25005861    0.50011721
CustServCalls     1.71530831    3.43061662
DayMins           1.00965686    2.01931371
DayCalls          1.00354047    2.00708093
MonthlyCharge     1.01558769    2.03117538
OverageFee        1.11489382    2.22978764
RoamMins          1.07731308    2.15462617

Fig 43: Odds ratio - probability table

From the above table, the variables with an odds ratio below 1 (ContractRenewal
and DataPlan) have a negative impact on Churn.

4.3 PREDICTION:
Since we have confirmed the importance of additional significant variables,
let’s check performance of our Model using a Classification Table /
Confusion Matrix.

Fig 43:Code :Prediction

Fig 44 :Output :Prediction
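
A minimal sketch of the prediction step and confusion matrix shown in the
figures, following the appendix (a 0.5 probability cut-off is used):

Model1_pred      <- predict(Model1, newdata = test_Data, type = 'response')  # churn probabilities
Model1_predicted <- ifelse(Model1_pred > 0.5, 1, 0)                          # classify at 0.5
Model1_predicted <- factor(Model1_predicted, levels = c(0, 1))
confusionMatrix(Model1_predicted, test_Data$Churn, positive = '1')           # caret confusion matrix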

CONFUSION MATRIX ON TEST DATASET:

Fig 45 :Code: CM-LR

Fig 46 :Output: CM-LR

INTERPRETATION:

 31 out of the (31 + 30) customers predicted as churners actually churned.
This is the positive prediction rate (0.51).
 825 out of the (825 + 113) customers predicted as non-churners did not
churn. This is the negative prediction rate (0.88).
 Logistic regression performs poorly as a general model, with a positive
prediction rate of 50.8% and a sensitivity of just 21.5%.
 Of course this model can be improved through better selection of
predictors and their interaction effects, but the general case is the
worst performer.
 The accuracy seems good at 0.86. This creates the accuracy paradox:
overall accuracy is high even though the positive prediction rate is low.

4.4 Interpretation of other Model Performance Measures for Logistic
Regression <KS, AUC, GINI>
ROC PLOT:
It is a plot of the True Positive Rate against the False Positive Rate for the
different possible cut-points of a diagnostic test.
An ROC curve demonstrates several things:
1. It shows the trade-off between sensitivity and specificity (any increase in
sensitivity will be accompanied by a decrease in specificity).
2. The closer the curve follows the left-hand border and then the top border
of the ROC space, the more accurate the test.
3. The closer the curve comes to the 45-degree diagonal of the ROC space,
the less accurate the test.
4. The slope of the tangent line at a cut-point gives the likelihood ratio (LR)
for that value of the test.
5. The area under the curve (AUC) is a measure of test accuracy.
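
A minimal sketch of the ROC/AUC computation the next figures show, following the
appendix (ROCR package, using the test-set probabilities computed above):

library(ROCR)
rocr_pred <- prediction(Model1_pred, test_Data$Churn)   # predicted probabilities vs actual labels
perf      <- performance(rocr_pred, 'tpr', 'fpr')       # ROC curve coordinates
plot(perf, colorize = TRUE)                             # ROC plot with cut-off colouring
as.numeric(performance(rocr_pred, 'auc')@y.values)      # area under the curve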

Fig 47 :AUC

AUC, or area under the curve, is 0.786, i.e. the dataset has 78.6% concordant
pairs.

Fig 48 :ROC

INTERPRETATION:

 At a threshold of 0.5, the true positive rate (TPR) is low, at about 0.22,
while the false positive rate (FPR) is only about 0.04.
 So if the threshold is decreased from 0.5 to 0.3 or 0.2, more cases fall
under the class Churn = 1, which further helps to increase the TPR.
 From the plot we see that the AUC is around 0.786, which implies 78.6%
concordant pairs in the entire dataset.
 Of course this model can be improved through better selection of
predictors and their interaction effects, but the general case is the
worst performer.

KS chart and interpretation:

Fig 49 :Code: KS

Fig 50 :Output: KS

 The two-sample Kolmogorov-Smirnov test is a nonparametric test that
compares the cumulative distributions of Churn = 0 and Churn = 1.
 The KS test reports the maximum difference between the two cumulative
distributions, and calculates a P value from that and the sample sizes.
 Here it is the maximum difference between the cumulative TPR and FPR,
and it is found to be 52%.
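
A minimal sketch of the KS chart construction, following the appendix (blorr
package):

library(blorr)
gains <- blr_gains_table(Model1)                                   # decile-wise gains table
blr_ks_chart(gains, title = 'KS chart', ks_line_color = 'black')   # KS statistic and chart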

GINI and interpretation:


 Gini is measured in values between 0 and 1, where a score of
1 means that the model is 100% accurate in predicting the
outcome.
 A higher Gini is beneficial to the bottom line because requests can be
assessed more accurately, which means acceptance can be
increased and at less risk.

Gini = 2*AUC - 1; with AUC = 0.78:
Gini = (0.78 * 2) - 1 = 0.56

We see that the Gini is 0.56, which indicates a moderately adequate degree of
separation between churners and non-churners.

5. Data Normalization/Scaling for KNN and Naïve Bayes:

Data normalization here means rescaling the numeric attributes of the dataset
onto a common range. The goal is to stop variables measured on large scales
from dominating variables measured on small scales, which matters for
distance-based methods such as KNN and for comparing predictors on an equal
footing.

Here we use min-max scaling.

Min-max normalisation, often known as feature scaling, reduces the values of a
numeric feature to a scale between 0 and 1. Therefore, in order to calculate z,
the normalised value of a member x of the set of observed values, we employ the
following formula:

z = (x - min(x)) / (max(x) - min(x))
Fig 51 :Code: Data normalization

Fig 52 :Output: Data normalization
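
A minimal sketch of the min-max scaling the figures show, following the appendix
(the response in column 1 is excluded from scaling and keeps the awkward column
name "Data1[, 1]" after the cbind step):

Data1 <- read.csv('Cellphone.csv', header = TRUE, sep = ',')
norm  <- function(x) { (x - min(x)) / (max(x) - min(x)) }      # min-max scaling to [0, 1]
data_normalised <- as.data.frame(lapply(Data1[, -1], norm))    # scale every predictor
data <- cbind(Data1[, 1], data_normalised)                     # put the Churn column back
data$`Data1[, 1]` <- as.factor(data$`Data1[, 1]`)              # response back to a factor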

6.Data split for KNN and Naïve Bayes:

Fig 53 :Code: Data Split

Fig 54 :Output: Data Split

7. KNN (K Nearest Neighbours):


7.1 Applying KNN :

The KNN or k-nearest neighbors algorithm is one of the simplest machine


learning algorithms and is an example of instance-based learning, where
new data are classified based on stored, labeled instances.

More specifically, the distance between the stored data and the new
instance is calculated by means of some kind of a similarity measure. This
similarity measure is typically expressed by a distance measure such as
the Euclidean distance, cosine similarity or the Manhattan distance.

In other words, the similarity to the data that was already in the system is
calculated for any new data point that you input into the system.

Fig 55 :Code: KNN

INTERPRETATION:
trainControl: controls the computational nuances of the train() function.
repeatedcv: the cross-validation method (repeated k-fold cross-validation).
repeats: the number of times the cross-validation is repeated.
Optimum method for finding the value of K:

Fig 56 :Output: method to find K

 From Fig 56 we can see that the train dataset is cross-validated with
10 folds, repeated 3 times.
 Resampling of the dataset gives the optimum value of K, with an
accuracy rate of 90.28%.

7.2 INTERPRETATION-KNN MODEL:

Fig 57 :Code: KNN-Prediction

Fig 58 :Output: KNN-Prediction

From the above prediction we can see that for k = 9 the model predicts
Churn = 0 for 920 customers, meaning they won't leave, and Churn = 1 for
79 customers, meaning they churn out.

CONFUSION MATRIX INTERPRETATION:

Fig 59 :Code: KNN-CM

Fig 60 :Output: KNN-CM

 The positive prediction rate is found to be 0.73, which is less than the
accuracy rate of 0.89.
 The negative prediction rate is found to be 0.90.
 Of course this model can be improved through better selection of
predictors and their interaction effects, but the general case is the
worst performer.
 The accuracy seems good at 0.89. This again creates the accuracy paradox:
overall accuracy is high even though the positive prediction rate is lower.

8. Naïve Bayes:
8.1 Applying Naïve Bayes:
 Naïve Bayes classifiers are a family of simple "probabilistic
classifiers" based on applying Bayes' theorem with strong
(naïve) independence assumptions between the features. They are
among the simplest Bayesian network models.
 It is based upon Bayes' theorem:

P(A|B) = P(B|A) * P(A) / P(B)

 Using Bayes' theorem, we can find the probability of A happening, given
that B has occurred. Here, B is the evidence and A is the hypothesis. The
assumption made here is that the predictors/features are independent, that
is, the presence of one particular feature does not affect the others.
Hence it is called naive.

BUILDING NAÏVE BAYES MODEL:


 The question is whether a Naïve Bayes model can be built with the given
dataset. The given dataset consists of both categorical and numerical
variables.
 We know that the Naïve Bayes classifier works well mainly on categorical
values.
 Naïve Bayes works best with categorical values but can be made to work
on mixed datasets having continuous as well as categorical variables as
predictors, as in the cellphone dataset.
 Since this algorithm runs on conditional probabilities, it becomes very
hard to bin the continuous variables, as they have no frequencies but a
continuous scale.
 Moreover, the model can be created with a mixture of both categorical and
numerical values, but the accuracy is lower than when it is created with
only categorical values.
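
A minimal sketch of the Naïve Bayes fit and evaluation shown in the next
figures, following the appendix (e1071; the normalised data and its response
column name are reused from the earlier split):

library(e1071)
NB.fit  <- naiveBayes(`Data1[, 1]` ~ ., data = Train_Data)   # fit on the normalised training data
NB.fit                                                       # per-class mean and sd of each predictor
NB.pred <- predict(NB.fit, Test_Data)                        # predicted classes on the test set
confusionMatrix(NB.pred, Test_Data$`Data1[, 1]`, positive = '1')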

Fig 61 :Code: Naïve Bayes

Fig 62 :Output: Naïve Bayes

8.2 INTERPRETATION:
 Naïve Bayes works best with categorical values but can be made to work on
mixed datasets having continuous as well as categorical variables as
predictors, as in the cellphone dataset.
 Since this algorithm runs on conditional probabilities, it becomes very
hard to bin the continuous variables, as they have no frequencies but a
continuous scale.
 For continuous variables, what NB does is take their mean and standard
deviation (or variability) and treat them as cut-off thresholds; say
anything less than the mean of the distributed predictor values is 0 and
anything more than the mean is 1.
 The above suits a binary classifier; however, if we have multinomial
response categories, it will have to use quantiles, deciles or other
n-tiles, partitioning the data accordingly and assigning the probabilities.
 Based on the above, NB's performance on a mixed dataset is always
questionable in terms of accuracy.
 Its findings and predictions need to be supported by other classifiers
before any actionable operations.
 The output of the NB model is displayed in matrix format: for each
predictor, its mean [,1] and standard deviation [,2] for class 1 and class 0.
 The independence of predictors (no multicollinearity) has been assumed
for the sake of simplicity.

CONFUSION MATRIX INTERPRETATION:

Fig 63 :Code: Naïve Bayes-CM

Fig 64: Output: Naïve Bayes-CM

INTERPRETATION:
Positive prediction value: 62.3%
Negative prediction value: 89.1%
Accuracy: 87.3%

9. CONFUSION MATRIX INTERPRETATION OF ALL MODELS:

Parameter             Logistic Regression   KNN (K Nearest Neighbours)   Naïve Bayes
Accuracy              0.8569                0.8929                       0.8729
Positive prediction   0.5082                0.7341                       0.6231
Negative prediction   0.8795                0.9065                       0.8914
Sensitivity           0.2152                0.4027                       0.2986
Specificity           0.9649                0.9754                       0.9695

From the above table we can see that the KNN model has good accuracy (0.89)
with a positive prediction rate of 0.73, which looks better than the other
models for real-world scenarios.

10. INSIGHTS FROM EACH MODEL'S (LR, KNN, NB) VALIDATION:

 From building and evaluating the various models (logistic regression,
KNN, Naïve Bayes) we can see that KNN has a good accuracy rate of 89.2%
and a positive prediction rate of about 73.4% when compared to the other
models, Naïve Bayes and logistic regression.

 Of course this model (KNN) can be improved through better selection of
predictors and their interaction effects, but the general case is the
worst performer.
 The logistic regression model also suffers from the accuracy paradox: if
the threshold probability is decreased from 0.5 to, say, 0.2 or 0.1, more
cases will fall in the churner category (1).
 Logistic regression also performs poorly as a general model, with a
positive prediction rate of 50.8% and a sensitivity of just 21.52%.
 Naïve Bayes works best with categorical values but can be made to work
on mixed datasets having continuous as well as categorical variables as
predictors, as in the cellphone dataset.
 Since this algorithm runs on conditional probabilities, it becomes very
hard to bin the continuous variables, as they have no frequencies but a
continuous scale.
 For continuous variables, what NB does is take their mean and standard
deviation (or variability) and treat them as cut-off thresholds; say
anything less than the mean of the distributed predictor values is 0 and
anything more than the mean is 1.

11. ACTIONABLE INSIGHTS AND RECOMMENDATIONS:

For Naïve Bayes:

 Firstly, all variables ideally need to be categorical in nature.
 Secondly, if continuous variables are present, proper methods need to be
applied to normalize and scale them.
 Based on the above, NB's performance on a mixed dataset is always
questionable in terms of accuracy.
 Naïve Bayes has no parameters to tune.

For KNN:
 k-NN performs the best, with a positive prediction rate of 73.4% in the
general-case model, where the formula takes all 10 predictors irrespective
of their type, whether continuous or categorical.
 Further, the model can be tuned to improve prediction and accuracy.
For Logistic Regression:
 For logistic regression, all variables need to be independent of each
other.
 The model can be tuned using performance metrics for better predictions.

12. CONCLUSION:

MODEL                 POSITIVE PREDICTION   ACCURACY
Logistic Regression   50.8%                 85.6%
KNN                   73.4%                 89.2%
Naïve Bayes           62.3%                 87.2%

 k-NN performs the best, with a positive prediction rate of 73.4% in the
general-case model, where the formula takes all 10 predictors irrespective
of their type, whether continuous or categorical.
 The intended, or any refined/tuned, target model should be able to catch
the churners based on the data provided. Of course, the dataset is lopsided
in favour of non-churners rather than our intended target of finding
churners based on the behaviour hidden in the dataset.

 Naïve Bayes has no parameters to tune, but k-NN and logistic regression
can be improved by fine-tuning the train-control parameters and also by
deploying an up/down-sampling approach for logistic regression to
counteract the class imbalance.

APPENDIX

#### Setting up environment

setwd('C:\\Users\\Viswanathan\\Desktop\\pgp-babi')

getwd()

#### Importing the important Libraries

library(DataExplorer)

library(readxl)

library(corrplot)

library(caTools)

library(gridExtra)

library(rpart)

library(rpart.plot)

library(randomForest)

library(data.table)

library(ROCR)

library(ineq)

library(InformationValue)

library(caret)

library(e1071)

library(car)

library(caret)

library(class)

library(devtools)

library(e1071)

library(ggplot2)

library(Hmisc)

library(klaR)

library(MASS)

library(nnet)

library(plyr)

library(pROC)

library(psych)

library(scatterplot3d)

library(SDMTools)

library(dplyr)

library(ElemStatLearn)

library(neuralnet)

library(rms)

library(gridExtra)

#### Reading in the files

Data <-read.csv('Cellphone.csv',header = TRUE,sep=',')

attach(Data)

#### To see the first 6 and last 6 data

head(Data)

tail(Data)

#### Basic data summary

describe(Data)

summary(Data)

str(Data)

#### Converting Churn,Contract Renewal and Data plan as factors

Data$Churn=as.factor(Data$Churn)

Data$ContractRenewal=as.factor(Data$ContractRenewal)

Data$DataPlan=as.factor(Data$DataPlan)

str(Data)

#### Checking for missing data and removing them

any(is.na(Data))

colSums(is.na(Data))

anyNA(Data)

#### Uni-variate Analysis

#### Histogram distribution of dataset

par(mfrow=c(3,3))

plot_histogram(Data) #Numerical

#Categorical

plot_bar(Churn)

plot_bar(ContractRenewal)

plot_bar(DataPlan)

names(Data)

par(mar=c(2,2,2,2))

par(mfrow=c(4,4))

hist(Churn,xlab='Churn')

boxplot(Churn,horizontal = TRUE,main='Boxplot of churn',xlab='Churn')

hist(AccountWeeks,xlab='AccountWeeks')

boxplot(AccountWeeks,horizontal = TRUE,main='Boxplot of Accountweeks',xlab='Accountweeks')

hist(ContractRenewal,xlab='Contract renewal')

boxplot(ContractRenewal,horizontal = TRUE,main='Boxplot of Contract renewal',xlab='Contract renewal')

hist(DataPlan,xlab='Data plan')

boxplot(DataPlan,horizontal = TRUE,main='Boxplot of Dataplan',xlab='Dataplan')

hist(DataUsage,xlab='Datausage')

boxplot(DataUsage,horizontal = TRUE,main='Boxplot of Data usage',xlab='Data usage')

hist(CustServCalls,xlab='Custservcalls')

boxplot(CustServCalls,horizontal = TRUE,main='Boxplot of Custservcall',xlab='Custservcall')

hist(DayMins,xlab='Daymins')

boxplot(DayMins,horizontal = TRUE,main='Boxplot of Daymins',xlab='Daymins')

hist(DayCalls,xlab='Daycalls')

boxplot(DayCalls,horizontal = TRUE,main='Boxplot of Daycalls',xlab='Daycalls')

hist(MonthlyCharge,xlab='Monthlycharge')

boxplot(MonthlyCharge,horizontal = TRUE,main='Boxplot of monthly charge',xlab='Monthly charge')

hist(OverageFee,xlab='overagefee')

boxplot(OverageFee,horizontal = TRUE,main='Boxplot of overagefee',xlab='Overagefee')

hist(RoamMins,xlab='RoamMins')

boxplot(RoamMins,horizontal = TRUE,main='Boxplot of Roam mins',xlab='Roam mins')

#### Density plots of variables

plot_density(Data,geom_density_args = list(fill='gold',alpha=0.4))

#### Bivariate

library(gridExtra)

p1 = ggplot(Data, aes(AccountWeeks, fill=Churn)) + geom_density(alpha=0.4)

p2 = ggplot(Data, aes(MonthlyCharge, fill=Churn)) + geom_density(alpha=0.4)

p3 = ggplot(Data, aes(CustServCalls, fill=Churn))+geom_bar(position = "dodge")

p4 = ggplot(Data, aes(RoamMins, fill=Churn)) + geom_histogram(bins = 50, color=c("red"))

grid.arrange(p1,p2,p3,p4)

### In depth analysis of Bi-variate:

## AccountWeeks Vs Churn

d1=Data$AccountWeeks[Data$Churn==1]

mean(d1)

d2=Data$AccountWeeks[Data$Churn==0]

mean(d2)

## DayMinutes Vs Churn

d3=Data$DayMins[Data$Churn==1]

mean(d3)

d4=Data$DayMins[Data$Churn==0]

mean(d4)

## DayMinutes Vs Churn

d5=Data$DataUsage[Data$Churn==1]

mean(d5)

d6=Data$DataUsage[Data$Churn==0]

mean(d6)

names(Data)

#### Categorical

p6=ggplot(Data,aes(x=ContractRenewal))+geom_bar(aes(fill=Churn))

p7=ggplot(Data,aes(x=DataPlan))+geom_bar(aes(fill=Churn))

p8=ggplot(Data,aes(x=CustServCalls))+geom_bar(aes(fill=Churn))

grid.arrange(p6,p7,p8)

#### Outlier and its treatment

plot_boxplot(Data,by='Churn',geom_boxplot_args = list('outlier.color'='red',fill='blue'))

## Data usage has many outliers for churn(class 1)

## Day mins and Monthly charge has outliers for churn(class 0)

outlier1=boxplot(DataUsage)$out

print(outlier1)

which(DataUsage %in% outlier1)

outlier2=boxplot(DayMins)$out

print(outlier2)

which(DayMins %in% outlier2)

outlier3=boxplot(MonthlyCharge)$out

print(outlier3)

which(MonthlyCharge %in% outlier3)

#### Insight for Multicollinearity

cell_numeric=Data %>% select_if(is.numeric)

a=round(cor(cell_numeric),2)

corrplot(a)

##Data suggests there is very strong correlation between Monthly charges

#and data usage which is quite obvious .

##So we can replace one variable with another after evaluation

names(Data)

### VIF

lr=read.csv('Cellphone.csv',header = TRUE,sep=',')

LR=lm(Churn~., data=lr)

summary(LR)

vif(LR)

### Removing Datausage

Data=Data[,-5]

str(Data)

#### Splitting dataset - Train and Test

set.seed(332)

split=createDataPartition(Data$Churn,p=0.7,list=FALSE)

train_Data=Data[split,]

test_Data=Data[-split,]

dim(train_Data)

dim(test_Data)

#### LOGISTIC REGRESSION ####

Model1=glm(train_Data$Churn~., data=train_Data,family='binomial')

summary(Model1)

## Contract renewal and data plan has negative impact on customer churn

## checking for varience inflation factor

vif(Model1)

?anova

### CHI SQUARED TEST-to check significant predictors

anova(Model1,test='Chisq')

#Dataplan,Daymins,Monthlycharge need to be cured as vif is greater than 5

Model1$coefficients

## Likelihood ratio

lh=exp(Model1$coefficients)

print(lh) ## odds ratio of AccountWeeks is 1.0018: a one-unit increase

# multiplies the odds of churn by 1.0018 (about a 0.18% increase in the odds)

prob = exp(coef(Model1)) / (1 + exp(coef(Model1)))  # probability = odds / (1 + odds)

prob

## Interpretation

Model1_pred = predict(Model1,newdata = test_Data,type='response')

Model1_predicted=ifelse(Model1_pred>0.5,1,0)

#Factor conversion

Model1_predicted_factor=factor(Model1_predicted,levels = c(0,1))

head(Model1_predicted_factor)

## Confusion matrix

Model1.CM=confusionMatrix(Model1_predicted_factor,test_Data$Churn,positive='1')

Model1.CM

## ROC curve

LR_pred=predict(Model1,newdata = test_Data,type='response')

rocr_pred=prediction(LR_pred,test_Data$Churn)

perf=performance(rocr_pred,'tpr','fpr')

plot(perf)

plot(perf,colorize=TRUE,print.cutoffs.at=seq(0,1,0.05),text.adj=c(-0.2,1.7))

as.numeric(performance(rocr_pred,'auc')@y.values)

## KS

library(blorr)

ks=blr_gains_table(Model1)

blr_ks_chart(ks,title='KS chart',ks_line_color='black')

##GINI

LR_gini=Gini(LR_pred,Churn)

LR_gini

names(Data)

##### NORMALIZED DATA FOR KNN AND NAIVE BAYES #####

Data1=read.csv('Cellphone.csv',header = TRUE,sep=',')

norm=function(x){(x-min(x))/(max(x)-min(x))}

data_normalised=as.data.frame(lapply(Data1[,-1],norm))

library(tibble)

view(data_normalised)

data=cbind(Data1[,1],data_normalised)

str(data)

data$`Data1[, 1]`=as.factor(data$`Data1[, 1]`)

data$ContractRenewal=as.factor(data$ContractRenewal)

data$DataPlan=as.factor(data$DataPlan)

str(data)

#### Splitting dataset - Train and Test

set.seed(332)

split=createDataPartition(data$`Data1[, 1]`,p=0.7,list=FALSE)

Train_Data=data[split,]

Test_Data=data[-split,]

dim(Train_Data)

dim(Test_Data)

str(Train_Data)

attach(Train_Data)

### KNN ###

set.seed(2020)

ctrl=trainControl(method='repeatedcv',repeats = 3)

knn.fit=train(`Data1[, 1]`~.,data=Train_Data,method='knn',trControl=ctrl

,preProcess=c('center','scale'),tuneLength=10)

knn.fit

Model2=knn(Train_Data[,-1],Test_Data[,-1],Train_Data[,1],k=9)

summary(Model2)

knn_table=table(Test_Data[,1],Model2)

knn_table

sum(diag(knn_table)/sum(knn_table))

knn.cm=confusionMatrix(Model2,Test_Data$`Data1[, 1]`,positive='1')

knn.cm

#### Naive bayes ####

NB.fit=naiveBayes(Train_Data$`Data1[, 1]`~.,data=Train_Data)

NB.fit

NB.pred=predict(NB.fit,Test_Data)

NB.pred

NB.cm=confusionMatrix(NB.pred,Test_Data$`Data1[, 1]`,positive='1')

NB.cm
