Predictive Modelling
By:
Pranav Viswanathan
1. Initial Discovery
1.1. Initial analysis
Customer churn is a major concern for telecom companies. It is an important
indicator of what the future holds for the company: even a company that has
grown rapidly for a long period can be left behind once its customers start
leaving. Studying churn helps to identify the major reasons a customer leaves
a particular provider.
One of the hardest problems in this situation is recognising when and why
churn occurred: a customer who was once regular and loyal is now a customer
of a competing provider.
Tracking churn over time helps to study the pattern of customers leaving,
their reasons for leaving, and how to balance the losses with new customers.
DataUsage Gigabytes of monthly data usage
CustServCalls Number of calls into customer service
DayMins Average daytime minutes per month
DayCalls Average number of daytime calls
MonthlyCharge Average monthly bill
OverageFee Largest overage fee in last 12 months
RoamMins Average number of roaming minutes
From the table above we can see that there are a total of 11 variables, of
which 10 are predictors and Churn is the dependent variable.
The working directory is set with the help of setwd(), which establishes the
working environment, and getwd() confirms it.
Fig-1 and Fig-2 above show the step and output of setting up the working
environment in RStudio.
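A minimal sketch of this step, using the working-directory path from the appendix:
setwd('C:\\Users\\Viswanathan\\Desktop\\pgp-babi')   # set the working directory
getwd()                                              # confirm the current directory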
FIG 3: Importing Libraries
FIG 5 :Reading dataset
3.1 Data preparation
3.1.1 Initial data exploration
3.1.1 A. Top and Bottom Rows
Here the data is read and its characteristics, such as the structure, summary,
and top and bottom 6 rows, are viewed for further analysis.
From Fig 7 and Fig 8 we can see the top and bottom 6 rows of the dataset.
This alone does not give proper insight into the dataset, so we explore
further to understand its characteristics.
str() compactly displays the internal structure of the dataset, showing
information on one line for each component; it is particularly useful for
displaying the contents of lists.
From Fig 10 above we see that there are 11 variables with 3333 observations.
We can also see that Churn, ContractRenewal and CustServCalls are
categorical variables stored as numeric codes, that is, numbers drawn from a
small set of repeated values:
Churn with 0's and 1's
ContractRenewal with 0's and 1's
CustServCalls with 1, 2, 3, 4, 5
So it is better to convert them to factors.
Here the 3 variables Churn, ContractRenewal and CustServCalls are
categorical variables stored as numeric codes with repeated values, so we
convert them to factors.
To check whether they have been converted, we once again use str() to view
the structure of the dataset.
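A minimal sketch of the conversion, following the variables named above (the appendix converts DataPlan rather than CustServCalls):
Data$Churn = as.factor(Data$Churn)                       # 0/1 codes become factor levels
Data$ContractRenewal = as.factor(Data$ContractRenewal)
Data$CustServCalls = as.factor(Data$CustServCalls)
str(Data)                                                # verify the new classes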
FIG 13:Output: Summary 1
In univariate analysis we examine each variable on its own, plotting it
against its frequency or count.
FIG 15 : Output : Univariate 1
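FIG 15 is shown as an image; a minimal sketch of how such univariate plots can be produced with DataExplorer, mirroring the appendix:
library(DataExplorer)
plot_histogram(Data)        # histograms of the numeric variables
plot_bar(Data['Churn'])     # bar chart of a categorical variable (Churn must already be a factor)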
INSIGHTS:
From the histograms we see that all the variables are approximately normally
distributed except Churn, ContractRenewal and CustServCalls, which are
factors with different levels.
We see that MonthlyCharge has a great number of outliers.
We investigate further by plotting the above variables separately.
NUMERICAL DATA
FIG 17 :Output : Univariate 2
INSIGHTS:
AccountWeeks is normally distributed and slightly right skewed.
DataUsage is distributed according to the amount of data each customer uses.
The other variables are normally distributed about the mean.
3.1.3 Bi-variate Analysis:
INDEPENDENT: AccountWeeks, CustServCalls, ContractRenewal, DataUsage,
DataPlan, DayMins, DayCalls, MonthlyCharge, OverageFee, RoamMins.
DEPENDENT: Churn
Bi-variate analysis plots each independent variable against the dependent
variable.
NUMERICAL VARIABLE:
FIG 18 :Code : Bivariate 1
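FIG 18 shows the code as an image; the appendix calls grid.arrange(p1, p2, p3, p4) without defining p1-p4, so the following is an assumed sketch of what those bivariate histograms could look like:
library(ggplot2)
library(gridExtra)
# Histograms of continuous predictors, coloured by Churn
p1 = ggplot(Data, aes(x = AccountWeeks, fill = Churn)) + geom_histogram(bins = 30, alpha = 0.5)
p2 = ggplot(Data, aes(x = DayMins, fill = Churn)) + geom_histogram(bins = 30, alpha = 0.5)
p3 = ggplot(Data, aes(x = MonthlyCharge, fill = Churn)) + geom_histogram(bins = 30, alpha = 0.5)
p4 = ggplot(Data, aes(x = OverageFee, fill = Churn)) + geom_histogram(bins = 30, alpha = 0.5)
grid.arrange(p1, p2, p3, p4)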
INSIGHTS:
From the histograms, almost all continuous predictors such as AccountWeeks,
DayMins, OverageFee and RoamMins have roughly normal distributions.
MonthlyCharge has a distribution that is slightly left skewed, which can be
ignored.
Customers who churn and those who do not have mostly similar distributions
of AccountWeeks, with a mean of 102.6 (~103) weeks for Churn (1) and
100.7 (~101) weeks for Not Churn (0).
On average, customers who churn use more day minutes (207 mins) than those
who do not (175 mins).
Churning customers call customer service more often, falling in the 5-10 call
bracket versus the 0-5 call bracket for non-churners.
Monthly charges are also higher for churn customers than for non-churn
customers.
CATEGORICAL VARIABLE:
Here the plots are between Churn and the other categorical variables:
ContractRenewal, DataPlan and CustServCalls.
This is a categorical-versus-categorical comparison and helps to find the
relationship between these variables for further analysis.
FIG 21 Output : Bivariate 2
FIG 23 Output : OUTLIER 1
INSIGHTS:
From the above plot we can see that DataUsage has the largest number of
outliers, followed by MonthlyCharge and DayMins.
For further analysis, we visualize DataUsage, MonthlyCharge and DayMins
separately to see the position and values of these outliers.
DataUsage has many outliers among the churners (class 1).
DayMins and MonthlyCharge have many outliers in the non-churner category
(class 0).
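A minimal sketch of how these outlier values can be listed, following the boxplot code in the appendix:
# Values flagged as outliers by the boxplot rule (beyond 1.5 * IQR)
outlier_usage  = boxplot(Data$DataUsage, plot = FALSE)$out
outlier_mins   = boxplot(Data$DayMins, plot = FALSE)$out
outlier_charge = boxplot(Data$MonthlyCharge, plot = FALSE)$out
length(outlier_usage); length(outlier_mins); length(outlier_charge)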
DATAUSAGE:
FIG 24 Output : OUTLIER 2
DAYMINS:
MONTHLY CHARGE:
INSIGHTS:
For Churn = 0, MonthlyCharge has a lot of outliers compared with DayMins.
For Churn = 1, DataUsage has a lot of outliers.
The best way to treat these outliers is to apply scaling and normalization,
so that the models created are less prone to error and will not overfit.
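A minimal sketch of the min-max normalization applied later for the distance-based models, mirroring the norm() helper in the appendix:
norm = function(x) { (x - min(x)) / (max(x) - min(x)) }            # rescale to [0, 1]
numeric_cols = sapply(Data, is.numeric)
Data_scaled = as.data.frame(lapply(Data[ , numeric_cols], norm))   # normalize the numeric predictors
summary(Data_scaled)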
MISSING VALUES:
Missing data, or missing values, occur when no data value is stored for the
variable in an observation.
Missing data are a common occurrence and can have a significant effect
on the conclusions that can be drawn from the data.
INSIGHTS:
From the above figure we see that the data is free of missing values, hence
no treatment is needed on this front.
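A minimal sketch of the missing-value check, mirroring the appendix:
anyNA(Data)            # TRUE if any value in the dataset is missing
colSums(is.na(Data))   # count of missing values per column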
3.3 MULTICOLLINEARITY and its TREATMENT:
Multicollinearity occurs when the independent variables of a regression
model are correlated with one another. If the degree of collinearity between
the independent variables is high, it becomes difficult to estimate the
relationship between each independent variable and the dependent variable,
and the overall precision of the estimated coefficients suffers.
INSIGHTS:
The data suggests a very strong correlation between MonthlyCharge and
DataUsage, which is quite intuitive. So, after evaluation, we can keep one of
the two variables and drop the other.
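A minimal sketch of the correlation matrix behind this insight; the appendix uses an object cell_numeric that it never defines, so the definition below is an assumption (the numeric columns of the data):
library(corrplot)
cell_numeric = Data[ , sapply(Data, is.numeric)]   # assumed definition: numeric columns only
a = round(cor(cell_numeric), 2)                    # pairwise correlations, rounded
corrplot(a)                                        # visualize the correlation matrix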
TREATMENT:
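The treatment step is shown as a figure; a minimal sketch of the corresponding line from the appendix, which drops the fifth column (DataUsage, assuming the usual column order of the Cellphone dataset) because of its correlation with MonthlyCharge:
Data = Data[ , -5]   # drop the collinear column
str(Data)            # confirm 10 variables remain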
Outliers: DataUsage has many outliers among the churners (class 1), while
DayMins and MonthlyCharge have many outliers in the non-churner category
(class 0).
The dataset had zero missing values.
Univariate analysis:
It helped to see the structure of the dataset.
The dataset was approximately normally distributed about the mean.
Bi-variate analysis:
Almost all continuous predictors such as AccountWeeks, DayMins, OverageFee
and RoamMins have roughly normal distributions.
MonthlyCharge is slightly left skewed, which can be ignored.
Customers who churn and those who do not have mostly similar distributions
of AccountWeeks, with a mean of 102.6 (~103) weeks for Churn (1) and
100.7 (~101) weeks for Not Churn (0).
4. LOGISTIC REGRESSION
4.1 LOGISTIC REGRESSION MODEL:
DATA PREPARATION:
The dataset is split in a 70:30 ratio for building the model and testing its
predictions.
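A minimal sketch of the split, mirroring the appendix; createDataPartition() from caret keeps the proportion of churners similar in both parts:
library(caret)
set.seed(332)
split = createDataPartition(Data$Churn, p = 0.7, list = FALSE)   # indices of the 70% training rows
train_Data = Data[split, ]
test_Data  = Data[-split, ]
dim(train_Data); dim(test_Data)                                  # check the split sizes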
APPLYING LOGISTIC REGRESSION
FIG 36 Code : LR 1
FIG 37 Output : LR 1
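FIG 36 shows the code as an image; a minimal sketch of the fit based on the appendix:
Model1 = glm(Churn ~ ., data = train_Data, family = 'binomial')   # logistic regression on all predictors
summary(Model1)                                                   # coefficients and significance (FIG 37)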
VARIANCE INFLATION FACTOR (VIF):
FIG 38 : VIF
FIG 39 : Chisq 1
FIG 40 : Chisq 2
INSIGHTS:
From Fig 37 we can see that ContractRenewal, DataPlan, CustServCalls,
OverageFee and RoamMins are significant.
From Fig 38 we can see that DataPlan, DayMins, MonthlyCharge and OverageFee
have VIF values above 2, hence we carry out a chi-square test for further
analysis.
After carrying out the chi-square test to identify the significant
predictors, we see that OverageFee and DayCalls can be left out of the model.
Let's find out the odds and the probability with which each variable impacts
customer churn.
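A minimal sketch of this step; note that the probability formula needs parentheses around the denominator (the appendix writes exp(coef)/1+exp(coef), which divides by 1 first):
odds = exp(coef(Model1))                              # odds ratio for a one-unit change in each predictor
prob = exp(coef(Model1)) / (1 + exp(coef(Model1)))    # corresponding probabilities
round(cbind(odds, prob), 3)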
FIG 42 : Probability
From the above table, the entries highlighted in yellow have a negative
impact on Churn.
4.3 PREDICTION:
Since we have confirmed the importance of the additional significant
variables, let's check the performance of our model using a classification
table / confusion matrix.
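A minimal sketch of the prediction and confusion matrix; the appendix uses an object Model1_pred before defining it, so the first prediction line below is the assumed missing step:
library(caret)                                                         # for confusionMatrix()
Model1_pred = predict(Model1, newdata = test_Data, type = 'response')  # predicted churn probabilities
Model1_predicted = ifelse(Model1_pred > 0.5, 1, 0)                     # classify at the 0.5 threshold
Model1_predicted_factor = factor(Model1_predicted, levels = c(0, 1))
confusionMatrix(Model1_predicted_factor, test_Data$Churn, positive = '1')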
Fig 46 :Output: CM-LR
INTERPRETATION:
4.4 Interpretation of other Model Performance Measures for
logistic <KS, AUC, GINI>
ROC PLOT:
It is a plot of the True Positive Rate against the False Positive Rate for the
different possible cut-points of a diagnostic test.
An ROC curve demonstrates several things:
1. It shows the trade-off between sensitivity and specificity (any increase
in sensitivity will be accompanied by a decrease in specificity).
2. The closer the curve follows the left-hand border and then the top border
of the ROC space, the more accurate the test.
3. The closer the curve comes to the 45-degree diagonal of the ROC space, the
less accurate the test.
4. The slope of the tangent line at a cut-point gives the likelihood ratio
(LR) for that value of the test.
5. The area under the curve (AUC) is a measure of test accuracy.
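A minimal sketch of the ROC/AUC computation, mirroring the ROCR calls in the appendix:
library(ROCR)
LR_pred = predict(Model1, newdata = test_Data, type = 'response')
rocr_pred = prediction(LR_pred, test_Data$Churn)
perf = performance(rocr_pred, 'tpr', 'fpr')
plot(perf, colorize = TRUE)                           # ROC curve
as.numeric(performance(rocr_pred, 'auc')@y.values)    # area under the curve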
Fig 47 :AUC
The AUC, or area under the curve, is about 78%, i.e. the model has 78.6%
concordant pairs.
Fig 48 :ROC
INTERPRETATION:
Fig 49 :Code: KS
Fig 50 :Output: KS
Gini = 2*AUC - 1; with AUC = 0.78,
Gini = (2 * 0.78) - 1 = 0.56
We see that the Gini coefficient is 0.56, which indicates moderately adequate
separation when the threshold is set at 0.4.
Fig 51 :Code: Data normalization
Fig 54 :Output: Data Split
More specifically, the distance between the stored data and the new
instance is calculated by means of some kind of a similarity measure. This
similarity measure is typically expressed by a distance measure such as
the Euclidean distance, cosine similarity or the Manhattan distance.
In other words, the similarity to the data that was already in the system is
calculated for any new data point that you input into the system.
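As a small illustration of the distance idea (a toy sketch, not taken from the report), the Euclidean distance between a new customer and a stored one can be computed directly from their normalized predictor vectors:
x_new    = c(0.42, 0.10, 0.75)    # hypothetical normalized values for a new customer
x_stored = c(0.40, 0.25, 0.70)    # hypothetical stored customer
sqrt(sum((x_new - x_stored)^2))   # Euclidean distance between the two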
Fig 55 :Code: KNN
INTERPRETATION:
trainControl: controls the computational nuances of the train function.
repeatedcv: the cross-validation method (repeated k-fold).
repeats: the number of times the cross-validation is repeated.
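A minimal sketch of the tuning step, mirroring the appendix (the outcome column there is literally named `Data1[, 1]`; Churn is used here for readability, and number = 10 is caret's default for repeatedcv):
library(caret)
set.seed(2020)
ctrl = trainControl(method = 'repeatedcv', number = 10, repeats = 3)   # 10-fold CV repeated 3 times
knn.fit = train(Churn ~ ., data = Train_Data, method = 'knn',
                trControl = ctrl, preProcess = c('center', 'scale'), tuneLength = 10)
knn.fit   # reports accuracy for each candidate k; the best k is selected automatically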
From Fig 56 we can see that the training dataset is cross-validated with 10
folds, repeated 3 times.
Resampling the dataset gives the optimum value of k, with an accuracy rate of
90.28%.
From the prediction above we can see that for k = 9 the model predicts
churn = 0 for 920 customers, meaning they won't leave, and churn = 1 for 79
customers, meaning they churn out.
CONFUSION MATRIX INTERPRETATION:
8. Naïve Bayes:
8.1 Applying Naïve Bayes:
Naïve Bayes classifiers are a family of simple "probabilistic
classifiers" based on applying Bayes' theorem with strong
(naïve) independence assumptions between the features. They are
among the simplest Bayesian network models.
It is based upon Bayes' theorem.
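The original report shows the formula as an image; the standard statement of Bayes' theorem applied to churn is:
P(Churn | x) = P(x | Churn) * P(Churn) / P(x)
where x is the vector of predictor values, and the naïve assumption is that the predictors in x are conditionally independent given the class.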
8.2 INTERPRETATION:
Naïve Bayes works best with categorical predictors but can be made to work
on mixed datasets that have continuous as well as categorical variables as
predictors, like the cellphone dataset.
Since this algorithm runs on conditional probabilities, it is hard to bin the
continuous variables, as they have no discrete frequencies but lie on a
continuous scale.
For continuous variables, NB takes their class-conditional mean and standard
deviation (variability) and uses these to score new values, so that, roughly
speaking, values close to a class's mean are treated as more likely to belong
to that class.
This works naturally for a binary classifier; if the response were
multinomial, the data would effectively have to be partitioned into
quantiles, deciles or other n-iles and probabilities assigned accordingly.
Because of this treatment of mixed data, NB's accuracy here is always
somewhat questionable.
Its findings and predictions need to be supported by other classifiers before
any actionable decisions are taken.
The output of the NB model is displayed in matrix format: for each numeric
predictor it shows the mean [,1] and standard deviation [,2] for class 0 and
class 1.
The independence of predictors (no multicollinearity) has been assumed for
the sake of simplicity.
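A minimal sketch of the NB fit and prediction, mirroring the appendix (the outcome column there is named `Data1[, 1]`; Churn is used here for readability):
library(e1071)
NB.fit = naiveBayes(Churn ~ ., data = Train_Data)   # class priors plus per-class mean/sd for numeric predictors
NB.fit
NB.pred = predict(NB.fit, Test_Data)                # predicted classes for the test set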
CONFUSION MATRIX INTERPRETATION:
Positive predictive value: 62.3%
Negative predictive value: 89.1%
Accuracy: 87.9%
9. CONFUSION MATRIX INTERPRETATION OF ALL MODELS:
Parameter    Logistic Regression    KNN (K Nearest Neighbors)    Naïve Bayes
Accuracy     0.8569                 0.8929                       0.8729
From the above table we can see that the KNN model has good accuracy (0.89)
and a positive predictive value of 0.73, which compares favourably with the
other models in real-world scenarios.
Of course this model (KNN) can be improved through better selection of
predictors and their interaction effects; here only the general-case model
has been fitted.
In the case of Logistic Regression, the LR model also suffers from the
accuracy paradox: if the threshold probability is decreased from 0.5 to, say,
0.2 or 0.1, then more cases fall into the churner category (1).
Logistic Regression also performs poorly as a general model, with a positive
predictive value of 50.8% and a sensitivity of just 21.52%.
As discussed in section 8.2, Naïve Bayes works best with categorical
predictors; on a mixed dataset such as this one it handles the continuous
predictors through their class-conditional means and standard deviations, so
its accuracy is more questionable.
For KNN:
k-NN performs the best, with a positive predictive rate of 81% in the
general-case model, where the formula takes all 10 predictors irrespective
of whether they are continuous or categorical.
Further, the model can be tuned to improve prediction and accuracy.
For Logistic Regression:
For logistic regression, all predictors need to be independent of each
other.
The model can be tuned using performance metrics for better predictions.
12. CONCLUSION:
MODEL                  POSITIVE PREDICTION    ACCURACY
Logistic Regression    50.8%                  85.6%
KNN                    73.4%                  89.2%
Naïve Bayes            62.3%                  87.2%
k-NN performs the best, with a positive predictive rate of 73.4% in the
general-case model, where the formula takes all 10 predictors irrespective of
whether they are continuous or categorical.
The intended, or any refined/tuned, target model should be able to catch the
churners based on the data provided. Of course, the dataset is lopsided in
favour of non-churners, which works against our intended target of finding
churners from the behaviour hidden in the dataset.
Naïve Bayes has no parameters to tune, but k-NN and logistic regression can
be improved by fine-tuning the train-control parameters, and an up/down
sampling approach can be deployed for logistic regression to counteract the
class imbalance.
APPENDIX
setwd('C:\\Users\\Viswanathan\\Desktop\\pgp-babi')
getwd()
library(DataExplorer)
library(readxl)
library(corrplot)
library(caTools)
library(gridExtra)
library(rpart)
library(rpart.plot)
library(randomForest)
library(data.table)
library(ROCR)
library(ineq)
library(InformationValue)
library(caret)
library(e1071)
library(car)
library(caret)
library(class)
library(devtools)
library(e1071)
library(ggplot2)
library(Hmisc)
library(klaR)
library(MASS)
library(nnet)
library(plyr)
library(pROC)
library(psych)
library(scatterplot3d)
library(SDMTools)
library(dplyr)
library(ElemStatLearn)
library(neuralnet)
library(rms)
library(gridExtra)
# The original appendix never reads the data before attach(); the same file is read later for the models,
# so the same read is assumed here
Data=read.csv('Cellphone.csv',header = TRUE,sep=',')
attach(Data)
head(Data)
tail(Data)
describe(Data)
summary(Data)
str(Data)
Data$Churn=as.factor(Data$Churn)
Data$ContractRenewal=as.factor(Data$ContractRenewal)
Data$DataPlan=as.factor(Data$DataPlan)
str(Data)
any(is.na(Data))
colSums(is.na(Data))
anyNA(Data)
par(mfrow=c(3,3))
plot_histogram(Data) #Numerical
#Categorical
# Use the converted columns of Data (the attached copies were taken before the factor conversion);
# plot_bar() works on data frames, so pass one-column data frames
plot_bar(Data['Churn'])
plot_bar(Data['ContractRenewal'])
plot_bar(Data['DataPlan'])
names(Data)
par(mar=c(2,2,2,2))
par(mfrow=c(4,4))
hist(Churn,xlab='Churn')
hist(AccountWeeks,xlab='AccountWeeks')
hist(ContractRenewal,xlab='Contract renewal')
hist(DataPlan,xlab='Data plan')
hist(DataUsage,xlab='Datausage')
hist(CustServCalls,xlab='Custservcalls')
hist(DayMins,xlab='Daymins')
hist(DayCalls,xlab='Daycalls')
hist(MonthlyCharge,xlab='Monthlycharge')
hist(OverageFee,xlab='overagefee')
hist(RoamMins,xlab='RoamMins')
plot_density(Data,geom_density_args = list(fill='gold',alpha=0.4))
#### Bivariate
library(gridExtra)
# p1-p4 are not defined anywhere in the original appendix; assumed here to be histograms split by Churn
p1=ggplot(Data,aes(x=AccountWeeks,fill=Churn))+geom_histogram(bins=30,alpha=0.5)
p2=ggplot(Data,aes(x=DayMins,fill=Churn))+geom_histogram(bins=30,alpha=0.5)
p3=ggplot(Data,aes(x=MonthlyCharge,fill=Churn))+geom_histogram(bins=30,alpha=0.5)
p4=ggplot(Data,aes(x=OverageFee,fill=Churn))+geom_histogram(bins=30,alpha=0.5)
grid.arrange(p1,p2,p3,p4)
## AccountWeeks Vs Churn
d1=Data$AccountWeeks[Data$Churn==1]
mean(d1)
d2=Data$AccountWeeks[Data$Churn==0]
mean(d2)
## DayMinutes Vs Churn
d3=Data$DayMins[Data$Churn==1]
mean(d3)
d4=Data$DayMins[Data$Churn==0]
mean(d4)
## DataUsage Vs Churn
d5=Data$DataUsage[Data$Churn==1]
mean(d5)
d6=Data$DataUsage[Data$Churn==0]
mean(d6)
names(Data)
#### Categorical
p6=ggplot(Data,aes(x=ContractRenewal))+geom_bar(aes(fill=Churn))
p7=ggplot(Data,aes(x=DataPlan))+geom_bar(aes(fill=Churn))
p8=ggplot(Data,aes(x=CustServCalls))+geom_bar(aes(fill=Churn))
grid.arrange(p6,p7,p8)
plot_boxplot(Data,by='Churn',geom_boxplot_args = list('outlier.color'='red',fill='blue'))
outlier1=boxplot(DataUsage)$out
print(outlier1)
outlier2=boxplot(DayMins)$out
print(outlier2)
outlier3=boxplot(MonthlyCharge)$out
print(outlier3)
#### Insight for Multicollinearity
cell_numeric=Data[,sapply(Data,is.numeric)]  # assumed definition: numeric columns only (missing in the original appendix)
a=round(cor(cell_numeric),2)
corrplot(a)
names(Data)
### VIF
lr=read.csv('Cellphone.csv',header = TRUE,sep=',')
LR=lm(Churn~., data=lr)   # linear model on the raw 0/1 response, used only to obtain VIFs
summary(LR)
vif(LR)
Data=Data[,-5]            # drop the collinear column (column 5) identified above
str(Data)
set.seed(332)
split=createDataPartition(Data$Churn,p=0.7,list=FALSE)
train_Data=Data[split,]
test_Data=Data[-split,]
dim(train_Data)
dim(test_Data)
Model1=glm(Churn~., data=train_Data,family='binomial')   # use the bare column name so '.' does not also include Churn as a predictor
summary(Model1)
## Contract renewal and data plan has negative impact on customer churn
## checking for varience inflation factor
vif(Model1)
?anova
anova(Model1,test='Chisq')
Model1$coefficients
## Odds ratios and probabilities
lh=exp(Model1$coefficients)                   # odds ratios
prob=exp(coef(Model1))/(1+exp(coef(Model1)))  # probabilities (parentheses added around the denominator)
prob
## Predictions on the test set
Model1_pred=predict(Model1,newdata = test_Data,type='response')  # this line was missing from the original appendix
Model1_predicted=ifelse(Model1_pred>0.5,1,0)
#Factor conversion
Model1_predicted_factor=factor(Model1_predicted,levels = c(0,1))
head(Model1_predicted_factor)
## Confusion matrix
Model1.CM=confusionMatrix(Model1_predicted_factor,test_Data$Churn,positive='1')
Model1.CM
## ROC curve
LR_pred=predict(Model1,newdata = test_Data,type='response')
rocr_pred=prediction(LR_pred,test_Data$Churn)
perf=performance(rocr_pred,'tpr','fpr')
plot(perf)
plot(perf,colorize=TRUE,print.cutoffs.at=seq(0,1,0.05),text.adj=c(-0.2,1.7))
as.numeric(performance(rocr_pred,'auc')@y.values)
## KS
library(blorr)
ks=blr_gains_table(Model1)
blr_ks_chart(ks,title='KS chart',ks_line_color='black')
##GINI
# Note: ineq::Gini() takes a single vector; the two-argument call below assumes a
# Gini(y_pred, y_true) implementation such as the one in MLmetrics
LR_gini=Gini(LR_pred,Churn)
LR_gini
names(Data)
Data1=read.csv('Cellphone.csv',header = TRUE,sep=',')
norm=function(x){(x-min(x))/(max(x)-min(x))}
data_normalised=as.data.frame(lapply(Data1[,-1],norm))
library(tibble)
view(data_normalised)
data=cbind(Data1[,1],data_normalised)
str(data)
data$ContractRenewal=as.factor(data$ContractRenewal)
data$DataPlan=as.factor(data$DataPlan)
data$`Data1[, 1]`=as.factor(data$`Data1[, 1]`)   # treat Churn as a factor so caret and confusionMatrix() do classification (added; not in the original appendix)
str(data)
set.seed(332)
split=createDataPartition(data$`Data1[, 1]`,p=0.7,list=FALSE)
Train_Data=data[split,]
Test_Data=data[-split,]
dim(Train_Data)
dim(Test_Data)
str(Train_Data)
attach(Train_Data)
set.seed(2020)
ctrl=trainControl(method='repeatedcv',repeats = 3)
knn.fit=train(`Data1[, 1]`~.,data=Train_Data,method='knn',trControl=ctrl
,preProcess=c('center','scale'),tuneLength=10)
knn.fit
# data.matrix() turns the factor columns into numeric codes, since class::knn() needs a numeric matrix
Model2=knn(data.matrix(Train_Data[,-1]),data.matrix(Test_Data[,-1]),Train_Data[,1],k=9)
summary(Model2)
knn_table=table(Test_Data[,1],Model2)
knn_table
sum(diag(knn_table)/sum(knn_table))
knn.cm=confusionMatrix(Model2,Test_Data$`Data1[, 1]`,positive='1')
knn.cm
NB.fit=naiveBayes(Train_Data$`Data1[, 1]`~.,data=Train_Data)
NB.fit
NB.pred=predict(NB.fit,Test_Data)
NB.pred
NB.cm=confusionMatrix(NB.pred,Test_Data$`Data1[, 1]`,positive='1')
NB.cm