Analysis
Analyst
2025-05-28
Each churned customer represents a loss in long-term revenue, and possibly signals a broader issue in service,
satisfaction, or risk management. The goal is to identify patterns that predict churn before it happens, so the
bank can take preventive action — such as personal follow-up, offers, or upgrades.
This makes it a binary classification task where:
- Target variable: Churn (1 = customer left, 0 = customer stayed)
- Input variables: age, income, credit usage, transaction patterns, etc.

A successful model would allow the bank to:
- Detect customers at risk of churn
- Intervene early and improve retention
2.3 Modelling
Select and train two machine learning models (e.g. Logistic Regression and Random Forest)
Evaluate each using key metrics: Accuracy, Precision, Recall, F1-Score
Discuss differences in model logic and behavior
Attempt hyperparameter tuning or ensembling to improve results
library(tidyverse)  # loads readr (read_csv), dplyr, tidyr, and ggplot2 used throughout

df <- read_csv("BankChurners.csv")
head(df)
## # A tibble: 6 × 23
## CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level
## <dbl> <chr> <dbl> <chr> <dbl> <chr>
## 1 768805383 Existing Custom… 45 M 3 High School
## 2 818770008 Existing Custom… 49 F 5 Graduate
## 3 713982108 Existing Custom… 51 M 3 Graduate
## 4 769911858 Existing Custom… 40 F 4 High School
## 5 709106358 Existing Custom… 40 M 3 Uneducated
## 6 713061558 Existing Custom… 44 M 2 Graduate
## # ℹ 17 more variables: Marital_Status <chr>, Income_Category <chr>,
## # Card_Category <chr>, Months_on_book <dbl>, Total_Relationship_Count <dbl>,
## # Months_Inactive_12_mon <dbl>, Contacts_Count_12_mon <dbl>,
## # Credit_Limit <dbl>, Total_Revolving_Bal <dbl>, Avg_Open_To_Buy <dbl>,
## # Total_Amt_Chng_Q4_Q1 <dbl>, Total_Trans_Amt <dbl>, Total_Trans_Ct <dbl>,
## # Total_Ct_Chng_Q4_Q1 <dbl>, Avg_Utilization_Ratio <dbl>,
## #   Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1 <dbl>, …
str(df)
## .. Attrition_Flag = col_character(),
## .. Customer_Age = col_double(),
## .. Gender = col_character(),
## .. Dependent_count = col_double(),
## .. Education_Level = col_character(),
## .. Marital_Status = col_character(),
## .. Income_Category = col_character(),
## .. Card_Category = col_character(),
## .. Months_on_book = col_double(),
## .. Total_Relationship_Count = col_double(),
## .. Months_Inactive_12_mon = col_double(),
## .. Contacts_Count_12_mon = col_double(),
## .. Credit_Limit = col_double(),
## .. Total_Revolving_Bal = col_double(),
## .. Avg_Open_To_Buy = col_double(),
## .. Total_Amt_Chng_Q4_Q1 = col_double(),
## .. Total_Trans_Amt = col_double(),
## .. Total_Trans_Ct = col_double(),
## .. Total_Ct_Chng_Q4_Q1 = col_double(),
## .. Avg_Utilization_Ratio = col_double(),
## .. Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1 = col_double(),
## .. Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(df)
This gives us a solid starting point for cleaning and transforming the data before modeling.
The data cleaning steps below focus on the why behind each change, not just the what.
1. Remove CLIENTNUM: this is a unique identifier for each customer. It has no predictive value and may introduce noise into the model.
2. Drop the two Naive Bayes prediction columns: these are outputs from a past machine learning model and are not part of the raw customer information. Keeping them would leak information and bias our analysis.
3. Create a binary target variable Churn:
   Customers who have left the credit card service (Attrition_Flag == "Attrited Customer") are labeled as 1.
   Customers who are still active (Attrition_Flag == "Existing Customer") are labeled as 0.
4. Remove the original Attrition_Flag column after creating the binary label.
df <- df %>%
select(-CLIENTNUM, -starts_with("Naive_Bayes")) %>%
mutate(
Churn = ifelse(Attrition_Flag == "Attrited Customer", 1, 0)
) %>%
select(-Attrition_Flag)
This results in a cleaned dataset where every column can contribute meaningfully to predicting churn, without
introducing information leakage or ID noise.
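A quick sanity check (not part of the original chunk) can confirm the new label exists and the leakage columns are gone:

# Counts of stayed (0) vs churned (1) customers
table(df$Churn)
# Should return FALSE: no Naive Bayes columns remain
any(grepl("^Naive_Bayes", names(df)))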
str(df)
df %>%
select(where(is.numeric)) %>%
summary()
colSums(is.na(df))
A result of all zeros confirms the dataset is complete — there are no missing values to
handle.
df %>%
select(
Customer_Age,
Months_on_book,
Credit_Limit,
Avg_Open_To_Buy,
Total_Revolving_Bal,
Total_Amt_Chng_Q4_Q1,
Total_Trans_Amt,
Total_Trans_Ct,
Total_Ct_Chng_Q4_Q1,
Avg_Utilization_Ratio
) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
facet_wrap(~variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(title = "Histograms of Continuous Numeric Features Only")
Observations:
Customer_Age is nearly normal, centered around 46 years, with minor left skew.
Months_on_book is mostly symmetric but with a noticeable spike at 36 — likely
a common tenure.
Credit_Limit and Avg_Open_To_Buy are highly right-skewed — many
customers have small limits while a few have very high credit ceilings.
Total_Revolving_Bal shows a large number of customers at 0 balance,
followed by a flat spread — this indicates many customers pay off their full
balance.
Total_Amt_Chng_Q4_Q1 and Total_Ct_Chng_Q4_Q1 both show a concentrated
unimodal peak, meaning most customers fall in a narrow range of quarter-over-
quarter change.
Total_Trans_Amt and Total_Trans_Ct have a bi-modal or clustered shape,
indicating different usage groups (e.g. low vs high spenders).
Avg_Utilization_Ratio is heavily right-skewed with a long tail, showing most
customers use a small portion of their limit, but a few max it out.
df %>%
select(
Customer_Age,
Months_on_book,
Credit_Limit,
Avg_Open_To_Buy,
Total_Revolving_Bal,
Total_Amt_Chng_Q4_Q1,
Total_Trans_Amt,
Total_Trans_Ct,
Total_Ct_Chng_Q4_Q1,
Avg_Utilization_Ratio
) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(y = value)) +
geom_boxplot(outlier.color = "red") +
facet_wrap(~variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(title = "Boxplots for Outlier Detection (Separate Scales)", x = "", y = "Value")
All numeric variables have been reviewed for distribution, range, and potential outliers.
No missing values were found, so no imputation was required.
Outliers were assessed visually and retained intentionally, as they likely represent meaningful customer
behaviors.
The CLIENTNUM and machine-generated columns were removed to prevent noise or data leakage.
The target variable Churn has been clearly defined as a binary outcome.
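The chunk that produced the frequency tables below is not visible in this export. A minimal sketch, assuming purrr::map with base table() over the remaining character columns (which would match the "." and Freq column names in the output):

library(purrr)

# One frequency table per categorical column
df %>%
  select(where(is.character)) %>%
  map(~ as.data.frame(table(.)))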
## $Gender
## . Freq
## 1 F 5358
## 2 M 4769
##
## $Education_Level
## . Freq
## 1 College 1013
## 2 Doctorate 451
## 3 Graduate 3128
## 4 High School 2013
## 5 Post-Graduate 516
## 6 Uneducated 1487
## 7 Unknown 1519
##
## $Marital_Status
## . Freq
## 1 Divorced 748
## 2 Married 4687
## 3 Single 3943
## 4 Unknown 749
##
## $Income_Category
## . Freq
## 1 $120K + 727
## 2 $40K - $60K 1790
## 3 $60K - $80K 1402
## 4 $80K - $120K 1535
## 5 Less than $40K 3561
## 6 Unknown 1112
##
## $Card_Category
## . Freq
## 1 Blue 9436
## 2 Gold 116
## 3 Platinum 20
## 4 Silver 555
df %>%
select(Gender, Education_Level, Marital_Status, Income_Category, Card_Category, Churn) %>%
pivot_longer(-Churn, names_to = "variable", values_to = "category") %>%
group_by(variable, category) %>%
summarise(
churn_rate = mean(Churn),
count = n(),
.groups = "drop"
) %>%
ggplot(aes(x = reorder(category, -churn_rate), y = churn_rate)) +
geom_col(fill = "steelblue") +
facet_wrap(~variable, scales = "free", ncol = 2) +
coord_flip() +
theme_minimal() +
labs(title = "Churn Rate by Categorical Variables", x = "Category", y = "Churn Rate")
Card_Category: Customers with “Platinum” and “Gold” cards churn more than
those with “Blue” cards, possibly due to higher expectations or unmet premium
service.
Education_Level: Those with lower or “Unknown” education levels show higher
churn, while Doctorate holders churn the least.
Gender: Female customers have a slightly higher churn rate than male
customers.
Income_Category: Lower income groups churn more. “Unknown” income has
surprisingly high churn, possibly indicating risk or disengagement.
Marital_Status: “Single” customers churn the most, followed by “Unknown,”
while “Married” customers are most stable.
df %>%
select(
Churn,
Customer_Age,
Months_on_book,
Credit_Limit,
Total_Trans_Amt,
Avg_Utilization_Ratio
) %>%
pivot_longer(-Churn, names_to = "variable", values_to = "value") %>%
ggplot(aes(x = factor(Churn), y = value, fill = factor(Churn))) +
geom_boxplot() +
facet_wrap(~variable, scales = "free", ncol = 2) +
theme_minimal() +
labs(title = "Numeric Variables by Churn", x = "Churn (0 = No, 1 = Yes)", y = "Value") +
scale_fill_manual(values = c("0" = "steelblue", "1" = "tomato"))
The numeric variable boxplots reveal key churn-related differences:
- Avg_Utilization_Ratio: Churned customers tend to have lower utilization, suggesting less credit engagement before leaving.
- Credit_Limit: Churners generally have lower credit limits; they may represent lower-tier customers with less institutional loyalty.
- Customer_Age: Churned customers are slightly older on average, possibly indicating a late decision to cut services.
- Months_on_book: Churners have shorter relationships with the bank, reinforcing the idea that newer customers churn faster.
- Total_Trans_Amt: This shows the strongest signal; churners spend far less than loyal customers, making it a highly predictive variable.
library(ggcorrplot)

# cor_matrix is computed in a chunk not shown in this export; a plausible
# reconstruction is the correlation of the numeric features:
cor_matrix <- df %>%
  select(where(is.numeric)) %>%
  cor() %>%
  round(2)

ggcorrplot(cor_matrix,
           type = "lower",        # Show lower triangle only
           method = "square",     # Use filled color squares
           lab = TRUE,            # Show correlation values
           lab_size = 2.5,        # Smaller label font
           tl.cex = 10,           # Axis text size
           title = "Clean Correlation Heatmap",
           colors = c("tomato", "gray", "steelblue"),
           ggtheme = theme_minimal())
Correlation Insights:
- Total_Trans_Ct and Total_Trans_Amt show the strongest positive correlation (0.81), which makes sense: more transactions usually mean higher total amounts.
- Credit_Limit is strongly correlated (0.95) with Avg_Open_To_Buy, which is expected since open credit is derived from total limit minus usage.
- Avg_Utilization_Ratio is negatively correlated with Avg_Open_To_Buy (-0.54), meaning high utilization typically reduces available credit.
- Most other variables are weakly correlated (< 0.4), suggesting limited multicollinearity across features.
- This supports the idea that dimensionality reduction (PCA) may not be urgently needed, but could still be tested to simplify modeling.
Numeric Variables
Customer_Age: Slightly older customers (median ~46) tend to churn more.
Credit_Limit and Avg_Open_To_Buy: Right-skewed; churners generally have lower limits and less
available credit.
Total_Trans_Amt and Total_Trans_Ct: Strong predictors — churners spend and transact less.
Avg_Utilization_Ratio: Lower for churners, suggesting reduced credit engagement before churn.
Months_on_book: Churners often have shorter tenure with the bank.
No missing values were found, and outliers were reviewed but retained as they reflect genuine customer
behaviors. Correlation analysis showed no severe multicollinearity, with the exception of expected relationships
(e.g., between Credit_Limit and Avg_Open_To_Buy).
Categorical Variables
Gender: Fairly balanced; females show slightly higher churn.
Education_Level: Lower education and “Unknown” levels are linked to higher churn.
Marital_Status: Single and unknown status customers churn more; married customers churn less.
Income_Category: Lower-income groups and “Unknown” have higher churn risk.
Card_Category: “Platinum” and “Gold” cardholders churn more, though “Blue” dominates in frequency
(93%).
All categorical variables are clean with no NAs, though many include “Unknown” levels. These were retained for
analysis, as they may capture meaningful business signals.
Overall, the EDA shows that churn is associated with lower usage, shorter
relationships, and less financial engagement. These patterns will guide both
feature engineering and model selection in the next steps.
5. Data Transformation
5.1 Categorical Encoding
To prepare the dataset for machine learning, we converted all categorical variables into numerical format using
one-hot encoding:
library(fastDummies)
df_encoded <- df %>%
fastDummies::dummy_cols(remove_first_dummy = TRUE, remove_selected_columns = TRUE)
names(df_encoded)
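The chunk that defines continuous_vars and standardizes those columns is not visible in this export. A minimal sketch, assuming base scale() applied to the ten continuous features examined earlier:

# Names of the continuous features to standardize
continuous_vars <- c(
  "Customer_Age", "Months_on_book", "Credit_Limit", "Avg_Open_To_Buy",
  "Total_Revolving_Bal", "Total_Amt_Chng_Q4_Q1", "Total_Trans_Amt",
  "Total_Trans_Ct", "Total_Ct_Chng_Q4_Q1", "Avg_Utilization_Ratio"
)

# Center and scale each continuous feature to zero mean, unit variance
df_encoded <- df_encoded %>%
  mutate(across(all_of(continuous_vars), ~ as.numeric(scale(.))))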
df_encoded %>%
select(all_of(continuous_vars)) %>%
summary()
After scaling, all continuous features have zero mean and unit variance, as expected.
Most values fall within ±1, but a few like Total_Ct_Chng_Q4_Q1 and
Total_Amt_Chng_Q4_Q1 have extreme max values, suggesting sharp behavioral
changes for some customers. These may be useful for identifying churn-prone
individuals. No issues seen.
6.2 Clustering
K-Means clustering was applied to the scaled numeric features, including engineered variables, to explore natural
customer groupings.
Method:
We used the Elbow Method to determine the optimal number of clusters.
The elbow clearly appears at k = 2, indicating two main clusters in the data.
K-Means was then applied with k = 3 to capture more nuanced subgroup patterns.
# Libraries needed
library(cluster)
library(factoextra)
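The elbow-method and K-Means chunks are not visible in this export. A minimal sketch of the procedure described above, where df_scaled is a hypothetical data frame holding the scaled numeric features:

# Elbow method: total within-cluster sum of squares for k = 1..10
fviz_nbclust(df_scaled, kmeans, method = "wss", k.max = 10)

# Fit K-Means with k = 3; nstart > 1 guards against poor random starts
set.seed(123)
km_result <- kmeans(df_scaled, centers = 3, nstart = 25)

# Project the clusters onto the first two principal components
fviz_cluster(km_result, data = df_scaled)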
Insights:
The PCA cluster plot shows substantial overlap between Cluster 2 and Cluster 3, meaning their behaviors are very similar.
Cluster 1 appears somewhat distinct, possibly representing a higher-risk or low-engagement group.
However, churn labels were spread across all clusters with no clear separation, meaning clustering does
not improve churn classification directly.
As a result, clusters were not included in the final model, but still revealed behavior-based segmentation
valuable for business profiling or targeted retention strategies.
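The prcomp() call itself is not visible in this export. A minimal sketch, assuming PCA is run on the encoded features with the target excluded:

# PCA on all predictors; scaling inside prcomp keeps features comparable
pca_result <- prcomp(df_encoded %>% select(-Churn), center = TRUE, scale. = TRUE)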
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0213 1.5761 1.39823 1.35563 1.28981 1.1789 1.17517
## Proportion of Variance 0.1135 0.0690 0.05431 0.05105 0.04621 0.0386 0.03836
## Cumulative Proportion 0.1135 0.1825 0.23680 0.28785 0.33406 0.3727 0.41102
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.16294 1.10814 1.09413 1.08527 1.07555 1.06261 1.03794
## Proportion of Variance 0.03757 0.03411 0.03325 0.03272 0.03213 0.03136 0.02993
## Cumulative Proportion 0.44859 0.48270 0.51596 0.54867 0.58081 0.61217 0.64210
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 1.02467 1.01451 1.00816 1.00177 0.99606 0.99362 0.99098
## Proportion of Variance 0.02917 0.02859 0.02823 0.02788 0.02756 0.02742 0.02728
## Cumulative Proportion 0.67126 0.69985 0.72808 0.75596 0.78352 0.81094 0.83822
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.97838 0.94985 0.9354 0.78938 0.76638 0.69254 0.5911
## Proportion of Variance 0.02659 0.02506 0.0243 0.01731 0.01632 0.01332 0.0097
## Cumulative Proportion 0.86481 0.88987 0.9142 0.93149 0.94780 0.96112 0.9708
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.47349 0.45842 0.41050 0.38823 0.35393 0.34826 0.22361
## Proportion of Variance 0.00623 0.00584 0.00468 0.00419 0.00348 0.00337 0.00139
## Cumulative Proportion 0.97706 0.98289 0.98758 0.99176 0.99524 0.99861 1.00000
## PC36
## Standard deviation 2.578e-15
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
Results:
- The first 10 components explain ~52% of the variance, and the first 20 explain about 81% (the cumulative proportion reaches 86% at PC22).
- This shows that dimensionality can be reduced from 36 to roughly 20 features while retaining most of the information.
- However, PCA components lack interpretability.

Thus, PCA will be retained for optional use in models sensitive to dimensionality (like Logistic Regression), but not used in tree-based models like Random Forest, which handle multicollinearity internally.
7. Modeling
7.1 Model Selection
Two models were selected for their strengths, informed by earlier PCA and clustering steps:
1. Logistic Regression (PCA-enhanced) Used top principal components to reduce dimensionality, eliminate
multicollinearity, and improve interpretability.
2. Random Forest Chosen for its ability to handle complex interactions and work well on full feature sets,
including clustering-related patterns.
set.seed(123)
library(caret)
##
## Attaching package: 'caret'
library(randomForest)
## randomForest 4.7-1.2
##
## Attaching package: 'randomForest'
library(rpart)
library(class)
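The chunks that split the data and fit the four models fall on pages that did not survive this export. A minimal sketch of the setup implied by the evaluation code below; the split ratio, the k for k-NN, and the plain (non-PCA) logistic regression are assumptions, and the variable names (train_data, test_data, log_cm, knn_preds, knn_test_labels) mirror those used later:

# One-hot column names can contain spaces/symbols; make them syntactic first
df_model <- df_encoded
names(df_model) <- make.names(names(df_model))

# Stratified 80/20 train/test split
set.seed(123)
train_idx <- createDataPartition(as.factor(df_model$Churn), p = 0.8, list = FALSE)
train_data <- df_model[train_idx, ]
test_data <- df_model[-train_idx, ]
train_data$Churn <- as.factor(train_data$Churn)

# 1. Logistic Regression (the report's variant used top principal components;
#    a full-feature version is shown here for brevity)
log_model <- glm(Churn ~ ., data = train_data, family = binomial)
log_probs <- predict(log_model, newdata = test_data, type = "response")
log_preds <- factor(ifelse(log_probs > 0.5, 1, 0), levels = c("0", "1"))
log_cm <- confusionMatrix(log_preds, as.factor(test_data$Churn), positive = "1")

# Fit the remaining models evaluated below
rf_model <- randomForest(Churn ~ ., data = train_data)
dt_model <- rpart(Churn ~ ., data = train_data, method = "class")

# k-NN needs numeric feature matrices plus a separate label vector
knn_test_labels <- as.factor(test_data$Churn)
knn_preds <- knn(
  train = train_data %>% select(-Churn),
  test = test_data %>% select(-Churn),
  cl = train_data$Churn,
  k = 5  # assumed; the untuned k is not shown in this export
)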
# 2. Random Forest
rf_preds <- predict(rf_model, newdata = test_data)
rf_cm <- confusionMatrix(
data = rf_preds,
reference = as.factor(test_data$Churn),
positive = "1"
)
# 3. Decision Tree
dt_preds <- predict(dt_model, newdata = test_data, type = "class")
dt_cm <- confusionMatrix(
data = dt_preds,
reference = as.factor(test_data$Churn),
positive = "1"
)
# 4. K-Nearest Neighbors
knn_cm <- confusionMatrix(
data = knn_preds,
reference = knn_test_labels,
positive = "1"
)
print(log_cm)
print(rf_cm)
print(dt_cm)
print("k-NN Results")
print(knn_cm)
Metric comparison of the four models (Accuracy, Precision, Recall, F1-Score for Logistic Regression, Random Forest, Decision Tree, and K-Nearest Neighbors); the table body did not survive this export, and the key figures are quoted in the insights below.
Insights:
Random Forest clearly outperformed the other models across almost all metrics. It achieved the highest
accuracy (94.8%), sensitivity (76%), and kappa score (0.79), indicating strong predictive power and
consistency.
Decision Tree performed very closely to Random Forest, especially in sensitivity (76.4%) and balanced
accuracy (86.2%), but was slightly less stable (lower kappa).
Logistic Regression, although interpretable and quick, struggled with low sensitivity (37.4%). It missed a
large number of churners — which is risky in a churn prediction context.
k-NN, while better than logistic regression in sensitivity (48.4%), also lagged behind ensemble-based
models. It was more prone to false negatives and was sensitive to scaling and feature noise.
Logistic Regression with PCA, though fast, is not suitable alone due to high false negatives.
PCA helped reduce dimensionality for logistic regression but did not close the performance gap compared
to tree-based models.
set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)
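The train() calls themselves are not visible in this export. A minimal sketch of the tuning implied by the printed results, assuming caret methods "rf", "rpart", and "knn" plus cv.glmnet for the L1-penalized logistic regression; the tuning grids here are illustrative guesses:

# Random Forest: tune the number of variables tried at each split
rf_tuned <- train(Churn ~ ., data = train_data, method = "rf",
                  trControl = ctrl,
                  tuneGrid = expand.grid(mtry = c(2, 4, 6, 8, 10)))

# Decision Tree: tune the complexity parameter cp
dt_tuned <- train(Churn ~ ., data = train_data, method = "rpart",
                  trControl = ctrl,
                  tuneGrid = expand.grid(cp = c(0.001, 0.01, 0.1)))

# k-NN: tune the number of neighbors k
knn_tuned <- train(Churn ~ ., data = train_data, method = "knn",
                   trControl = ctrl,
                   tuneGrid = expand.grid(k = c(3, 5, 7, 9)))

# L1-penalized logistic regression; lambda.min is chosen by cross-validation
library(glmnet)
x_train <- model.matrix(Churn ~ ., data = train_data)[, -1]
log_l1 <- cv.glmnet(x_train, train_data$Churn, family = "binomial", alpha = 1)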
print(rf_tuned$bestTune)
## mtry
## 5 10
print(max(rf_tuned$results$Accuracy))
## [1] 0.9643301
##
## Logistic Regression (L1) Best Lambda:
print(log_l1$lambda.min)
## [1] 0.001112295
## [1] 0.8780247
##
## Decision Tree Best cp:
print(dt_tuned$bestTune)
## cp
## 1 0.001
print(max(dt_tuned$results$Accuracy))
## [1] 0.9432259
##
## KNN Best k:
print(knn_tuned$bestTune)
## k
## 1 3
cat("KNN Accuracy:\n")
## KNN Accuracy:
print(max(knn_tuned$results$Accuracy))
## [1] 0.8864454
After applying hyperparameter tuning to each model, we observed the following results:
Insights:
Random Forest achieved the highest accuracy after tuning, confirming its robustness in handling complex,
high-dimensional data.
Decision Tree also performed well after tuning, achieving over 94% accuracy with a small complexity
parameter.
KNN performed reasonably but still underperformed compared to tree-based models.
Logistic Regression (with PCA and L1 penalty) was fast and interpretable but had the lowest accuracy.
Still, its performance is acceptable for baseline modeling.
Conclusion:
Random Forest remains the top choice for production deployment based on accuracy (96.43%). However,
Decision Tree may be favored in scenarios where simpler model interpretability is needed. PCA helped in
simplifying data for logistic regression, but with limited performance gains.
The table compares the performance of the four models before and after hyperparameter tuning. Random Forest showed the largest improvement, reaching 96.43% accuracy with mtry = 10. Logistic Regression saw minimal gain, but all models benefited from tuning, especially the Decision Tree, which rose from 92.84% to 94.32%.