Unsupervised Credit Scoring Models
Bachelor of Science in Computer Science
Prepared by:
Tran Anh Vu V202100569
Vu Duy Tung V202100528
Nguyen Canh Huy V202100401
Under the supervision of:
Instructor’s Name: Prof. Doan Dang Khoa
Related course: COMP3020 - Machine Learning
COLLEGE OF ENGINEERING AND COMPUTER SCIENCE
VINUNIVERSITY
November 2025
Team Members and Roles
- Tran Anh Vu (V202100569), Data Analyst: analyze and preprocess data; host weekly meetings.
- Vu Duy Tung (V202100528): explore supervised learning algorithms; build a pipeline of models; consider future directions.
- Nguyen Canh Huy (V202100401), Domain-guided Developer and Project Facilitator: literature review; manage team progress and task completion.
Note: Roles may change flexibly as the work progresses.
I. Data Acquisition
As mentioned in the previous report, the credit dataset collected from customers' SMS
information lacked labels, which caused several difficulties not only in model training but also in
model evaluation, because there was no ground truth against which to verify our predictions.
Additionally, since the features were extracted solely from SMS data, their information content
often overlapped, so the amount of genuinely useful signal they provided was low. This lack of
diversity and completeness in the features limited the quality of the analysis and the reliability of the results.
Therefore, in this step, we decided to use a public dataset that is labeled and contains 20
non-null financial features of users. This dataset provides a more robust foundation for training
and evaluating our models, ensuring better accuracy and reliability in the outcomes.
● Data Description:
The Statlog (German Credit Data) dataset, donated to the UCI Machine Learning
Repository on November 16, 1994, is a multivariate dataset used for classification tasks in the
social sciences. The dataset contains 1,000 instances, each representing a person, and 20 features
describing their credit risk. The features are a mix of categorical and integer types and contain no
missing values. The dataset aims to classify individuals as good or bad credit risks based on
these attributes.
The dataset is available in two formats: "german.data" with categorical/symbolic
attributes and "german.data-numeric" with numerical attributes. The latter was created by
Strathclyde University for algorithms that require numerical input. The "german.data-numeric"
file includes indicator variables and integer coding for ordered categorical attributes.
The dataset includes information about the individual's checking account status, credit
history, purpose of the credit, credit amount, savings account/bonds details, employment history,
installment rate, marital status, other debtors/guarantors, residence history, property ownership,
age, other installment plans, housing, number of existing credits, job, number of dependents,
telephone registration, and foreign worker status.
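For reference, the dataset can be loaded along the following lines. The UCI download URL and the column names below are our own reading of the dataset documentation, not code from our pipeline:

    # Sketch: load the Statlog (German Credit Data) dataset with pandas.
    # The URL and column names are assumptions based on the UCI docs.
    import pandas as pd

    URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
           "statlog/german/german.data")

    columns = [
        "checking_status", "duration", "credit_history", "purpose",
        "credit_amount", "savings_status", "employment", "installment_rate",
        "personal_status", "other_debtors", "residence_since", "property",
        "age", "other_installment_plans", "housing", "num_existing_credits",
        "job", "num_dependents", "telephone", "foreign_worker", "label",
    ]

    # german.data is whitespace-separated with no header row.
    df = pd.read_csv(URL, sep=r"\s+", header=None, names=columns)
    print(df.shape)               # expected: (1000, 21) -- 20 features + label
    print(df.isna().sum().sum())  # expected: 0 (no missing values)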
II. Progress Description
A. Data Exploration
Figure 1. Distribution of features in the dataset
We began our exploratory data analysis by plotting histograms of the 20 features in our
dataset, which include both numerical and categorical variables (Figure 1). These plots help us
understand the underlying distribution of the data and identify potential issues such as skewness
or imbalance. For continuous variables, such as credit_amount, the histograms reveal a
noticeably right-skewed, long-tailed trend, which indicates the presence of extreme high-value outliers. This observation
is further validated by the boxplot of credit_amount (Figure 2), where several values fall far
outside the interquartile range. Identifying such outliers is crucial as they can disproportionately
influence the performance of machine learning models. On the other hand, the categorical
features exhibit varying degrees of balance in their distributions. While some, such as
foreign_worker, other_debtors, and num_dependents, are highly imbalanced, most categorical
features show a more even distribution. This is encouraging because a balanced dataset generally
provides better model training and evaluation conditions, which may lead to more reliable
predictions.
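A minimal sketch of how plots like Figures 1 and 2 can be produced with pandas and matplotlib is shown below; the styling choices are illustrative, not our original plotting code (df is the DataFrame from the loading sketch above):

    # Sketch: reproduce the exploratory plots. df.hist covers the numeric
    # columns; categorical columns can be bar-plotted via value_counts.
    import matplotlib.pyplot as plt

    # Histograms of the features (Figure 1).
    df.hist(figsize=(16, 12), bins=30)
    plt.tight_layout()
    plt.show()

    # Boxplot of credit_amount to surface extreme outliers (Figure 2).
    df.boxplot(column="credit_amount")
    plt.show()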
Figure 2. Boxplot for ‘credit_amount’ values
To gain deeper insights, we also created pairwise distribution plots for each pair of
features, with points color-coded by label, to visualize whether the target classes are separable
across feature pairs. This visualization provides an intuitive understanding of how well
individual features or combinations of features can distinguish between creditworthy and
non-creditworthy applicants. Additionally, a correlation map (Figure 4) was generated to
examine relationships among features and their correlations with the target variable. Features that
are highly correlated with the target variable are more likely to have predictive power, while
those that are strongly correlated with one another may introduce redundancy into the model.
Figure 3. Pairwise distribution of features colored by labels
As seen in Figure 3, the distributions of most features remain largely unchanged
regardless of the target label, suggesting that the features alone may not provide strong
separability between classes. This observation implies that applying dimensionality reduction
techniques, such as PCA, to simplify the dataset and reduce computational cost might negatively
impact model performance, as it could obscure subtle patterns in the data. Instead, representation
learning techniques, such as using kernels to generate new feature dimensions where the data is
more separable, may prove to be a more effective strategy. These newly created dimensions can
capture complex relationships between features and help improve model performance. By
identifying these patterns early in the analysis, we can better tailor feature engineering and
modeling techniques to optimize results.
Figure 4. Correlation matrix among features in the dataset
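The kernel-based representation idea discussed above could be prototyped roughly as follows, using scikit-learn's KernelPCA; the RBF kernel, its gamma, and the number of components are illustrative assumptions rather than settings from our experiments:

    # Sketch: map the scaled numeric features into a kernel-induced space
    # where the classes may be more separable. X is a numeric feature
    # matrix (e.g., after the encoding described in the next subsection);
    # kernel, gamma, and n_components are illustrative assumptions.
    from sklearn.decomposition import KernelPCA
    from sklearn.preprocessing import StandardScaler

    X_scaled = StandardScaler().fit_transform(X)
    kpca = KernelPCA(n_components=10, kernel="rbf", gamma=0.1)
    X_kernel = kpca.fit_transform(X_scaled)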
B. Data Preprocessing
Building on the preprocessing techniques outlined in the previous report, we
implemented a z-score outlier removal algorithm to eliminate instances with values that deviate
significantly from the mean. This step ensures that extreme outliers, which could negatively
impact model performance, are excluded from the dataset. Unlike with the previous dataset,
missing-value imputation was not required here, since every feature in the new dataset is
complete. This allowed us to focus on encoding the categorical features for better compatibility
with machine learning models.
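A sketch of the outlier-removal step is given below; the cutoff of 3 standard deviations and the choice of numeric columns are common conventions assumed here, not necessarily the exact values used in our code:

    # Sketch: drop rows whose z-score exceeds a threshold on any numeric
    # column; the 3-standard-deviation cutoff is a common convention.
    def remove_outliers_zscore(frame, numeric_cols, threshold=3.0):
        z = (frame[numeric_cols] - frame[numeric_cols].mean()) / frame[numeric_cols].std()
        return frame[(z.abs() <= threshold).all(axis=1)]

    numeric_cols = ["duration", "credit_amount", "age"]  # illustrative subset
    df_clean = remove_outliers_zscore(df, numeric_cols)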
To prepare the dataset for further experiments, we encoded several categorical features
into numeric values, as most machine learning algorithms require numerical inputs. The two
primary encoding methods used were Leave-One-Out (LOO) target encoding and one-hot
encoding. LOO target encoding preserves some information about the relationship between the
feature and the target variable, making it especially useful for models sensitive to target
correlations. On the other hand, one-hot encoding is a standard approach for converting
categorical variables into binary vectors, ensuring that no ordinal relationship is assumed
between categories. By applying these preprocessing steps, we aimed to create a clean and
well-prepared dataset suitable for robust model testing and experimentation.
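For illustration, the two encodings might be applied as in the sketch below. We assume the category_encoders package for LOO target encoding and pandas for one-hot encoding; the assignment of columns to each encoder is a hypothetical example:

    # Sketch: encode categorical features. LeaveOneOutEncoder is from the
    # category_encoders package; which columns get which encoding is a
    # hypothetical assignment for illustration.
    import pandas as pd
    import category_encoders as ce

    loo_cols = ["purpose", "credit_history"]      # assumed LOO-encoded columns
    onehot_cols = ["housing", "personal_status"]  # assumed one-hot columns

    y = df_clean["label"]
    X = df_clean.drop(columns=["label"])
    X = ce.LeaveOneOutEncoder(cols=loo_cols).fit_transform(X, y)
    X = pd.get_dummies(X, columns=onehot_cols)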
C. Training Models
1) Experimental Setup
The objective of this experiment is to predict credit risk based on historical and metadata
features. The dataset consists of two labels: 1 (bad credit risk) and 2 (good credit risk). To
evaluate multiple machine learning models quickly, we selected a diverse set of classifiers:
Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN),
Decision Tree Classifier (CART), Gaussian Naive Bayes (NB), Random Forest (RF), Support
Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost). Each model was trained
and tested on the dataset, and their recall scores were computed.
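For concreteness, the candidate classifiers can be gathered as follows; hyperparameters are library defaults, and XGBClassifier comes from the separate xgboost package:

    # Sketch: the candidate classifiers compared in this experiment,
    # instantiated with default hyperparameters.
    from sklearn.linear_model import LogisticRegression
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from xgboost import XGBClassifier

    models = {
        "LR": LogisticRegression(max_iter=1000),
        "LDA": LinearDiscriminantAnalysis(),
        "KNN": KNeighborsClassifier(),
        "CART": DecisionTreeClassifier(),
        "NB": GaussianNB(),
        "RF": RandomForestClassifier(),
        "SVM": SVC(),
        "XGBoost": XGBClassifier(),
    }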
The recall metric was prioritized because the task focuses on identifying individuals with
a bad credit risk (label 1). In such a scenario, false negatives (incorrectly classifying bad credit
risks as good) pose a more significant problem than false positives. High recall ensures that most
individuals with bad credit risks are identified, minimizing the risk to financial institutions. The
mean and standard deviation of recall scores across the test folds were computed to assess the
models' consistency. While most models achieved low recall, CART, NB, and XGBoost
performed noticeably better.
After identifying promising models, hyperparameter tuning using Grid Search was
applied to improve the Random Forest model’s performance. The parameter grid included
max_depth, n_estimators, and max_features to balance model complexity and predictive power.
The best recall score for Random Forest after tuning was 0.491, with parameters {'max_depth':
None, 'max_features': 20, 'n_estimators': 5}. The corresponding accuracy was 0.736, with a
confusion matrix indicating 158 true positives, 20 false negatives, 46 false positives, and 26 true
negatives.
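The tuning step might look like the sketch below; grid values other than the reported best parameters (max_depth=None, max_features=20, n_estimators=5) are assumptions, as is the hold-out split:

    # Sketch: grid search over Random Forest hyperparameters, scored by
    # recall. Grid values beyond the reported best ones are assumptions.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Assumed 75/25 stratified hold-out split of the encoded features X, y.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42)

    param_grid = {
        "max_depth": [None, 5, 10],
        "n_estimators": [5, 50, 100],
        "max_features": [5, 10, 20],
    }
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, scoring="recall", cv=10)
    search.fit(X_train, y_train)
    print(search.best_score_, search.best_params_)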
Finally, Gaussian Naive Bayes was evaluated in-depth as it demonstrated the best recall
score among models. Detailed performance metrics showed its recall was 0.53 for the bad credit
risk class, achieving a balance between precision and recall. This makes NB a suitable candidate
for this credit risk prediction task.
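The in-depth look at Gaussian Naive Bayes can be reproduced approximately as follows, reusing the hold-out split assumed above; exact numbers will differ from those reported:

    # Sketch: detailed evaluation of Gaussian Naive Bayes on the assumed
    # hold-out split; prints the confusion matrix and per-class metrics.
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.naive_bayes import GaussianNB

    nb = GaussianNB().fit(X_train, y_train)
    y_pred = nb.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))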
2) Evaluation Metrics
The primary evaluation metric for this task is recall, defined as the ratio of correctly
identified positive cases to the total actual positives. Recall was chosen because the cost of
misclassifying bad credit risks (false negatives) outweighs the cost of misclassifying good credit
risks (false positives). High recall minimizes the likelihood of overlooking individuals with bad
credit, thereby reducing financial risks.
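In confusion-matrix terms, Recall = TP / (TP + FN), where TP is the number of bad-credit applicants correctly flagged and FN is the number of bad-credit applicants the model misses.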
Complementary metrics such as precision, accuracy, and F1-score were also computed to
provide a comprehensive view of model performance. However, the primary focus remained on
recall for decision-making.
To ensure robust evaluation, a cross-validation strategy with 10 folds was employed. The
dataset was split into training and validation sets across folds, and the mean recall was computed
using the following pipeline code:
Figure 5. Pseudo code for evaluation pipeline
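Since the original figure does not reproduce in text, the sketch below is our reconstruction of the pipeline it describes (10-fold cross-validation with mean recall), not the original code:

    # Sketch: 10-fold cross-validated recall for each candidate model.
    # `models` is the dict defined earlier; labels are remapped to {0, 1}
    # with 1 = bad credit (per the report's convention), since XGBoost
    # expects zero-based binary labels.
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    y01 = (y == 1).astype(int)  # 1 = bad credit risk (the positive class)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    for name, model in models.items():
        scores = cross_val_score(model, X, y01, scoring="recall", cv=cv)
        print(f"{name}: recall = {scores.mean():.3f} (+/- {scores.std():.3f})")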
This cross-validation approach mitigates the risk of overfitting and provides a reliable
estimate of model performance on unseen data. The process helps select the model and
hyperparameters best suited for credit risk prediction. Gaussian Naive Bayes emerged as the top
performer, achieving the highest recall score, making it the most effective model for identifying
bad credit risks in the dataset.
D. Discussion
The results of our experiment highlight the challenges and nuances of credit risk
prediction, a domain where identifying high-risk individuals (bad credit risks) is critical for
minimizing financial losses. In this context, the choice of recall as the primary metric aligns with
the domain's requirements. However, achieving a balance between recall and other metrics such
as precision and accuracy remains a challenging task due to the trade-offs inherent in model
performance.
The exploratory evaluation of various machine learning models reveals some important
insights. Simpler models such as Logistic Regression (LR) and K-Nearest Neighbors (KNN)
underperformed in terms of recall, likely because they lack the capacity to capture complex
patterns in the data. Linear Discriminant Analysis (LDA) and Support Vector Machines (SVM),
while theoretically more robust, also struggled with recall, suggesting that the dataset's feature
space may not align well with the linear separability or distributional assumptions required by
these models.
On the other hand, ensemble-based methods and probabilistic models performed better.
The Decision Tree Classifier (CART), Random Forest (RF), and Extreme Gradient Boosting
(XGBoost) achieved higher recall values, albeit with varying degrees of consistency. Gaussian
Naive Bayes (NB) emerged as the best performer for recall, likely due to its ability to model
probabilistic relationships between features and target labels effectively. This result underscores
the importance of leveraging probabilistic reasoning in credit risk prediction, where the dataset
may exhibit noise or class imbalances.
The hyperparameter tuning of Random Forest further emphasized the impact of model
configuration on performance. The best parameters, including a large max_features and a
minimal n_estimators, suggest that the model benefited from leveraging diverse features while
avoiding excessive complexity. Despite this optimization, the Random Forest model’s recall
remained lower than that of Gaussian Naive Bayes, highlighting the latter's suitability for this
task.
A deeper look into the confusion matrix and additional metrics revealed that even the
best-performing models have limitations. For instance, Gaussian Naive Bayes demonstrated a
trade-off between precision and recall, with moderate overall accuracy. These findings reflect the
inherent difficulties of predicting credit risk in real-world scenarios, where the data may be
noisy, imbalanced, or lack sufficient signal for precise classification.
III. Future plans
Moving forward, our focus will be on two key directions to enhance the utility and
applicability of the credit risk prediction system. First, we will explore more advanced feature
selection methods, including techniques such as recursive feature elimination, feature importance
analysis using tree-based models, and unsupervised methods like autoencoders. These
approaches will aim to identify the most predictive features while minimizing noise and
redundancy, leading to more efficient and accurate models.
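As a concrete illustration of this first direction, recursive feature elimination with a tree-based estimator could be prototyped as follows; the estimator, the number of features to keep, and the feature_names variable are placeholders:

    # Sketch: recursive feature elimination ranking features with a
    # Random Forest; n_features_to_select and feature_names are placeholders.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE

    rfe = RFE(RandomForestClassifier(random_state=42), n_features_to_select=10)
    rfe.fit(X, y)
    selected = [name for name, keep in zip(feature_names, rfe.support_) if keep]
    print(selected)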
Second, we aim to develop a general-purpose credit risk prediction model tailored to
industry needs. This model will leverage the most widely available features across datasets to
ensure adaptability and scalability. In parallel, a user-friendly interface (UI) will be designed to
enable seamless interaction for industry stakeholders, offering functionalities such as data input,
model interpretation, and detailed performance reports. This effort will bridge the gap between
advanced predictive analytics and practical financial decision-making.
IV. Communication Method
1. Team Communication and Planning
Communication Platforms: Use Messenger for daily updates and discussions.
Weekly Meetings: Conduct virtual meetings to track progress, upcoming tasks, and potential
blockers.
Shared Calendar: Maintain a project calendar with deadlines, meeting minutes, and events.
2. Planning and Initiation
Scope Definition: Clearly define the project scope to set boundaries and expectations.
Resource Allocation: Identify and assign resources (personnel, technology) from the outset.
3. Agile Framework
Methodology: Implement Agile for flexibility and iterative progress.
Weekly Stand-Ups: Conduct short weekly meetings to align the team and address obstacles.
Sprint Reviews and Retrospectives: Evaluate completed work and processes for continuous
improvement.
4. Tracking Requirements, Risks, and Issues
Project Workspace: Utilize Google Sheets for:
- a comprehensive requirements document outlining scope, objectives, and deliverables;
- a risk register to identify, assess, and monitor potential risks with mitigation strategies;
- an issues log to track challenges encountered during the project lifecycle, documenting resolutions to prevent recurrence.
V. Members' Contributions
Vu Duy Tung:
- Implemented the model pipeline, including the following components: integration of the
feature selection and preprocessing methods from Tran Anh Vu, hyperparameter
initialization and search, the model components, and the evaluation component.
Tran Anh Vu:
- Hosted weekly meetings to discuss the project.
- Collected a new dataset that addresses the disadvantages of the previous unlabeled dataset.
- Visualized and analyzed the data using different methods to draw insights from the raw dataset.
- Carried out the data preprocessing steps.
Nguyen Canh Huy:
- Researched and applied credit-scoring evaluation metrics used by existing creditors
around the world.
- Researched related works and conducted the literature review for the topic.
- Managed the team's progress, tracking and reminding other team members to complete
assigned tasks.
VI. Conclusion
In this study, we explored various machine learning models for credit risk prediction,
focusing on recall as the primary metric to address the critical need for identifying individuals
with bad credit risks. Among the models evaluated, Gaussian Naive Bayes demonstrated the
highest recall, highlighting the effectiveness of probabilistic approaches in handling noisy and
imbalanced datasets. Hyperparameter tuning of Random Forest further emphasized the
importance of model optimization, although its recall remained below that of Gaussian Naive
Bayes. The findings underscore the challenges of achieving a balance between recall, precision,
and overall accuracy in this domain. Our results provide a strong foundation for future
advancements, including the use of advanced feature selection methods and the development of a
generalizable credit risk prediction model with an industry-oriented interface, ensuring both
practical utility and robust performance.