NAME:PRATHAM
ROLL NO:23126039
OBJECTIVE: The assignment covers the following key aspects:
Model Development: develop Logistic Regression, SVM, and KNN models.
Performance Comparison: compare the performance of these models on a common test set.
Dataset: work with the personal loan default prediction dataset.
Hyperparameter Tuning & Regularization: explore hyperparameter tuning and regularization for each model.
Model Evaluation: use the F1 score as the primary evaluation metric (a short illustrative snippet follows this list).
Model Selection: select the optimal model based on the evaluation results.
Generalization: check that the selected model generalizes from the training data to unseen data.
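As a quick reference, the F1 score used throughout is the harmonic mean of precision and recall. The snippet below is a minimal sketch with made-up toy labels (not from the loan dataset) showing that sklearn's f1_score matches the hand-computed value.
from sklearn.metrics import f1_score, precision_score, recall_score
# Toy labels purely for illustration; the real evaluation uses the test split below.
y_true_toy = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 1, 1, 1, 0, 0, 1, 0]
p = precision_score(y_true_toy, y_pred_toy)
r = recall_score(y_true_toy, y_pred_toy)
f1_manual = 2 * p * r / (p + r)  # harmonic mean of precision and recall
print(f1_manual, f1_score(y_true_toy, y_pred_toy))  # both print 0.75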
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
import pandas as pd
from sklearn.model_selection import train_test_split
# Load the dataset
df = pd.read_csv("/content/drive/MyDrive/IML lab/lab6/loan_data.csv")  # Please upload the loan_data.csv file to the Colab environment
# Dataset Explanation
print(df.head())
print(df.info())
print(df.describe())
# Explanation:
# The dataset contains information about personal loans, including:
# - person_age: Age of the borrower.
# - person_gender: Gender of the borrower.
# - person_education: Education level of the borrower.
# - person_income: Annual income of the borrower.
# - person_emp_exp: Employment experience in years.
# - person_home_ownership: Home ownership status.
# - loan_amnt: Loan amount.
# - loan_intent: Purpose of the loan.
# - loan_int_rate: Interest rate of the loan.
# - loan_percent_income: Loan amount as a percentage of income.
# - cb_person_cred_hist_length: Credit history length.
# - credit_score: credit score.
# - previous_loan_defaults_on_file: if the person has previous loan defaults.
# - loan_status: Loan default status (0 = No default, 1 = Default). This is the target variable.
# Train-Test Split
X = df.drop('loan_status', axis=1)
y = df['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Train set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)
10 cb_person_cred_hist_length 45000 non-null float64
11 credit_score 45000 non-null int64
12 previous_loan_defaults_on_file 45000 non-null object
13 loan_status 45000 non-null int64
dtypes: float64(6), int64(3), object(5)
memory usage: 4.8+ MB
None
person_age person_income person_emp_exp loan_amnt \
count 45000.000000 4.500000e+04 45000.000000 45000.000000
mean 27.764178 8.031905e+04 5.410333 9583.157556
std 6.045108 8.042250e+04 6.063532 6314.886691
min 20.000000 8.000000e+03 0.000000 500.000000
25% 24.000000 4.720400e+04 1.000000 5000.000000
50% 26.000000 6.704800e+04 4.000000 8000.000000
75% 30.000000 9.578925e+04 8.000000 12237.250000
max 144.000000 7.200766e+06 125.000000 35000.000000
loan_int_rate loan_percent_income cb_person_cred_hist_length \
count 45000.000000 45000.000000 45000.000000
mean 11.006606 0.139725 5.867489
std 2.978808 0.087212 3.879702
min 5.420000 0.000000 2.000000
25% 8.590000 0.070000 3.000000
50% 11.010000 0.120000 4.000000
75% 12.990000 0.190000 8.000000
max 20.000000 0.660000 30.000000
credit_score loan_status
count 45000.000000 45000.000000
mean 632.608756 0.222222
std 50.435865 0.415744
min 390.000000 0.000000
25% 601.000000 0.000000
50% 640.000000 0.000000
75% 670.000000 0.000000
max 850.000000 1.000000
Train set shape: (36000, 13) (36000,)
Test set shape: (9000, 13) (9000,)
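Since only about 22% of the records are defaults (see the mean of loan_status in the describe() output above), accuracy alone would be misleading, which is why F1 is used as the primary metric. A small sanity check, assuming the y and y_test variables defined above:
import numpy as np
from sklearn.metrics import f1_score
# Class balance of the target: roughly 78% non-default vs 22% default.
print(y.value_counts(normalize=True))
# A trivial "always predict default" baseline for reference (not a real model).
baseline_pred = np.ones_like(y_test)
print("Always-default baseline F1:", f1_score(y_test, baseline_pred))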
3. Logistic Regression Model Development
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score
# Preprocessing
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])
# Logistic Regression Pipeline
logistic_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42))
])
# Train the model
logistic_pipeline.fit(X_train, y_train)
# Predictions
y_train_pred = logistic_pipeline.predict(X_train)
y_test_pred = logistic_pipeline.predict(X_test)
# F1 Score
train_f1 = f1_score(y_train, y_train_pred)
test_f1 = f1_score(y_test, y_test_pred)
print(f"Train F1 Score: {train_f1}")
print(f"Test F1 Score: {test_f1}")
Train F1 Score: 0.7642629227823867
Test F1 Score: 0.7583926754832147
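F1 summarises precision and recall into a single number; to see the trade-off behind the scores above, the confusion matrix and per-class report can also be printed. A supplementary check reusing the fitted logistic_pipeline and the test predictions from above:
from sklearn.metrics import classification_report, confusion_matrix
# Raw confusion matrix and per-class precision/recall/F1 on the test split.
print(confusion_matrix(y_test, y_test_pred))
print(classification_report(y_test, y_test_pred, digits=4))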
4. Regularization
# Regularized Logistic Regression (L1 and L2)
l1_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', penalty='l1', random_state=42))
])
l2_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', penalty='l2', random_state=42))
])
l1_pipeline.fit(X_train, y_train)
l2_pipeline.fit(X_train, y_train)
l1_test_pred = l1_pipeline.predict(X_test)
l2_test_pred = l2_pipeline.predict(X_test)
l1_test_f1 = f1_score(y_test, l1_test_pred)
l2_test_f1 = f1_score(y_test, l2_test_pred)
print(f"L1 Regularization Test F1 Score: {l1_test_f1}")
print(f"L2 Regularization Test F1 Score: {l2_test_f1}")
L1 Regularization Test F1 Score: 0.7578144853875477
L2 Regularization Test F1 Score: 0.7583926754832147
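L1 regularization can drive some coefficients to exactly zero, while L2 only shrinks them toward zero. The sketch below counts zero coefficients in each fitted pipeline (assuming the l1_pipeline and l2_pipeline objects above); at the default C=1 the difference may be small, which is consistent with the nearly identical F1 scores.
import numpy as np
# Count exactly-zero coefficients to see the sparsity effect of L1 vs L2.
l1_coefs = l1_pipeline.named_steps['classifier'].coef_.ravel()
l2_coefs = l2_pipeline.named_steps['classifier'].coef_.ravel()
print("Zero coefficients with L1:", np.sum(l1_coefs == 0), "of", l1_coefs.size)
print("Zero coefficients with L2:", np.sum(l2_coefs == 0), "of", l2_coefs.size)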
5. Varying λ (C in Logistic Regression)
results = []
C_values = [0.001, 0.01, 0.1, 1, 10, 100]
for C in C_values:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(solver='liblinear', C=C, random_state=42))
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    results.append({'C': C, 'Test F1 Score': f1})
results_df = pd.DataFrame(results)
print(results_df)
C Test F1 Score
0 0.001 0.714126
1 0.010 0.751487
2 0.100 0.757252
3 1.000 0.758393
4 10.000 0.757814
5 100.000 0.757814
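In scikit-learn, C is the inverse of the regularization strength λ (smaller C means stronger regularization), which is why the smallest C above underfits and gives the lowest F1 score. For readability, a λ column can be added to the results table, a small sketch reusing results_df from above:
# C is the inverse regularization strength, so lambda = 1 / C.
results_df['lambda'] = 1 / results_df['C']
print(results_df[['C', 'lambda', 'Test F1 Score']])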
6. Comparison with Inbuilt Model
# Inbuilt Logistic Regression
inbuilt_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))  # uses lbfgs as the default solver and l2 as the default penalty
])
inbuilt_pipeline.fit(X_train, y_train)
inbuilt_test_pred = inbuilt_pipeline.predict(X_test)
inbuilt_test_f1 = f1_score(y_test, inbuilt_test_pred)
print(f"Inbuilt Logistic Regression Test F1 Score: {inbuilt_test_f1}")
# The deviation is likely due to the different default solver and regularization settings used by the inbuilt model compared to our liblinear-based pipeline.
Inbuilt Logistic Regression Test F1 Score: 0.7585856016280844
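To confirm that the small deviation comes from the solver rather than the penalty, the default pipeline can be refit with the same liblinear solver used earlier; the score should then match the L2 result from the regularization section. A hedged check, not part of the original assignment output:
# Refit the "inbuilt" model with the same solver as the custom pipeline.
check_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', penalty='l2', random_state=42))
])
check_pipeline.fit(X_train, y_train)
print("liblinear + L2 Test F1:", f1_score(y_test, check_pipeline.predict(X_test)))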
7. SVM Implementation and Hyperparameter Tuning
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline # Import the Pipeline class
svm_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC(random_state=42))
])
svm_pipeline.fit(X_train, y_train)
svm_test_pred = svm_pipeline.predict(X_test)
svm_test_f1 = f1_score(y_test, svm_test_pred)
print(f"SVM Test F1 Score: {svm_test_f1}")
svm_results = []
C_values_svm = [0.1, 1, 10, 100]
for C in C_values_svm:
    svm_pipeline_tuned = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', SVC(C=C, random_state=42))
    ])
    svm_pipeline_tuned.fit(X_train, y_train)
    y_pred_svm = svm_pipeline_tuned.predict(X_test)
    f1_svm = f1_score(y_test, y_pred_svm)
    svm_results.append({'C': C, 'Test F1 Score': f1_svm})
svm_results_df = pd.DataFrame(svm_results)
print(svm_results_df)
SVM Test F1 Score: 0.8013716697441309
C Test F1 Score
0 0.1 0.781457
1 1.0 0.801372
2 10.0 0.804227
3 100.0 0.786705
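The loop above varies only C with the default RBF kernel. A fuller (and slower) search could also tune gamma and the kernel via cross-validated grid search. The sketch below uses GridSearchCV with F1 scoring; it was not run as part of the original results, and the parameter grid is only an illustrative choice.
from sklearn.model_selection import GridSearchCV
# Cross-validated grid search over C, gamma, and kernel, scored by F1.
svm_grid = GridSearchCV(
    estimator=Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', SVC(random_state=42))]),
    param_grid={
        'classifier__C': [0.1, 1, 10, 100],
        'classifier__gamma': ['scale', 0.01, 0.1],
        'classifier__kernel': ['rbf', 'linear'],
    },
    scoring='f1',
    cv=3,
    n_jobs=-1,
)
# svm_grid.fit(X_train, y_train)  # expensive on 36,000 rows; uncomment to run
# print(svm_grid.best_params_, svm_grid.best_score_)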
8. KNN Implementation
from sklearn.neighbors import KNeighborsClassifier
knn_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier())
])
knn_pipeline.fit(X_train, y_train)
knn_test_pred = knn_pipeline.predict(X_test)
knn_test_f1 = f1_score(y_test, knn_test_pred)
print(f"KNN Test F1 Score: {knn_test_f1}")
KNN Test F1 Score: 0.7477572559366754
9. KNN Hyperparameter Tuning
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd
# Assuming X_train, X_test, y_train, y_test, and preprocessor are already defined from previous steps
knn_results = []
neighbors = [3, 5, 7, 9]
distance_metrics = ['euclidean', 'manhattan', 'minkowski']
for n in neighbors:
    for metric in distance_metrics:
        knn_pipeline_tuned = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('classifier', KNeighborsClassifier(n_neighbors=n, metric=metric))
        ])
        knn_pipeline_tuned.fit(X_train, y_train)
        y_pred_knn = knn_pipeline_tuned.predict(X_test)
        f1_knn = f1_score(y_test, y_pred_knn)
        knn_results.append({'Neighbors': n, 'Distance Metric': metric, 'Test F1 Score': f1_knn})
knn_results_df = pd.DataFrame(knn_results)
print(knn_results_df)
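The grid above evaluates every setting directly on the held-out test set. An alternative, sketched below under the assumption that the same preprocessor is reused, is to select n_neighbors and the metric by cross-validation on the training set and report test F1 only for the chosen setting:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
# Choose KNN hyperparameters by cross-validation instead of peeking at the test set.
knn_grid = GridSearchCV(
    estimator=Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', KNeighborsClassifier())]),
    param_grid={
        'classifier__n_neighbors': [3, 5, 7, 9],
        'classifier__metric': ['euclidean', 'manhattan'],
    },
    scoring='f1',
    cv=5,
    n_jobs=-1,
)
knn_grid.fit(X_train, y_train)
print("Best KNN params:", knn_grid.best_params_)
print("Test F1 with best params:", f1_score(y_test, knn_grid.predict(X_test)))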
10. Conclusion
# Compare the performance of Logistic Regression, SVM, and KNN
logistic_pipeline.fit(X_train, y_train)
logistic_test_pred = logistic_pipeline.predict(X_test)
logistic_test_f1 = f1_score(y_test, logistic_test_pred)
svm_pipeline.fit(X_train, y_train)
svm_test_pred = svm_pipeline.predict(X_test)
svm_test_f1 = f1_score(y_test, svm_test_pred)
knn_pipeline.fit(X_train, y_train)
knn_test_pred = knn_pipeline.predict(X_test)
knn_test_f1 = f1_score(y_test, knn_test_pred)
print(f"Logistic Regression Test F1 Score: {logistic_test_f1}")
print(f"SVM Test F1 Score: {svm_test_f1}")
print(f"KNN Test F1 Score: {knn_test_f1}")
# Conclusion:
# Based on the F1 scores, we can compare the performance of the three models:
# - Logistic Regression: 0.7584
# - Support Vector Machine (SVM): 0.8014
# - K-Nearest Neighbors (KNN): 0.7478
# Based on the results obtained, the best performing model for this dataset is SVM, with Logistic Regression close behind; the KNN pipeline gave the lowest F1 score of the three.
# Generally, SVM provided the highest F1 score in most of the cases. Logistic Regression also provided good scores, and is much faster to train than SVM.
Neighbors Distance Metric Test F1 Score
0 3 euclidean 0.742546
1 3 manhattan 0.743081
2 3 minkowski 0.742546
3 5 euclidean 0.747757
4 5 manhattan 0.756285
5 5 minkowski 0.747757
6 7 euclidean 0.759500
7 7 manhattan 0.760986
8 7 minkowski 0.759500
9 9 euclidean 0.760481
10 9 manhattan 0.764263
11 9 minkowski 0.760481
Logistic Regression Test F1 Score: 0.7583926754832147
SVM Test F1 Score: 0.8013716697441309
KNN Test F1 Score: 0.7477572559366754
The SVM (Support Vector Machine) model has the highest F1 score (0.8013716697441309), making it the best-performing model among the three.
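For completeness, the three test scores can be gathered into a single small table, a presentation-only sketch using the variables computed in the conclusion cell above:
# Summarise the final test F1 scores in one table, sorted from best to worst.
summary = pd.DataFrame({
    'Model': ['Logistic Regression', 'SVM', 'KNN'],
    'Test F1 Score': [logistic_test_f1, svm_test_f1, knn_test_f1],
}).sort_values('Test F1 Score', ascending=False)
print(summary)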