NAME:PRATHAM
ROLL NO:23126039
OBJECTIVE: The assignment covers the following key aspects:
Model Development: develop Logistic Regression, SVM, and KNN models.
Performance Comparison: compare the performance of these models on a common test set.
Dataset: work with the personal loan default prediction dataset.
Hyperparameter Tuning & Regularization: explore hyperparameter tuning and regularization for each model.
Model Evaluation: use the F1 score as the primary evaluation metric (a short illustrative snippet follows this list).
Model Selection: select the optimal model based on the evaluation results.
Generalization: check that the selected model generalizes from the training data to unseen data.
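As a quick reference, the F1 score used throughout is the harmonic mean of precision and recall. The snippet below is a minimal sketch with made-up toy labels (not from the loan dataset) showing that sklearn's f1_score matches the hand-computed value.
from sklearn.metrics import f1_score, precision_score, recall_score
# Toy labels purely for illustration; the real evaluation uses the test split below.
y_true_toy = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 1, 1, 1, 0, 0, 1, 0]
p = precision_score(y_true_toy, y_pred_toy)
r = recall_score(y_true_toy, y_pred_toy)
f1_manual = 2 * p * r / (p + r)  # harmonic mean of precision and recall
print(f1_manual, f1_score(y_true_toy, y_pred_toy))  # both print 0.75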
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
import pandas as pd
from sklearn.model_selection import train_test_split
# Load the dataset
df = pd.read_csv("/content/drive/MyDrive/IML lab/lab6/loan_data.csv")  # Please upload the loan_data.csv file to the Colab environment
# Dataset Explanation
print(df.head())
print(df.info())
print(df.describe())
# Explanation:
# The dataset contains information about personal loans, including:
# - person_age: Age of the borrower.
# - person_gender: Gender of the borrower.
# - person_education: Education level of the borrower.
# - person_income: Annual income of the borrower.
# - person_emp_exp: Employment experience in years.
# - person_home_ownership: Home ownership status.
# - loan_amnt: Loan amount.
# - loan_intent: Purpose of the loan.
# - loan_int_rate: Interest rate of the loan.
# - loan_percent_income: Loan amount as a percentage of income.
# - cb_person_cred_hist_length: Credit history length.
# - credit_score: credit score.
# - previous_loan_defaults_on_file: if the person has previous loan defaults.
# - loan_status: Loan default status (0 = No default, 1 = Default). This is the target variable.
# Train-Test Split
X = df.drop('loan_status', axis=1)
y = df['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Train set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)
10 cb_person_cred_hist_length 45000 non-null float64
11 credit_score 45000 non-null int64
12 previous_loan_defaults_on_file 45000 non-null object
13 loan_status 45000 non-null int64
dtypes: float64(6), int64(3), object(5)
memory usage: 4.8+ MB
None
person_age person_income person_emp_exp loan_amnt \
count 45000.000000 4.500000e+04 45000.000000 45000.000000
mean 27.764178 8.031905e+04 5.410333 9583.157556
std 6.045108 8.042250e+04 6.063532 6314.886691
min 20.000000 8.000000e+03 0.000000 500.000000
25% 24.000000 4.720400e+04 1.000000 5000.000000
50% 26.000000 6.704800e+04 4.000000 8000.000000
75% 30.000000 9.578925e+04 8.000000 12237.250000
max 144.000000 7.200766e+06 125.000000 35000.000000
loan_int_rate loan_percent_income cb_person_cred_hist_length \
count 45000.000000 45000.000000 45000.000000
mean 11.006606 0.139725 5.867489
std 2.978808 0.087212 3.879702
min 5.420000 0.000000 2.000000
25% 8.590000 0.070000 3.000000
50% 11.010000 0.120000 4.000000
75% 12.990000 0.190000 8.000000
max 20.000000 0.660000 30.000000
credit_score loan_status
count 45000.000000 45000.000000
mean 632.608756 0.222222
std 50.435865 0.415744
min 390.000000 0.000000
25% 601.000000 0.000000
50% 640.000000 0.000000
75% 670.000000 0.000000
max 850.000000 1.000000
Train set shape: (36000, 13) (36000,)
Test set shape: (9000, 13) (9000,)
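Since only about 22% of the records are defaults (see the mean of loan_status in the describe() output above), accuracy alone would be misleading, which is why F1 is used as the primary metric. A small sanity check, assuming the y and y_test variables defined above:
import numpy as np
from sklearn.metrics import f1_score
# Class balance of the target: roughly 78% non-default vs 22% default.
print(y.value_counts(normalize=True))
# A trivial "always predict default" baseline for reference (not a real model).
baseline_pred = np.ones_like(y_test)
print("Always-default baseline F1:", f1_score(y_test, baseline_pred))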
3. Logistic Regression Model Development
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score
# Preprocessing
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])
# Logistic Regression Pipeline
logistic_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42))
])
# Train the model
logistic_pipeline.fit(X_train, y_train)
# Predictions
y_train_pred = logistic_pipeline.predict(X_train)
y_test_pred = logistic_pipeline.predict(X_test)
# F1 Score
train_f1 = f1_score(y_train, y_train_pred)
test_f1 = f1_score(y_test, y_test_pred)
print(f"Train F1 Score: {train_f1}")
print(f"Test F1 Score: {test_f1}")
Train F1 Score: 0.7642629227823867
Test F1 Score: 0.7583926754832147
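F1 summarises precision and recall into a single number; to see the trade-off behind the scores above, the confusion matrix and per-class report can also be printed. A supplementary check reusing the fitted logistic_pipeline and the test predictions from above:
from sklearn.metrics import classification_report, confusion_matrix
# Raw confusion matrix and per-class precision/recall/F1 on the test split.
print(confusion_matrix(y_test, y_test_pred))
print(classification_report(y_test, y_test_pred, digits=4))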
4. Regularization
# Regularized Logistic Regression (L1 and L2)
l1_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', penalty='l1', random_state=42))
])
l2_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', penalty='l2', random_state=42))
])
l1_pipeline.fit(X_train, y_train)
l2_pipeline.fit(X_train, y_train)
l1_test_pred = l1_pipeline.predict(X_test)
l2_test_pred = l2_pipeline.predict(X_test)
l1_test_f1 = f1_score(y_test, l1_test_pred)
l2_test_f1 = f1_score(y_test, l2_test_pred)
print(f"L1 Regularization Test F1 Score: {l1_test_f1}")
print(f"L2 Regularization Test F1 Score: {l2_test_f1}")
L1 Regularization Test F1 Score: 0.7578144853875477
L2 Regularization Test F1 Score: 0.7583926754832147
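L1 regularization can drive some coefficients to exactly zero, while L2 only shrinks them toward zero. The sketch below counts zero coefficients in each fitted pipeline (assuming the l1_pipeline and l2_pipeline objects above); at the default C=1 the difference may be small, which is consistent with the nearly identical F1 scores.
import numpy as np
# Count exactly-zero coefficients to see the sparsity effect of L1 vs L2.
l1_coefs = l1_pipeline.named_steps['classifier'].coef_.ravel()
l2_coefs = l2_pipeline.named_steps['classifier'].coef_.ravel()
print("Zero coefficients with L1:", np.sum(l1_coefs == 0), "of", l1_coefs.size)
print("Zero coefficients with L2:", np.sum(l2_coefs == 0), "of", l2_coefs.size)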
5. Varying λ (C in Logistic Regression)
results = []
C_values = [0.001, 0.01, 0.1, 1, 10, 100]
for C in C_values:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(solver='liblinear', C=C, random_state=42))
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    results.append({'C': C, 'Test F1 Score': f1})
results_df = pd.DataFrame(results)
print(results_df)
C Test F1 Score
0 0.001 0.714126
1 0.010 0.751487
2 0.100 0.757252
3 1.000 0.758393
4 10.000 0.757814
5 100.000 0.757814
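In scikit-learn, C is the inverse of the regularization strength λ (smaller C means stronger regularization), which is why the smallest C above underfits and gives the lowest F1 score. For readability, a λ column can be added to the results table, a small sketch reusing results_df from above:
# C is the inverse regularization strength, so lambda = 1 / C.
results_df['lambda'] = 1 / results_df['C']
print(results_df[['C', 'lambda', 'Test F1 Score']])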
6. Comparison with Inbuilt Model
# Inbuilt Logistic Regression
inbuilt_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))  # uses lbfgs as the default solver and l2 as the default penalty
])
inbuilt_pipeline.fit(X_train, y_train)
inbuilt_test_pred = inbuilt_pipeline.predict(X_test)
inbuilt_test_f1 = f1_score(y_test, inbuilt_test_pred)
print(f"Inbuilt Logistic Regression Test F1 Score: {inbuilt_test_f1}")
# The deviation is likely due to the different default solver and regularization settings used by the inbuilt model compared to our liblinear-based pipeline.
Inbuilt Logistic Regression Test F1 Score: 0.7585856016280844
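To confirm that the small deviation comes from the solver rather than the penalty, the default pipeline can be refit with the same liblinear solver used earlier; the score should then match the L2 result from the regularization section. A hedged check, not part of the original assignment output:
# Refit the "inbuilt" model with the same solver as the custom pipeline.
check_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', penalty='l2', random_state=42))
])
check_pipeline.fit(X_train, y_train)
print("liblinear + L2 Test F1:", f1_score(y_test, check_pipeline.predict(X_test)))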
7. SVM Implementation and Hyperparameter Tuning
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline # Import the Pipeline class
svm_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC(random_state=42))
])
svm_pipeline.fit(X_train, y_train)
svm_test_pred = svm_pipeline.predict(X_test)
svm_test_f1 = f1_score(y_test, svm_test_pred)
print(f"SVM Test F1 Score: {svm_test_f1}")
svm_results = []
C_values_svm = [0.1, 1, 10, 100]
for C in C_values_svm:
    svm_pipeline_tuned = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', SVC(C=C, random_state=42))
    ])
    svm_pipeline_tuned.fit(X_train, y_train)
    y_pred_svm = svm_pipeline_tuned.predict(X_test)
    f1_svm = f1_score(y_test, y_pred_svm)
    svm_results.append({'C': C, 'Test F1 Score': f1_svm})
svm_results_df = pd.DataFrame(svm_results)
print(svm_results_df)
SVM Test F1 Score: 0.8013716697441309
C Test F1 Score
0 0.1 0.781457
1 1.0 0.801372
2 10.0 0.804227
3 100.0 0.786705
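The loop above varies only C with the default RBF kernel. A fuller (and slower) search could also tune gamma and the kernel via cross-validated grid search. The sketch below uses GridSearchCV with F1 scoring; it was not run as part of the original results, and the parameter grid is only an illustrative choice.
from sklearn.model_selection import GridSearchCV
# Cross-validated grid search over C, gamma, and kernel, scored by F1.
svm_grid = GridSearchCV(
    estimator=Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', SVC(random_state=42))]),
    param_grid={
        'classifier__C': [0.1, 1, 10, 100],
        'classifier__gamma': ['scale', 0.01, 0.1],
        'classifier__kernel': ['rbf', 'linear'],
    },
    scoring='f1',
    cv=3,
    n_jobs=-1,
)
# svm_grid.fit(X_train, y_train)  # expensive on 36,000 rows; uncomment to run
# print(svm_grid.best_params_, svm_grid.best_score_)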
8. KNN Implementation
from sklearn.neighbors import KNeighborsClassifier
knn_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier())
])
knn_pipeline.fit(X_train, y_train)
knn_test_pred = knn_pipeline.predict(X_test)
knn_test_f1 = f1_score(y_test, knn_test_pred)
print(f"KNN Test F1 Score: {knn_test_f1}")
KNN Test F1 Score: 0.7477572559366754
9. KNN Hyperparameter Tuning
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd
# Assuming X_train, X_test, y_train, y_test, and preprocessor are already defined from previous steps
knn_results = []
neighbors = [3, 5, 7, 9]
distance_metrics = ['euclidean', 'manhattan', 'minkowski']
for n in neighbors:
    for metric in distance_metrics:
        knn_pipeline_tuned = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('classifier', KNeighborsClassifier(n_neighbors=n, metric=metric))
        ])
        knn_pipeline_tuned.fit(X_train, y_train)
        y_pred_knn = knn_pipeline_tuned.predict(X_test)
        f1_knn = f1_score(y_test, y_pred_knn)
        knn_results.append({'Neighbors': n, 'Distance Metric': metric, 'Test F1 Score': f1_knn})
knn_results_df = pd.DataFrame(knn_results)
print(knn_results_df)
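The grid above evaluates every setting directly on the held-out test set. An alternative, sketched below under the assumption that the same preprocessor is reused, is to select n_neighbors and the metric by cross-validation on the training set and report test F1 only for the chosen setting:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
# Choose KNN hyperparameters by cross-validation instead of peeking at the test set.
knn_grid = GridSearchCV(
    estimator=Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', KNeighborsClassifier())]),
    param_grid={
        'classifier__n_neighbors': [3, 5, 7, 9],
        'classifier__metric': ['euclidean', 'manhattan'],
    },
    scoring='f1',
    cv=5,
    n_jobs=-1,
)
knn_grid.fit(X_train, y_train)
print("Best KNN params:", knn_grid.best_params_)
print("Test F1 with best params:", f1_score(y_test, knn_grid.predict(X_test)))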
10. Conclusion
# Compare the performance of Logistic Regression, SVM, and KNN
logistic_pipeline.fit(X_train, y_train)
logistic_test_pred = logistic_pipeline.predict(X_test)
logistic_test_f1 = f1_score(y_test, logistic_test_pred)
svm_pipeline.fit(X_train, y_train)
svm_test_pred = svm_pipeline.predict(X_test)
svm_test_f1 = f1_score(y_test, svm_test_pred)
knn_pipeline.fit(X_train, y_train)
knn_test_pred = knn_pipeline.predict(X_test)
knn_test_f1 = f1_score(y_test, knn_test_pred)
print(f"Logistic Regression Test F1 Score: {logistic_test_f1}")
print(f"SVM Test F1 Score: {svm_test_f1}")
print(f"KNN Test F1 Score: {knn_test_f1}")
# Conclusion:
# Based on the F1 scores, we can compare the performance of the three models:
# - Logistic Regression: 0.7584
# - Support Vector Machine (SVM): 0.8014
# - K-Nearest Neighbors (KNN): 0.7478
# Based on the results obtained, the best performing model for this dataset is SVM, with Logistic Regression close behind; the KNN pipeline gave the lowest F1 score of the three.
# Generally, SVM provided the highest F1 score in most of the cases. Logistic Regression also provided good scores, and is much faster to train than SVM.
Neighbors Distance Metric Test F1 Score
0 3 euclidean 0.742546
1 3 manhattan 0.743081
2 3 minkowski 0.742546
3 5 euclidean 0.747757
4 5 manhattan 0.756285
5 5 minkowski 0.747757
6 7 euclidean 0.759500
7 7 manhattan 0.760986
8 7 minkowski 0.759500
9 9 euclidean 0.760481
10 9 manhattan 0.764263
11 9 minkowski 0.760481
Logistic Regression Test F1 Score: 0.7583926754832147
SVM Test F1 Score: 0.8013716697441309
KNN Test F1 Score: 0.7477572559366754
The SVM (Support Vector Machine) model has the highest F1 score (0.8013716697441309), making it the best-performing model among the three.
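For completeness, the three test scores can be gathered into a single small table, a presentation-only sketch using the variables computed in the conclusion cell above:
# Summarise the final test F1 scores in one table, sorted from best to worst.
summary = pd.DataFrame({
    'Model': ['Logistic Regression', 'SVM', 'KNN'],
    'Test F1 Score': [logistic_test_f1, svm_test_f1, knn_test_f1],
}).sort_values('Test F1 Score', ascending=False)
print(summary)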