Jamboree Case Study
#1. Load and Explore the Data
!gdown https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/001/839/original/Jamboree_Admission.csv
Downloading...
From: https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/001/839/original/Jamboree_Admission.csv
To: /content/Jamboree_Admission.csv
100% 16.2k/16.2k [00:00<00:00, 44.4MB/s]
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.api import OLS, add_constant
# Load the dataset
df = pd.read_csv('Jamboree_Admission.csv')
# Check dataset information
print("Dataset Information:")
print(df.info())
# Check for missing values and duplicates
print("\nMissing Values per Column:\n", df.isna().sum())
print("\nDuplicate Rows:", df.duplicated().sum())
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -------
 0   Serial No.         500 non-null    int64
 1   GRE Score          500 non-null    int64
 2   TOEFL Score        500 non-null    int64
 3   University Rating  500 non-null    int64
 4   SOP                500 non-null    float64
 5   LOR                500 non-null    float64
 6   CGPA               500 non-null    float64
 7   Research           500 non-null    int64
 8   Chance of Admit    500 non-null    float64
dtypes: float64(4), int64(5)
memory usage: 35.3 KB
None
Missing Values per Column:
Serial No. 0
GRE Score 0
TOEFL Score 0
University Rating 0
SOP 0
LOR 0
CGPA 0
Research 0
Chance of Admit 0
dtype: int64
Duplicate Rows: 0
#2. Data Cleaning and Optimization
# Drop "Serial No." column
df = df.drop(columns=["Serial No."])
# Rename columns for consistency (the raw headers carry trailing spaces)
df.rename(columns={'LOR ': 'LOR', 'Chance of Admit ': 'Chance of Admit'}, inplace=True)
# Optimize Data Types
df['GRE Score'] = df['GRE Score'].astype('int16')
df['TOEFL Score'] = df['TOEFL Score'].astype('int8')
df['University Rating'] = df['University Rating'].astype('int8')
df['SOP'] = df['SOP'].astype('float32')
df['LOR'] = df['LOR'].astype('float32')
df['CGPA'] = df['CGPA'].astype('float32')
df['Research'] = df['Research'].astype('bool')
df['Chance of Admit'] = df['Chance of Admit'].astype('float32')
print("Optimized Dataset Information:")
print(df.info())
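A quick footprint comparison makes the savings from downcasting concrete. A minimal sketch, assuming the raw CSV is re-read into a throwaway frame (raw is a name introduced here for illustration):
# Compare memory before and after downcasting (deep=True also counts object data).
raw = pd.read_csv('Jamboree_Admission.csv').drop(columns=["Serial No."])
print("Raw frame:      ", raw.memory_usage(deep=True).sum(), "bytes")
print("Optimized frame:", df.memory_usage(deep=True).sum(), "bytes")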
#3. Exploratory Data Analysis (EDA)
Summary Statistics
print("\nSummary Statistics:")
print(df.describe())
Summary Statistics:
       Serial No.   GRE Score  TOEFL Score  University Rating         SOP        LOR        CGPA    Research  Chance of Admit
count  500.000000  500.000000   500.000000         500.000000  500.000000  500.00000  500.000000  500.000000        500.00000
mean   250.500000  316.472000   107.192000           3.114000    3.374000    3.48400    8.576440    0.560000          0.72174
std    144.481833   11.295148     6.081868           1.143512    0.991004    0.92545    0.604813    0.496884          0.14114
min      1.000000  290.000000    92.000000           1.000000    1.000000    1.00000    6.800000    0.000000          0.34000
25%    125.750000  308.000000   103.000000           2.000000    2.500000    3.00000    8.127500    0.000000          0.63000
50%    250.500000  317.000000   107.000000           3.000000    3.500000    3.50000    8.560000    1.000000          0.72000
75%    375.250000  325.000000   112.000000           4.000000    4.000000    4.00000    9.040000    1.000000          0.82000
max    500.000000  340.000000   120.000000           5.000000    5.000000    5.00000    9.920000    1.000000          0.97000
Check Distributions of Numerical Variables
# Rename columns to remove any trailing spaces
df.rename(columns=lambda x: x.strip(), inplace=True)
# Visualize numerical distributions
numerical_columns = ['GRE Score', 'TOEFL Score', 'CGPA', 'Chance of Admit']
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for col, ax in zip(numerical_columns, axes.flatten()):
sns.histplot(df[col], kde=True, ax=ax)
ax.set_title(f"Distribution of {col}")
plt.tight_layout()
plt.show()
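A numeric skewness check complements the histograms; a minimal sketch reusing numerical_columns from above:
# Values near 0 indicate roughly symmetric distributions.
print(df[numerical_columns].skew())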
Categorical Variables
# Pie chart for Research; count plots for the ordinal variables
categorical_columns = ['University Rating', 'SOP', 'LOR', 'Research']
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for col, ax in zip(categorical_columns, axes.flatten()):
    if col == 'Research':
        # value_counts() sorts descending, so the first slice is the
        # majority class (Research == True, 56%); label accordingly.
        data = df[col].value_counts()
        ax.pie(data, labels=['Research', 'No Research'], autopct='%.1f%%', startangle=90)
        ax.set_title("Research Experience")
    else:
        # Assign x to hue with legend=False to avoid seaborn's
        # FutureWarning about passing `palette` without `hue`.
        sns.countplot(x=df[col], hue=df[col], palette='coolwarm', legend=False, ax=ax)
        ax.set_title(col)
plt.tight_layout()
plt.show()
#4. Insights from Correlation Analysis
Heatmap
# Correlation Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()
Key Insights: GRE Score, TOEFL Score, and CGPA show strong positive correlations with Chance of Admit.
Research is positively correlated, but more weakly than the numerical scores.
GRE Score, TOEFL Score, and CGPA are also strongly correlated with one another, so multicollinearity among the predictors should be checked before interpreting individual coefficients.
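Since the score variables track each other closely, a variance inflation factor (VIF) check is a natural follow-up. A minimal sketch; variance_inflation_factor is an extra statsmodels import beyond those at the top:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF above roughly 5-10 is a common rule-of-thumb flag for multicollinearity.
X_vif = add_constant(df.drop(columns=['Chance of Admit']).astype('float64'))
vif = pd.Series([variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
                index=X_vif.columns)
print(vif)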
#5. Feature Engineering
# Separate dependent and independent variables
X = df.drop(columns=['Chance of Admit'])
y = df['Chance of Admit']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features; fit the scaler on the training split only
scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
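Fitting the scaler on the training split only keeps test-set information out of the preprocessing; a quick sanity check (a minimal sketch) shows the learned per-column ranges come from X_train alone:
# These min/max values were learned from X_train; X_test reuses them.
print(pd.DataFrame({'data_min': scaler.data_min_, 'data_max': scaler.data_max_},
                   index=X_train.columns))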
#6. Modeling
Train a Linear Regression Model
# Linear Regression
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train)
y_pred = linear_model.predict(X_test_scaled)
# Evaluate the model; np.sqrt(MSE) gives RMSE without the deprecated squared=False flag
print("Linear Regression Performance:")
print("MAE:", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score:", r2_score(y_test, y_pred))
Linear Regression Performance:
MAE: 0.043258852595452944
RMSE: 0.05959178252996559
R2 Score: 0.826348139603975
Visualize Results
plt.scatter(y_test, y_pred, alpha=0.7, color='blue')
plt.plot([0, 1], [0, 1], '--', color='red')
plt.title("Actual vs Predicted - Linear Regression")
plt.xlabel("Actual Chance of Admit")
plt.ylabel("Predicted Chance of Admit")
plt.show()
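As a further diagnostic not run above, a residual plot can reveal curvature or non-constant variance that the actual-vs-predicted scatter hides; a minimal sketch:
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.7)
plt.axhline(0, color='red', linestyle='--')  # residuals should hover around zero
plt.title("Residuals vs Predicted - Linear Regression")
plt.xlabel("Predicted Chance of Admit")
plt.ylabel("Residual")
plt.show()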
Compare with Ridge and Lasso Regression
ridge_model = Ridge(alpha=0.1)
ridge_model.fit(X_train_scaled, y_train)
y_pred_ridge = ridge_model.predict(X_test_scaled)
lasso_model = Lasso(alpha=0.01)
lasso_model.fit(X_train_scaled, y_train)
y_pred_lasso = lasso_model.predict(X_test_scaled)
# Compare performances; np.sqrt(MSE) gives RMSE without the deprecated squared=False flag
def evaluate_model(model_name, y_pred):
    print(f"{model_name} Performance:")
    print("MAE:", mean_absolute_error(y_test, y_pred))
    print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
    print("R2 Score:", r2_score(y_test, y_pred))
    print("-" * 30)
evaluate_model("Ridge Regression", y_pred_ridge)
evaluate_model("Lasso Regression", y_pred_lasso)
Ridge Regression Performance:
MAE: 0.04333556816620076
RMSE: 0.0596480092429131
R2 Score: 0.8260202930737093
------------------------------
Lasso Regression Performance:
MAE: 0.06179644627856106
RMSE: 0.0797955620364342
R2 Score: 0.6886390356620822
------------------------------
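The alphas above (0.1 for Ridge, 0.01 for Lasso) are fixed guesses; cross-validation can choose them from the training data instead. A minimal sketch using scikit-learn's RidgeCV and LassoCV:
from sklearn.linear_model import RidgeCV, LassoCV
# Search a log-spaced grid of alphas; .alpha_ holds the selected value.
ridge_cv = RidgeCV(alphas=np.logspace(-3, 2, 50)).fit(X_train_scaled, y_train)
lasso_cv = LassoCV(alphas=np.logspace(-4, 0, 50), cv=5).fit(X_train_scaled, y_train)
print("Best Ridge alpha:", ridge_cv.alpha_)
print("Best Lasso alpha:", lasso_cv.alpha_)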
#7. Key Insights
CGPA, GRE Score, and TOEFL Score are the most significant predictors of admission chances.
Research experience provides a slight boost but is less impactful than the test scores.
SOP and LOR contribute only marginally to the prediction.
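statsmodels (OLS, add_constant) is imported at the top but never used; an OLS fit on the scaled training data supplies the p-values behind these significance claims. A minimal sketch:
# The summary's coefficient table reports t-statistics and p-values per feature.
ols_model = OLS(y_train.reset_index(drop=True), add_constant(X_train_scaled)).fit()
print(ols_model.summary())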
Feature Importance (Linear Model Coefficients)
coefficients = pd.Series(linear_model.coef_, index=X_train_scaled.columns).sort_values(ascending=False)
coefficients.plot(kind='barh', title='Feature Importance')
plt.show()
#8. Recommendations
Emphasize Academic Excellence: Students should focus on improving CGPA, GRE Score, and TOEFL Score to maximize admission chances.
Encourage Research Participation: Research experience, while less significant, can be a differentiator in competitive scenarios.
Refine the Prediction Model: Consider dropping or de-emphasizing SOP in assessments, as its contribution is minimal (see the sketch below).
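Before actually dropping SOP, a quick refit without it (a minimal sketch reusing the split above; cols_no_sop is a name introduced here) shows how little the test R2 changes:
# Refit the linear model without SOP and compare test R2 to the full model.
cols_no_sop = [c for c in X_train_scaled.columns if c != 'SOP']
model_no_sop = LinearRegression().fit(X_train_scaled[cols_no_sop], y_train)
print("R2 without SOP:", r2_score(y_test, model_no_sop.predict(X_test_scaled[cols_no_sop])))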