
Jamboree Case Study

This case study analyzes a dataset of 500 admission records to identify the factors that influence the Chance of Admit. GRE Score, TOEFL Score, and CGPA emerge as the most significant predictors, while research experience has a smaller positive impact; the recommendations center on academic excellence and research participation.

#1. Load and Explore the Data

!gdown https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/001/839/original/Jamboree_Admission.csv

Downloading...
From: https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/001/839/original/Jamboree_Admission.csv
To: /content/Jamboree_Admission.csv
100% 16.2k/16.2k [00:00<00:00, 44.4MB/s]

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.api import OLS, add_constant

# Load the dataset
df = pd.read_csv('Jamboree_Admission.csv')

# Check dataset information
print("Dataset Information:")
print(df.info())

# Check for missing values and duplicates
print("\nMissing Values per Column:\n", df.isna().sum())
print("\nDuplicate Rows:", df.duplicated().sum())

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Serial No.         500 non-null    int64
 1   GRE Score          500 non-null    int64
 2   TOEFL Score        500 non-null    int64
 3   University Rating  500 non-null    int64
 4   SOP                500 non-null    float64
 5   LOR                500 non-null    float64
 6   CGPA               500 non-null    float64
 7   Research           500 non-null    int64
 8   Chance of Admit    500 non-null    float64
dtypes: float64(4), int64(5)
memory usage: 35.3 KB
None

Missing Values per Column:
Serial No.           0
GRE Score            0
TOEFL Score          0
University Rating    0
SOP                  0
LOR                  0
CGPA                 0
Research             0
Chance of Admit      0
dtype: int64

Duplicate Rows: 0

#2. Data Cleaning and Optimization

# Drop "Serial No." column


df = df.drop(columns=["Serial No."])

# Rename columns for consistency


df.rename(columns={'LOR ': 'LOR', 'Chance of Admit ': 'Chance of
Admit'}, inplace=True)

# Optimize data types to reduce memory usage
df['GRE Score'] = df['GRE Score'].astype('int16')
df['TOEFL Score'] = df['TOEFL Score'].astype('int8')
df['University Rating'] = df['University Rating'].astype('int8')
df['SOP'] = df['SOP'].astype('float32')
df['LOR'] = df['LOR'].astype('float32')
df['CGPA'] = df['CGPA'].astype('float32')
df['Research'] = df['Research'].astype('bool')
df['Chance of Admit'] = df['Chance of Admit'].astype('float32')

print("Optimized Dataset Information:")
print(df.info())
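As a quick sanity check (an addition to the original notebook; df.info() above reported about 35.3 KB before optimization), the reduced footprint can be confirmed directly:

# Confirm the memory savings from downcasting
print(f"Memory after downcasting: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")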

#3. Exploratory Data Analysis (EDA)

Summary Statistics

print("\nSummary Statistics:")
print(df.describe())

Summary Statistics:
       Serial No.   GRE Score  TOEFL Score  University Rating         SOP
count  500.000000  500.000000   500.000000         500.000000  500.000000
mean   250.500000  316.472000   107.192000           3.114000    3.374000
std    144.481833   11.295148     6.081868           1.143512    0.991004
min      1.000000  290.000000    92.000000           1.000000    1.000000
25%    125.750000  308.000000   103.000000           2.000000    2.500000
50%    250.500000  317.000000   107.000000           3.000000    3.500000
75%    375.250000  325.000000   112.000000           4.000000    4.000000
max    500.000000  340.000000   120.000000           5.000000    5.000000

             LOR        CGPA    Research  Chance of Admit
count  500.00000  500.000000  500.000000        500.00000
mean     3.48400    8.576440    0.560000          0.72174
std      0.92545    0.604813    0.496884          0.14114
min      1.00000    6.800000    0.000000          0.34000
25%      3.00000    8.127500    0.000000          0.63000
50%      3.50000    8.560000    1.000000          0.72000
75%      4.00000    9.040000    1.000000          0.82000
max      5.00000    9.920000    1.000000          0.97000

Check Distributions of Numerical Variables

# Rename columns to remove any trailing spaces
df.rename(columns=lambda x: x.strip(), inplace=True)

# Visualize numerical distributions
numerical_columns = ['GRE Score', 'TOEFL Score', 'CGPA', 'Chance of Admit']
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for col, ax in zip(numerical_columns, axes.flatten()):
    sns.histplot(df[col], kde=True, ax=ax)
    ax.set_title(f"Distribution of {col}")
plt.tight_layout()
plt.show()
Categorical Variables

# Pie chart for Research; count plots for the ordinal variables
categorical_columns = ['University Rating', 'SOP', 'LOR', 'Research']
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for col, ax in zip(categorical_columns, axes.flatten()):
    if col == 'Research':
        data = df[col].value_counts()
        # Label slices by the actual value_counts order to avoid mislabeling
        labels = ['Research' if v else 'No Research' for v in data.index]
        ax.pie(data, labels=labels, autopct='%.1f%%', startangle=90)
        ax.set_title("Research Experience")
    else:
        # Assign x to hue with legend=False, as the seaborn FutureWarning advises
        sns.countplot(x=df[col], hue=df[col], palette='coolwarm', legend=False, ax=ax)
        ax.set_title(col)
plt.tight_layout()
plt.show()
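As a numeric companion to the pie chart (an addition, not in the original notebook), the average admit chance can also be split by research status:

# Mean Chance of Admit for applicants with and without research experience
print(df.groupby('Research')['Chance of Admit'].mean())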

<ipython-input-8-a09489fe7f28>:10: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be


removed in v0.14.0. Assign the `x` variable to `hue` and set
`legend=False` for the same effect.

sns.countplot(x=df[col], ax=ax, palette='coolwarm')


<ipython-input-8-a09489fe7f28>:10: FutureWarning:
Passing `palette` without assigning `hue` is deprecated and will be
removed in v0.14.0. Assign the `x` variable to `hue` and set
`legend=False` for the same effect.

sns.countplot(x=df[col], ax=ax, palette='coolwarm')


<ipython-input-8-a09489fe7f28>:10: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be


removed in v0.14.0. Assign the `x` variable to `hue` and set
`legend=False` for the same effect.

sns.countplot(x=df[col], ax=ax, palette='coolwarm')

#4. Insights from Correlation Analysis

Heatmap

# Correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()

Key Insights:

GRE Score, TOEFL Score, and CGPA show strong positive correlations with Chance of Admit.

Research is positively correlated, but more weakly than the numerical scores.

The score-based predictors (GRE Score, TOEFL Score, and CGPA) are themselves strongly correlated, so multicollinearity should be checked before interpreting individual coefficients; see the VIF sketch below.
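A minimal sketch of such a check, using statsmodels' variance_inflation_factor (this import and the VIF computation are additions, not part of the original notebook; add_constant is already imported above):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF per predictor (ignore the 'const' row); values above roughly 5-10 flag multicollinearity
X_vif = add_constant(df.drop(columns=['Chance of Admit']).astype('float64'))
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif)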

#5. Feature Engineering

# Separate dependent and independent variables
X = df.drop(columns=['Chance of Admit'])
y = df['Chance of Admit']

# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features to [0, 1]; fit the scaler on the training set only
scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

#6. Modeling

Train a Linear Regression Model

# Linear Regression
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train)
y_pred = linear_model.predict(X_test_scaled)

# Evaluate the model (RMSE via np.sqrt avoids the deprecated `squared=False` argument)
print("Linear Regression Performance:")
print("MAE:", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score:", r2_score(y_test, y_pred))

Linear Regression Performance:
MAE: 0.043258852595452944
RMSE: 0.05959178252996559
R2 Score: 0.826348139603975
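The OLS and add_constant imports at the top of the notebook are otherwise unused; a minimal sketch of how they could complement the sklearn fit with p-values and confidence intervals (an addition, relying on X_train_scaled preserving the row order of y_train):

# Fit the same regression with statsmodels to inspect coefficient significance
X_train_sm = add_constant(X_train_scaled)
ols_model = OLS(np.asarray(y_train), X_train_sm).fit()
print(ols_model.summary())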


Visualize Results

plt.scatter(y_test, y_pred, alpha=0.7, color='blue')
plt.plot([0, 1], [0, 1], '--', color='red')
plt.title("Actual vs Predicted - Linear Regression")
plt.xlabel("Actual Chance of Admit")
plt.ylabel("Predicted Chance of Admit")
plt.show()
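A residual plot is a natural companion check here (an addition to the original notebook); roughly patternless residuals around zero support the linear fit:

# Residuals vs predicted values for the linear model
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.7, color='blue')
plt.axhline(0, color='red', linestyle='--')
plt.title("Residuals vs Predicted - Linear Regression")
plt.xlabel("Predicted Chance of Admit")
plt.ylabel("Residual")
plt.show()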
Compare with Ridge and Lasso Regression

ridge_model = Ridge(alpha=0.1)
ridge_model.fit(X_train_scaled, y_train)
y_pred_ridge = ridge_model.predict(X_test_scaled)

lasso_model = Lasso(alpha=0.01)
lasso_model.fit(X_train_scaled, y_train)
y_pred_lasso = lasso_model.predict(X_test_scaled)

# Compare performances
def evaluate_model(model_name, y_pred):
    print(f"{model_name} Performance:")
    print("MAE:", mean_absolute_error(y_test, y_pred))
    print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
    print("R2 Score:", r2_score(y_test, y_pred))
    print("-" * 30)

evaluate_model("Ridge Regression", y_pred_ridge)
evaluate_model("Lasso Regression", y_pred_lasso)

Ridge Regression Performance:
MAE: 0.04333556816620076
RMSE: 0.0596480092429131
R2 Score: 0.8260202930737093
------------------------------
Lasso Regression Performance:
MAE: 0.06179644627856106
RMSE: 0.0797955620364342
R2 Score: 0.6886390356620822
------------------------------
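The drop in Lasso's R2 suggests that alpha=0.01 over-penalizes this small feature set. A sketch of tuning the regularization strength with cross-validation (RidgeCV/LassoCV are additions; the alpha grids are illustrative assumptions):

from sklearn.linear_model import RidgeCV, LassoCV

# Search illustrative alpha grids with cross-validation
ridge_cv = RidgeCV(alphas=np.logspace(-3, 2, 50)).fit(X_train_scaled, y_train)
lasso_cv = LassoCV(alphas=np.logspace(-4, 0, 50), cv=5, random_state=42).fit(X_train_scaled, y_train)
print("Best Ridge alpha:", ridge_cv.alpha_)
print("Best Lasso alpha:", lasso_cv.alpha_)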


#7. Key Insights

CGPA, GRE Score, and TOEFL Score are the most significant predictors of admission chances.

Research experience provides a slight boost but is less impactful than the test scores and CGPA.

SOP and LOR make only minor contributions to the prediction.

Feature Importance (Linear Model Coefficients)

coefficients = pd.Series(linear_model.coef_, index=X_train_scaled.columns).sort_values(ascending=False)
coefficients.plot(kind='barh', title='Feature Importance')
plt.show()
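Because the scaled predictors are correlated, raw coefficients can be unstable as importance measures; a complementary check (an addition, using sklearn's permutation_importance) is:

from sklearn.inspection import permutation_importance

# Permutation importance on the held-out test set
result = permutation_importance(linear_model, X_test_scaled, y_test, n_repeats=30, random_state=42)
importance = pd.Series(result.importances_mean, index=X_test_scaled.columns).sort_values()
importance.plot(kind='barh', title='Permutation Importance (test set)')
plt.show()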
#8. Recommendations

Emphasize Academic Excellence: Students should focus on improving CGPA, GRE Score, and TOEFL Score to maximize admission chances.

Encourage Research Participation: Research experience, while less significant, can be a differentiator in competitive scenarios.

Refine the Prediction Model: Consider dropping or de-emphasizing SOP in assessments, as its contribution is minimal.
