Step 1: Exploratory Data Analysis (EDA)
Let's begin by examining the dataset to understand its structure and the relationships between the features and the target variable, which takes one of three values (Dropout, Enrolled, or Graduate).
import pandas as pd

# Load the dataset from Google Drive
path = '/content/drive/MyDrive/dataset.csv'
data = pd.read_csv(path)
data
      Marital status  Application mode  Application order  Course  Daytime/evening attendance  Previous qualification  ...
0                  1                 8                  5       2                           1                       1  ...
1                  1                 6                  1      11                           1                       1  ...
2                  1                 1                  5       5                           1                       1  ...
3                  1                 8                  2      15                           1                       1  ...
4                  2                12                  1       3                           0                       1  ...
...              ...               ...                ...     ...                         ...                     ...  ...
4419               1                 1                  6      15                           1                       1  ...
4420               1                 1                  2      15                           1                       1  ...
4421               1                 1                  1      12                           1                       1  ...
4422               1                 1                  1       9                           1                       1  ...
4423               1                 5                  1      15                           1                       1  ...
4424 rows × 35 columns
# Display summary statistics
data.describe()
       Marital status  Application mode  Application order       Course  Daytime/evening attendance  Previous qualification  ...
count     4424.000000       4424.000000        4424.000000  4424.000000                 4424.000000              4424.00000  ...
mean         1.178571          6.886980           1.727848     9.899186                    0.890823                 2.53142  ...
std          0.605747          5.298964           1.313793     4.331792                    0.311897                 3.96370  ...
min          1.000000          1.000000           0.000000     1.000000                    0.000000                 1.00000  ...
25%          1.000000          1.000000           1.000000     6.000000                    1.000000                 1.00000  ...
50%          1.000000          8.000000           1.000000    10.000000                    1.000000                 1.00000  ...
75%          1.000000         12.000000           2.000000    13.000000                    1.000000                 1.00000  ...
max          6.000000         18.000000           9.000000    17.000000                    1.000000                17.00000  ...
8 rows × 34 columns
# Display data types of each column
data.dtypes
Marital status int64
Application mode int64
Application order int64
Course int64
Daytime/evening attendance int64
Previous qualification int64
Nacionality int64
Mother's qualification int64
Father's qualification int64
Mother's occupation int64
Father's occupation int64
Displaced int64
Educational special needs int64
Debtor int64
Tuition fees up to date int64
Gender int64
Scholarship holder int64
Age at enrollment int64
International int64
Curricular units 1st sem (credited) int64
Curricular units 1st sem (enrolled) int64
Curricular units 1st sem (evaluations) int64
Curricular units 1st sem (approved) int64
Curricular units 1st sem (grade) float64
Curricular units 1st sem (without evaluations) int64
Curricular units 2nd sem (credited) int64
Curricular units 2nd sem (enrolled) int64
Curricular units 2nd sem (evaluations) int64
Curricular units 2nd sem (approved) int64
Curricular units 2nd sem (grade) float64
Curricular units 2nd sem (without evaluations) int64
# Check for missing values
data.isnull().sum()
Application mode 0
Application order 0
Course 0
Daytime/evening attendance 0
Previous qualification 0
Nacionality 0
Mother's qualification 0
Father's qualification 0
Mother's occupation 0
Father's occupation 0
Displaced 0
Educational special needs 0
Debtor 0
Tuition fees up to date 0
Gender 0
Scholarship holder 0
Age at enrollment 0
International 0
Curricular units 1st sem (credited) 0
Curricular units 1st sem (enrolled) 0
Curricular units 1st sem (evaluations) 0
Curricular units 1st sem (approved) 0
Curricular units 1st sem (grade) 0
Curricular units 1st sem (without evaluations) 0
Curricular units 2nd sem (credited) 0
Curricular units 2nd sem (enrolled) 0
Curricular units 2nd sem (evaluations) 0
Curricular units 2nd sem (approved) 0
Curricular units 2nd sem (grade) 0
Curricular units 2nd sem (without evaluations) 0
Unemployment rate 0
Inflation rate 0
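Since the goal of this step is to relate the features to the target, two further checks that may help are the class balance of the target and the per-class mean of each numeric feature. The following is only a minimal sketch: `toy` is a hypothetical stand-in for `data`, and its values are made up for illustration (only the column names come from the dataset).

```python
import pandas as pd

# Hypothetical stand-in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "Curricular units 2nd sem (grade)": [0.0, 11.5, 13.0, 12.0, 0.0, 14.0],
    "Age at enrollment": [20, 19, 18, 22, 30, 19],
    "Target": ["Dropout", "Enrolled", "Graduate", "Graduate", "Dropout", "Graduate"],
})

# Share of each class in the target column.
class_share = toy["Target"].value_counts(normalize=True)
print(class_share)

# Mean of each numeric feature within each target class.
per_class_mean = toy.groupby("Target").mean(numeric_only=True)
print(per_class_mean)
```

On the real dataset the same two calls could be applied to `data` directly; a strongly uneven class share would be worth keeping in mind when reading the accuracy later on.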
Step 2: Data Visualization
We will create various charts to visualize the data.
Scatter Plot
Let's create a scatter plot to see the relationship between "Curricular units 2nd sem (grade)" and "Target".
import matplotlib.pyplot as plt
plt.scatter(data['Curricular units 2nd sem (grade)'], data['Target'])
plt.xlabel('Curricular units 2nd sem (grade)')
plt.ylabel('Target')
plt.title('Scatter Plot of Curricular units 2nd sem (grade) vs. Target')
plt.show()
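Because "Target" is categorical, a scatter plot places every point on one of three horizontal lines, where many points overlap. One alternative that may be easier to read is a box plot of the grade grouped by target class. This is a sketch under assumptions: `toy` is a hypothetical stand-in for `data` with made-up values.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "Curricular units 2nd sem (grade)": [0.0, 10.5, 11.5, 12.0, 13.0, 14.0],
    "Target": ["Dropout", "Dropout", "Enrolled", "Enrolled", "Graduate", "Graduate"],
})

# One box per target class, so the grade distributions can be compared directly.
toy.boxplot(column="Curricular units 2nd sem (grade)", by="Target")
plt.suptitle("")  # drop the automatic "Boxplot grouped by ..." super-title
plt.title("Curricular units 2nd sem (grade) by Target")
plt.xlabel("Target")
plt.ylabel("Curricular units 2nd sem (grade)")
plt.show()
```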
Bar Chart
Let's create a bar chart for the "Marital status" feature.
data['Marital status'].value_counts().plot(kind='bar')
plt.xlabel('Marital Status')
plt.ylabel('Count')
plt.title('Bar Chart of Marital Status')
plt.show()
Box Plot
Let's create a box plot for the "Curricular units 2nd sem (grade)" feature.
data.boxplot(column='Curricular units 2nd sem (grade)')
plt.title('Box Plot of Curricular units 2nd sem (grade)')
plt.show()
Histogram
Let's create a histogram for the "Curricular units 2nd sem (grade)" feature.
data['Curricular units 2nd sem (grade)'].hist()
plt.xlabel('Curricular units 2nd sem (grade)')
plt.ylabel('Frequency')
plt.title('Histogram of Curricular units 2nd sem (grade)')
plt.show()
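Step 3: Data Preprocessing
We will preprocess the data by encoding the target variable as integers and splitting the data into training and testing sets. The missing-value check in Step 1 reported no missing values in the columns shown, and the features are already numeric, so no imputation or feature encoding is performed here.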
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# Encode the target variable
label_encoder = LabelEncoder()
data['Target'] = label_encoder.fit_transform(data['Target'])
# Define the features (X) and the target (y)
X = data.drop('Target', axis=1)
y = data['Target']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
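Two details of this step can be made explicit: which integer the encoder assigns to each label, and whether both splits keep the same class proportions. The sketch below uses hypothetical toy labels (`labels`, `features`) rather than the real dataset, and adds `stratify`, which the original split does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels with an uneven class mix (illustrative only).
labels = pd.Series(["Dropout"] * 30 + ["Enrolled"] * 10 + ["Graduate"] * 60)
features = pd.DataFrame({"feature": range(len(labels))})

encoder = LabelEncoder()
y_toy = encoder.fit_transform(labels)

# classes_ lists the original labels in the order of their integer codes.
print(dict(enumerate(encoder.classes_)))

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
test_counts = pd.Series(y_te).value_counts().sort_index()
print(test_counts)
```

Without `stratify`, a random split can leave a small class under-represented in the test set, which makes its per-class scores noisier.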
Step 4: Model Building
We will build and train a decision tree classifier to predict each student's outcome (Dropout, Enrolled, or Graduate).
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
# Build and train the model
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')
Accuracy: 0.6813559322033899
Classification Report:
              precision    recall  f1-score   support

           0       0.77      0.66      0.71       316
           1       0.34      0.39      0.36       151
           2       0.76      0.81      0.78       418

    accuracy                           0.68       885
   macro avg       0.62      0.62      0.62       885
weighted avg       0.69      0.68      0.68       885
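The report shows that class 1 scores far lower (f1-score 0.36) than the other two classes, and the figures come from a single train/test split of an unconstrained tree. A possible next step is to estimate performance with cross-validation and to limit the tree depth. The sketch below runs on synthetic data from `make_classification`, not on the student dataset, and `max_depth=5` is an arbitrary illustrative value rather than a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for X and y (illustrative only).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)

# Depth-limited tree; random_state makes the result repeatable.
tree = DecisionTreeClassifier(max_depth=5, random_state=42)

# Macro-averaged F1 weights the three classes equally, so a weak
# minority class is not hidden behind the overall accuracy.
scores = cross_val_score(tree, X_syn, y_syn, cv=5, scoring="f1_macro")
print(scores.mean(), scores.std())
```

Comparing the mean score across a few depth values would indicate whether the unconstrained tree is overfitting.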
def get_user_input_and_predict(model, feature_columns):
    user_input = {}
    for column in feature_columns:
        user_input[column] = [input(f"Enter value for {column}: ")]

    # Create a DataFrame for user inputs
    input_df = pd.DataFrame(user_input)

    # Handle any necessary preprocessing (e.g., converting to numeric)
    for column in feature_columns:
        if X[column].dtype in ['int64', 'float64']:
            input_df[column] = pd.to_numeric(input_df[column])

    # Predict using the trained model
    prediction = model.predict(input_df)

    # Decode the prediction
    decoded_prediction = label_encoder.inverse_transform(prediction)
    return decoded_prediction[0]
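The function above prompts for every feature, which makes it hard to reuse or test. A non-interactive variant could take the values as a dictionary instead. The helper `predict_from_dict` below is hypothetical and not part of the original notebook, and the toy model is trained on made-up data purely to make the sketch runnable.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier


def predict_from_dict(model, encoder, feature_columns, values):
    """Return the decoded label for one record given as a dict (hypothetical helper)."""
    missing = [c for c in feature_columns if c not in values]
    if missing:
        raise ValueError(f"Missing values for: {missing}")
    # Build a one-row frame with the columns in the training order.
    row = pd.DataFrame([{c: values[c] for c in feature_columns}])
    # Convert string inputs to numbers, column by column.
    row = row.apply(pd.to_numeric)
    return encoder.inverse_transform(model.predict(row))[0]


# Toy training data (illustrative only).
X_toy = pd.DataFrame({"Debtor": [0, 0, 1, 1], "Age at enrollment": [18, 19, 30, 35]})
enc = LabelEncoder()
y_toy = enc.fit_transform(["Graduate", "Graduate", "Dropout", "Dropout"])
clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

result = predict_from_dict(
    clf, enc, X_toy.columns, {"Debtor": "1", "Age at enrollment": "33"}
)
print(result)
```

Passing the model, encoder, and column list as arguments also avoids relying on the notebook's global variables.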
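pred = model.predict(X_test)
# Dictionary for mapping encoded target values to original labels
# (LabelEncoder assigns the codes in alphabetical order of the class names)
target_mapping = {0: 'Dropout', 1: 'Enrolled', 2: 'Graduate'}
# Compare the first prediction with the true label of the first test row
output = target_mapping[pred[0]]
original = target_mapping[y_test.iloc[0]]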
Comparing Values
print(f"Original Value: '{original}' and Predicted Value: '{output}'")
Original Value: 'Dropout' and Predicted Value: 'Dropout'
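A single row says little about overall agreement, and a hard-coded mapping can drift out of sync with the encoder. As a sketch, the mapping can be derived from the fitted encoder and all rows compared at once. The labels and predictions below are toy values for illustration, not results from the model above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# Toy true labels and predicted codes (illustrative only).
y_true_codes = encoder.fit_transform(["Dropout", "Graduate", "Enrolled", "Graduate"])
y_pred_codes = [0, 2, 2, 2]

# Build the code -> label mapping from the fitted encoder instead of hard-coding it.
mapping = {code: str(label) for code, label in enumerate(encoder.classes_)}

comparison = pd.DataFrame({
    "actual": [mapping[c] for c in y_true_codes],
    "predicted": [mapping[c] for c in y_pred_codes],
})
comparison["correct"] = comparison["actual"] == comparison["predicted"]
print(comparison)
```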
feature_columns = X.columns
# Predict on user inputs. The function already returns the decoded label,
# so looking it up in target_mapping (keyed by integer codes) would raise a KeyError.
predicted_class = get_user_input_and_predict(model, feature_columns)
print(f"The predicted class is: {predicted_class}")
Enter value for Marital status: 1
Enter value for Application mode: 8
Enter value for Application order: 5
Enter value for Course: 2
Enter value for Daytime/evening attendance: 1
Enter value for Previous qualification: 1
Enter value for Nacionality: 1
Enter value for Mother's qualification: 1
Enter value for Father's qualification: 10
Enter value for Mother's occupation: 6
Enter value for Father's occupation: 10
Enter value for Displaced: 1
Enter value for Educational special needs: 0
Enter value for Debtor: 0
Enter value for Tuition fees up to date: 1
Enter value for Gender: 1
Enter value for Scholarship holder: 0
Enter value for Age at enrollment: 20
Enter value for International: 0
Enter value for Curricular units 1st sem (credited): 0
Enter value for Curricular units 1st sem (enrolled): 0
Enter value for Curricular units 1st sem (evaluations): 0
Enter value for Curricular units 1st sem (approved): 0
Enter value for Curricular units 1st sem (grade): 0
Enter value for Curricular units 1st sem (without evaluations): 0
Enter value for Curricular units 2nd sem (credited): 0
Enter value for Curricular units 2nd sem (enrolled): 0
Enter value for Curricular units 2nd sem (evaluations): 0
Enter value for Curricular units 2nd sem (approved): 0
Enter value for Curricular units 2nd sem (grade): 0
Enter value for Curricular units 2nd sem (without evaluations): 0
Enter value for Unemployment rate: 10.8
Enter value for Inflation rate: 1.4
Enter value for GDP: 1.74
The predicted class is: Dropout