In this paper, we address the nationwide issue of high school dropouts through a case study of Montgomery County Public Schools (MCPS), a large school system in the Washington, D.C. metro area. Using data from this school district, we create a scalable ranking system that will allow MCPS to target its interventions at the individuals most immediately at risk of dropping out of high school. Through our analysis, we show how machine learning techniques provide school administrators with a more effective tool than their current methods. We suggest pathways for improving current policy and additional avenues for future research.
We prepared the data by imputing missing values first at the individual level, then with class means where individual imputation wasn’t possible. We replaced the categorical features with binary features and generated some new features of our own. We tested a variety of machine learning classifiers on the first cohort using k-fold cross-validation and found that logistic regression and random forest gave the best results. We then tested those two classifiers on the second cohort of data and evaluated them using precision-recall curves; logistic regression was the best classifier in terms of both recall and precision.
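The two-stage imputation and the ranking-based evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the study. It assumes that "individual level" means filling a student's gaps from that student's own observed values, and that "class means" means per-column means over the whole group; both readings are assumptions. The function names `impute_two_stage` and `precision_recall_at_k` are hypothetical, as is the cutoff parameter `k`.

```python
from statistics import mean


def impute_two_stage(rows):
    """Fill missing values (None) in a table of per-student measurements.

    rows: list of equal-length lists, one row per student, one column per
    measurement period. Assumed layout, for illustration only.
    """
    n_cols = len(rows[0])
    # Group-level fallback: mean of each column over observed values only.
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed) if observed else None)

    filled = []
    for r in rows:
        own = [v for v in r if v is not None]
        if own:
            # Stage 1: use the student's own mean where any value exists.
            own_mean = mean(own)
            filled.append([v if v is not None else own_mean for v in r])
        else:
            # Stage 2: no individual data at all, fall back to group means.
            filled.append(list(col_means))
    return filled


def precision_recall_at_k(scores, labels, k):
    """Precision and recall when the k highest-scored students are flagged.

    scores: predicted risk per student; labels: 1 if the student dropped
    out, else 0.
    """
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    total_pos = sum(labels)
    precision = hits / k
    recall = hits / total_pos if total_pos else 0.0
    return precision, recall
```

Sweeping `k` from 1 to the number of students traces out one form of precision-recall curve for a ranked list, which matches the setting where a district can only intervene with a limited number of students.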