PREDICTION OF HIGH SCHOOL DROPOUT RATES USING MACHINE LEARNING
by Ira Chaturvedi
Recently, graduation rates have been increasing while dropout rates have remained stagnant. This project aims to lower these dropout rates by using machine learning to predict the number of students at risk of dropping out. For this project, student enrollment and dropout data for schools across California, along with the median income of families by ethnicity for each county in the state, were first collected and then preprocessed to extract the desired input variables for the machine learning algorithm. Once the data had been curated, feature engineering was performed and the data was split into separate datasets for training, testing, and validation of the model. The model was initially developed using linear regression, but its accuracy did not fall within an acceptable range. Instead, an alternative algorithm, XGBoost (Extreme Gradient Boosting) regression, was used, and it provided better prediction accuracy. The output is a predicted number of students likely to drop out, for each ethnicity and gender, among the total number of students enrolled in the school. This prototype's ability to predict these results can be very useful for school administrators, as they can use these numbers to see which students are at risk of dropping out. It can lower the number of students who drop out, since administrators can intervene earlier and prevent them from leaving. Additionally, the same algorithm could be extended to predict the likelihood of an individual student dropping out, given access to that dataset.
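The following is a minimal sketch of the pipeline described above: load the curated school-level data, engineer features, split it into training, validation, and test sets, and fit an XGBoost regressor. The file name, column names, and feature choices are hypothetical placeholders, not the project's actual schema.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

# Load preprocessed school-level data (hypothetical file and columns).
df = pd.read_csv("ca_dropout_by_school.csv")

# Example engineered features: enrollment counts plus county median income
# joined by ethnicity (all names are assumptions for illustration).
features = ["enrollment", "ethnicity_code", "gender_code", "median_income"]
target = "dropout_count"
X, y = df[features], df[target]

# Split into training, validation, and test sets (60/20/20).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# Fit an XGBoost regressor and check accuracy on held-out data.
model = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=4)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

print("Validation MAE:", mean_absolute_error(y_val, model.predict(X_val)))
print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))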
The Extreme Gradient Boosting algorithm (XGBoost) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. It works via parallel processing, tree pruning (reducing the size of the decision trees by removing sections that have little power to classify instances), handling of missing values (which occur during preprocessing), and regularization to avoid overfitting (when a model is overtrained on one dataset and achieves excellent accuracy on that data, but not on other datasets). Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models. It robustly handles a variety of data types, relationships, and distributions, and exposes a large number of hyperparameters that can be tweaked and tuned for improved fits. This flexibility makes XGBoost a solid choice for problems in regression, classification (binary and multiclass), and ranking. Additionally, it makes efficient use of hardware and is much faster than alternatives such as Logistic Regression, Random Forest, and standard Gradient Boosting, making it the best model for this project.
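As a rough illustration of how the features mentioned above map onto XGBoost's hyperparameters, the sketch below fits a regressor on synthetic data; the values shown are illustrative defaults, not the tuned settings used in this project.

import numpy as np
from xgboost import XGBRegressor

# Synthetic data with some missing values; XGBoost handles NaNs natively.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=500)
X[rng.random(X.shape) < 0.05] = np.nan

model = XGBRegressor(
    n_estimators=200,   # number of boosted trees (weak learners) in the ensemble
    max_depth=3,        # keeps each individual tree simple
    learning_rate=0.1,  # shrinks each tree's contribution
    gamma=1.0,          # minimum loss reduction required to split (tree pruning)
    reg_lambda=1.0,     # L2 regularization to curb overfitting
    reg_alpha=0.0,      # L1 regularization
    n_jobs=-1,          # parallel tree construction across CPU cores
)
model.fit(X, y)
print(model.predict(X[:5]))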
Below, I have attached the link to the website. Please contact me if you would like to see it, as I have stopped running the site; it is hosted on Amazon Cloud and keeping it running costs money.