Comparing Different Resampling Methods in Predicting Students’ Performance Using Machine Learning Techniques
Ramin Ghorbani, Rouzbeh Ghousi +1 more
TL;DR: This paper compares various resampling techniques for handling the imbalanced data problem when predicting students’ performance on two different datasets; the Random Forest classifier achieved the best result among all models when SVM-SMOTE was used as the resampling method.
Abstract: Advances in technology have made predicting students' performance one of the most beneficial and essential research topics. Data mining is extremely helpful in education, especially for analyzing students' performance. Predicting students' performance has become a serious challenge because datasets in this field are imbalanced, and no comparison among different resampling methods exists. This paper compares various resampling techniques, such as Borderline SMOTE, Random Over Sampler, SMOTE, SMOTE-ENN, SVM-SMOTE, and SMOTE-Tomek, for handling the imbalanced data problem when predicting students' performance on two different datasets. It also examines the difference between multiclass and binary classification and the structure of the features. To better assess how well the resampling methods address the imbalance problem, the paper uses various machine learning classifiers, including Random Forest, K-Nearest-Neighbor, Artificial Neural Network, XGBoost, Support Vector Machine (Radial Basis Function), Decision Tree, Logistic Regression, and Naive Bayes. Random hold-out and shuffle 5-fold cross-validation are used as model validation techniques. The results across different evaluation metrics indicate that fewer classes and nominal features lead models to better performance. Classifiers do not perform well with imbalanced data, so solving this problem is necessary; their performance improves on balanced datasets. Additionally, the results of the Friedman test, a statistical significance test, confirm that SVM-SMOTE is more efficient than the other resampling methods. Moreover, the Random Forest classifier achieved the best result among all models when SVM-SMOTE was used as the resampling method.
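To make the resampling idea in the abstract concrete, the following is a minimal sketch of plain SMOTE-style oversampling using only the Python standard library. It is not the paper's code and it is not SVM-SMOTE (which additionally uses an SVM to choose the samples to interpolate from); the function name `smote_oversample`, its parameters, and the toy data are all assumptions made for illustration. Each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points built by interpolating between a
    minority sample and one of its k nearest minority neighbours.
    Hypothetical helper illustrating the basic SMOTE idea only."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical toy data: two features, 4 minority vs 10 majority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(3.0 + 0.1 * i, 3.0 - 0.1 * i) for i in range(10)]

# Generate just enough synthetic minority points to match the majority count.
new_points = smote_oversample(minority, n_new=len(majority) - len(minority), k=3)
balanced_minority = minority + new_points
print(len(balanced_minority), len(majority))
```

In a validation setup like the hold-out and 5-fold schemes the abstract mentions, oversampling of this kind is normally applied to the training portion only, so that synthetic points do not leak into the data used for evaluation.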
Citations
A Systematic Literature Review of Student’ Performance Prediction Using Machine Learning Techniques
TL;DR: The review results indicate that various machine learning techniques are used to understand and overcome the underlying challenges: predicting students at risk, predicting student dropout, and improving students’ performance.
236
Separating emission and meteorological contributions to long-term PM2.5 trends over eastern China during 2000–2018
Qingyang Xiao, Yixuan Zheng, Guannan Geng, Cuihong Chen, Xiaomeng Huang, Huizheng Che, Xiaoye Zhang, Kebin He, Qiang Zhang
TL;DR: In this article, a combination of a machine learning model, statistical method, and chemical transport model was used to quantify the meteorological impacts on PM2.5 pollution during 2000–2018.
Prediction of Students’ Academic Performance Based on Courses’ Grades Using Deep Neural Networks
TL;DR: In this paper, a dataset collected from a public 4-year university was used to develop predictive models to predict students' academic performance of upcoming courses given their grades in the previous courses of the first academic year using a deep neural network.
A mini-review of machine learning in big data analytics: Applications, challenges, and prospects
TL;DR: In this article, a comprehensive mini-literature review of ML in Big Data Analytics (BDA) using a keyword search was presented, in which a total of 1512 published articles were screened down to 140 based on the proposed novel taxonomy.
101
A comprehensive comparison among metaheuristics (MHs) for geohazard modeling using machine learning: Insights from a case study of landslide displacement prediction
TL;DR: In this article, a systematic framework combining k-fold cross-validation (CV), metaheuristics (MHs), support vector regression (SVR), and Friedman and Nemenyi tests was proposed to improve the reliability and performance of geohazard modeling.
98
References
Applied Logistic Regression.
TL;DR: Applied Logistic Regression, Third Edition provides an easily accessible introduction to the logistic regression model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables.
40.1K
SMOTE: synthetic minority over-sampling technique
TL;DR: In this article, a method of over-sampling the minority class involves creating synthetic minority class examples, which is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
Statistical Comparisons of Classifiers over Multiple Data Sets
TL;DR: A set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers is recommended: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparisons of more classifiers over multiple data sets.
Least Squares Support Vector Machine Classifiers
TL;DR: A least squares version for support vector machine (SVM) classifiers that follows from solving a set of linear equations, instead of quadratic programming for classical SVM's.
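
The abstract reports that a Friedman test was used to compare the resampling methods. As a rough illustration of how that statistic is computed, the sketch below ranks each method within every block (here, one row per classifier), averages the ranks per method, and evaluates the chi-square form of the Friedman statistic, 12N / (k(k+1)) * (sum of squared mean ranks - k(k+1)^2 / 4), for N blocks and k methods. The function names and the score table are hypothetical; the numbers are made up for the example and are not results from the paper.

```python
def average_ranks(scores):
    """Rank one row of scores (highest score gets rank 1), averaging ties."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of scores tied with position i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share their mean rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(table):
    """Return (mean rank per column, Friedman chi-square statistic).
    Rows are blocks (e.g. classifiers), columns are the compared methods."""
    n, k = len(table), len(table[0])
    rank_rows = [average_ranks(row) for row in table]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return mean_ranks, chi2


# Hypothetical accuracy table: 4 classifiers (rows) x 3 resampling methods (columns).
table = [
    [0.70, 0.80, 0.90],
    [0.65, 0.75, 0.85],
    [0.60, 0.72, 0.88],
    [0.71, 0.69, 0.90],
]
mean_ranks, chi2 = friedman_statistic(table)
print(mean_ranks, chi2)
```

The statistic is then compared against a chi-square distribution with k - 1 degrees of freedom; this sketch stops short of computing a p-value and applies no tie correction to the statistic itself.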