Software Defect Prediction Based on Optimized Machine Learning Models: A Comparative Study

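Questions

1. What is Software Defect Prediction (SDP)?

2. How do machine learning models perform in SDP?

3. What is the NASA MDP dataset used for?

4. What is the oversampling strategy used in this research?

5. How does PCA reduce dimensionality?

6. What machine learning models were used for classification?

7. How is stratified random subsampling used in model evaluation?

8. Which machine learning model achieved the highest accuracy on most datasets in software defect prediction?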

Accepted Answer

Software Defect Prediction (SDP) is a technique for identifying potential software defects before they occur. It is an important and challenging task in software engineering, because early defect detection is linked to both better software quality and lower development costs. Machine learning models have recently been widely used for this purpose, and several studies have shown that models such as decision trees, Naive Bayes, k-nearest neighbors, artificial neural networks, and support vector machines are useful for predicting software defects. However, almost all SDP studies using machine learning did not tune hyperparameters to find the optimal model configuration. A second issue is selecting prominent features to use as input to the classifier. Principal Component Analysis (PCA) is one approach for reducing irrelevant features, but achieving optimal classification performance requires determining the optimal number of components.

This paper compares optimized machine learning models for SDP on the NASA MDP datasets. Several traditional machine learning models were used to classify 12 of the NASA MDP datasets. The hyperparameters of each model were optimized using random search to obtain the best classifier for each dataset. Feature dimensionality was reduced using PCA, and the number of selected components was also optimized using random search. Because the NASA MDP datasets are imbalanced, the synthetic minority oversampling technique (SMOTE) was used to oversample the minority class.

Accepted Answer

Machine learning models have shown varying performance in SDP. Iqbal et al. found that random forest (RF) achieved the highest performance based on ROC area score. Ji et al. proposed a weighted naive Bayes classifier with improved prediction accuracy. Marian et al. claimed that fuzzy decision trees outperformed standard decision trees (DT) in AUC scores, but the study lacked sufficient information. Hammad et al. achieved an accuracy of up to 87% using k-NN. Kumar et al. proposed FCM-GM, which outperformed traditional FCM and k-NN. Rong et al. used SVM with CBA optimization, outperforming other classifiers. Jayanthi et al. achieved an AUC of 97.20% using an ANN combined with enhanced PCA. However, most of these studies did not tune hyperparameters or optimize the number of PCA components, so the reported results may not reflect the best performance each model can achieve.

Accepted Answer

The NASA MDP dataset is used by software engineers to examine the relationship between software metrics and software defects. It includes information on 24 NASA software projects, such as the number of defects, code base size, and effort required to develop the software. The dataset comprises both public and confidential data, with the public data accessible to all and the confidential data accessible only to authorized users. The clean version of the NASA MDP datasets from the D' collection was used in this research. It consists of datasets from 12 projects: CM1, JM1, KC1, KC3, MC1, MC2, MW1, PC1, PC2, PC3, PC4, and PC5. The number of features varies across datasets, but all share the same two classes, labeled Y (defective) and N (not defective).

Accepted Answer

The research applied an oversampling strategy using the synthetic minority oversampling technique (SMOTE). SMOTE is a popular technique for addressing class imbalance in machine learning. It creates synthetic minority-class instances by interpolating between existing ones. Specifically, SMOTE selects a minority class instance and finds its k nearest neighbors in the feature space. It then randomly selects one of those k neighbors and interpolates between the minority sample and the selected neighbor. The result is a new instance that resembles existing minority-class instances but is not an exact copy of any of them.
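The interpolation step described above can be illustrated with a minimal sketch. It is not the implementation used in the research. The function name `smote_sketch`, the toy data, and the default `k` are assumptions made for illustration, and neighbors are searched only within the minority class.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between the base sample and the neighbor.
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class (hypothetical values, not taken from the NASA MDP data).
minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote_sketch(minority_samples, n_new=6, k=2)
```

Because each synthetic point lies on a segment between two existing minority samples, it stays inside the region the minority class already occupies.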

Accepted Answer

Principal Component Analysis (PCA) reduces dimensionality by transforming features into new variables called principal components. These components are uncorrelated and retain essential information from the original data. PCA computes the eigenvectors and eigenvalues of the data's covariance matrix: the eigenvectors are the principal components, and the eigenvalues give the variance along each component. The dataset is projected onto a lower-dimensional space by selecting the top n eigenvectors (those with the largest eigenvalues), where n is the desired number of dimensions. In this research, the value of n was determined through a random search to achieve optimal model performance. Additionally, scaled components were used to avoid the dominance of certain components, ensuring a balanced representation of the data.
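As a rough illustration of the covariance-based procedure, the sketch below centers the data, builds the covariance matrix, finds the leading eigenvectors, and projects onto them. Power iteration with deflation is used here only to keep the example dependency-free; it is an assumption of this sketch, not something the research describes. All function names and the toy data are hypothetical, and feature scaling is omitted.

```python
import math
import random


def center(rows):
    """Subtract the per-feature mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def top_eigenpairs(cov, n_components, iters=200, seed=0):
    """Leading (eigenvalue, eigenvector) pairs via power iteration with deflation."""
    rng = random.Random(seed)
    d = len(cov)
    mat = [row[:] for row in cov]
    pairs = []
    for _ in range(n_components):
        vec = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _ in range(iters):
            nxt = mat_vec(mat, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # no variance left in the deflated matrix
                break
            vec = [x / norm for x in nxt]
        value = sum(v * w for v, w in zip(vec, mat_vec(mat, vec)))
        pairs.append((value, vec))
        # Deflate: remove the component just found so the next one can emerge.
        mat = [[mat[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return pairs


def project(centered, pairs):
    """Project centered rows onto the selected eigenvectors."""
    return [[sum(c * v for c, v in zip(row, vec)) for _, vec in pairs]
            for row in centered]


# Toy data lying on the line y = 2x (hypothetical values).
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered_data = center(data)
pairs = top_eigenpairs(covariance(centered_data), n_components=1)
reduced = project(centered_data, pairs)
```

In this toy case all the variance lies along one direction, so a single component reproduces the data without loss. On real data the number of components is a trade-off, which is why the research tunes it.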

Accepted Answer

In this research, six traditional machine learning models were employed for classification: k-nearest neighbors (k-NN), logistic regression (LR), decision tree (DT), linear discriminant analysis (LDA), support vector machine (SVM), and a single-hidden-layer multilayer perceptron (SHL-MLP). Each model was optimized using random search to determine the best hyperparameters for accurate classification of the datasets.
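Random search can be sketched as follows: sample hyperparameter combinations at random from a search space, score each one, and keep the best. The example uses a small hand-written k-NN classifier and a toy validation set. The search space, the data, and every function name are assumptions made for illustration, not the settings used in the research.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def random_search(score_fn, space, n_iter=10, seed=0):
    """Sample n_iter random configurations from space and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, -math.inf
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy two-class data (hypothetical values): N = not defective, Y = defective.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["N", "N", "N", "Y", "Y", "Y"]
val_X = [[0.5, 0.5], [5.5, 5.5]]
val_y = ["N", "Y"]


def validation_accuracy(k):
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(val_X, val_y))
    return hits / len(val_y)


best_params, best_score = random_search(validation_accuracy, {"k": [1, 3, 5]}, n_iter=5)
```

The same loop extends to more than one hyperparameter by adding entries to the search space, for example a hypothetical `n_components` entry for the number of PCA components.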

Accepted Answer

Stratified random subsampling is used in model evaluation to ensure that each class is represented proportionally in both the training and testing sets. Each split preserves the class proportions of the original dataset, which helps produce unbiased results and accurate performance metrics. The dataset is divided into training and testing data with a 70:30 ratio, and the model's performance is evaluated using metrics such as accuracy, precision, recall, and F1 score. Because the model is trained and tested on representative samples of the data, the results are more reliable and generalizable.
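A single stratified 70:30 split and the four metrics can be sketched as below. Repeating the split with different seeds and averaging the metrics would give the subsampling variant. The function names, the toy labels, and the choice of "Y" as the positive class are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.3, seed=0):
    """Return (train_idx, test_idx), splitting each class separately to keep proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def binary_metrics(y_true, y_pred, positive="Y"):
    """Accuracy, precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Imbalanced toy labels (hypothetical): 80 not defective, 20 defective.
labels = ["N"] * 80 + ["Y"] * 20
train_idx, test_idx = stratified_split(labels)

# Toy predictions to exercise the metrics.
metrics = binary_metrics(["Y", "Y", "N", "N", "N"], ["Y", "N", "Y", "N", "N"])
```

With these toy labels, the test set keeps the original 20% share of defective samples, which a plain random split does not guarantee.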

Accepted Answer

k-NN achieved the highest accuracy on seven datasets: JM1 (77.91%), KC1 (79.31%), KC3 (91.58%), MC1 (98.97%), MC2 (83.67%), PC3 (93.46%), and PC5 (83.06%).

Most frequently asked questions

1. What is Software Defect Prediction (SDP)?

2. How do machine learning models perform in SDP?

3. What is the NASA MDP dataset used for?

4. What is the oversampling strategy used in this research?

5. How does PCA reduce dimensionality?

6. What machine learning models were used for classification?

7. How is stratified random subsampling used in model evaluation?

8. Which machine learning model achieved the highest accuracy on most datasets in software defect prediction?
