1. What is Software Defect Prediction (SDP)?
Software Defect Prediction (SDP) is a technique for identifying potential software defects early, before they cause failures. It is an important and challenging task in software engineering, because early defect detection is linked to both better software quality and lower development costs. Machine learning models have recently been widely used for this purpose, and several studies have shown that classifiers such as Decision Tree, Naive Bayes, K-Nearest Neighbors, Artificial Neural Network, and Support Vector Machine are useful for predicting software defects. However, almost all of these studies did not tune hyperparameters to find the optimal model settings. A second issue is selecting the prominent features to use as input to the classifier. Principal Component Analysis (PCA) can be used to reduce irrelevant features, but achieving optimal classification performance requires determining the optimal number of selected components. This paper compares optimized machine learning models for SDP on 12 datasets from the NASA MDP collection, classified with several traditional machine learning models. For each dataset, the model hyperparameters were optimized using random search to obtain the best classifier. The dimensionality of the features was reduced using PCA, and the number of selected components was also optimized using random search. Because the NASA MDP datasets are imbalanced, the synthetic minority oversampling technique (SMOTE) was used to oversample the minority class.
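The random search step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `random_search`, the search space, and the `toy_score` function are all assumptions made for illustration. In a real experiment the scoring function would be a cross-validated metric of a classifier trained on PCA-reduced features.

```python
import random


def random_search(evaluate, space, n_iter=20, seed=0):
    """Sample n_iter configurations from `space` and return the best one.

    space: dict mapping a hyperparameter name to a list of candidate values.
    evaluate: callable taking a config dict and returning a score
              (higher is better).
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently and uniformly.
        config = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical search space: the number of PCA components is tuned
# together with two k-NN settings (names and ranges are assumptions).
space = {
    "n_components": list(range(2, 21)),
    "n_neighbors": [1, 3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],
}


def toy_score(config):
    # Stand-in for a cross-validated metric; NOT a real model evaluation.
    return -abs(config["n_components"] - 10) - abs(config["n_neighbors"] - 5)


best, score = random_search(toy_score, space, n_iter=50, seed=42)
```

Treating the number of PCA components as one more entry in the search space is what lets a single random search tune the dimensionality reduction and the classifier together.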
2. How do machine learning models perform in SDP prediction?
Machine learning models have shown varying performance in SDP. Iqbal et al. found that Random Forest (RF) achieved the highest performance based on ROC area score. Ji et al. proposed a weighted Naive Bayes classifier with improved prediction accuracy. Marian et al. claimed that fuzzy decision trees outperformed standard decision trees (DT) in AUC scores, but the study did not provide sufficient information. Hammad et al. achieved an accuracy of up to 87% using k-NN. Kumar et al. proposed FCM-GM, which outperformed traditional FCM and k-NN. Rong et al. used SVM with CBA optimization, which outperformed other classifiers. Jayanthi et al. achieved an AUC of 97.20% using an ANN with enhanced PCA. However, most of these studies did not perform hyperparameter tuning or optimize the number of PCA components, so the reported results may not reflect the best performance each model can achieve.
3. What is the NASA MDP dataset used for?
The NASA MDP dataset is used by software engineers to examine the relationship between software metrics and software defects. It includes information on 24 NASA software projects, such as the number of defects, the size of the code base, and the effort required to develop the software. The dataset comprises both public and confidential data: the public data is accessible to all, while the confidential data is accessible only to authorized users. This research used the clean version of the NASA MDP datasets from the D' collection, which consists of datasets from 12 projects: CM1, JM1, KC1, KC3, MC1, MC2, MW1, PC1, PC2, PC3, PC4, and PC5. The number of features varies between datasets, but all of them share the same two classes: defective (Y) and non-defective (N).
4. What is the oversampling strategy used in this research?
The research used the synthetic minority oversampling technique (SMOTE) as its oversampling strategy. SMOTE is a popular technique for addressing class imbalance in machine learning. It creates synthetic minority-class instances by interpolating between existing ones. Specifically, SMOTE selects a minority class instance and finds its k nearest neighbors in the feature space. It then randomly selects one of those k neighbors and interpolates between the minority sample and the selected neighbor. The result is a new instance that is similar to the minority class but is not an exact copy of any existing instance.
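The interpolation step can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's code or a library implementation: the function name `smote`, its parameters, and the choice to pick the base sample at random for each synthetic point are assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority samples.

    minority: list of equal-length feature vectors (lists of floats).
    k: number of nearest neighbours considered for each chosen sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # 1. Pick a minority sample at random.
        i = rng.randrange(len(minority))
        x = minority[i]
        # 2. Find its k nearest minority neighbours (Euclidean distance).
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbours = others[:k]
        # 3. Pick one neighbour and interpolate with a random gap in [0, 1).
        nb = minority[rng.choice(neighbours)]
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Usage: four minority samples on the corners of the unit square.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(points, n_synthetic=10, k=2, seed=1)
```

Each synthetic point lies on the line segment between a minority sample and one of its neighbours, which is why it resembles the minority class without duplicating an existing instance.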