Abstract:
Software reliability depends on the prevalence of fault-prone modules: the fewer fault-prone units a system contains, the more we can trust it. Therefore, if we can predict the number of fault-prone modules in a software system, we can judge its reliability. One of the key contributors to predicting fault-prone modules is software metrics, by which one can classify software modules into fault-prone and non-fault-prone classes. To make such a classification, we investigated 17 classification methods whose features (attributes) were 39 software metrics and whose mining instances (software modules) came from 13 datasets reported by NASA. However, two important issues influence prediction accuracy when data mining methods are used: (1) selecting the best/most influential features (i.e., software metrics) when there is a wide diversity of them, and (2) sampling the instances to balance an imbalanced dataset; with two imbalanced classes, a classifier biases towards the majority class. Based on feature selection and instance sampling, we considered 4 scenarios in appraising the 17 classification methods for predicting software fault-prone modules. For feature selection we used correlation-based feature selection (CFS), and for instance sampling we implemented the synthetic minority oversampling technique (SMOTE). The empirical results show that suitable sampling of software modules significantly influences the accuracy of predicting software reliability, whereas metric selection does not have a considerable effect on the prediction. Furthermore, among the classifiers examined, bagging, K*, and random forest perform best when the sampled instances are used as training data.
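To make the SMOTE step concrete, the following is a minimal sketch of SMOTE-style minority oversampling on toy "module metrics" data. The function name, the toy class sizes, and the use of plain numpy (rather than any specific SMOTE library implementation) are all illustrative assumptions, not the paper's actual experimental setup; only the core idea — interpolating between a minority point and one of its k nearest minority neighbours — follows the SMOTE algorithm.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen minority point and one of its k nearest minority
    neighbours (a simplified sketch of SMOTE, not a library implementation)."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from X_min[i] to all minority points
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                    # interpolation weight in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy imbalanced data: 50 non-fault-prone modules vs 8 fault-prone ones,
# each described by 3 (hypothetical) software metrics.
rng = np.random.default_rng(0)
X_major = rng.normal(0.0, 1.0, size=(50, 3))
X_minor = rng.normal(2.0, 1.0, size=(8, 3))

new_samples = smote_like_oversample(X_minor, n_new=42, k=3, rng=1)
X_balanced = np.vstack([X_major, X_minor, new_samples])
print(X_balanced.shape)  # (100, 3) — the two classes are now balanced 50/50
```

Because each synthetic point lies on a segment between two existing minority points, the oversampled class occupies the same region of metric space as the original minority class instead of merely duplicating points, which is what reduces the classifier's bias towards the majority class.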
Keywords:
Software Fault Prediction, Classifier Performance, Feature Selection, Data Sampling, Software Metric, Dependent Variable, Independent Variable