DocumentCode
589273
Title
An Empirical Study on the Stability of Feature Selection for Imbalanced Software Engineering Data
Author
Wang, Huanjing ; Khoshgoftaar, Taghi M. ; Napolitano, Antonio
Volume
1
fYear
2012
fDate
12-15 Dec. 2012
Firstpage
317
Lastpage
323
Abstract
In software quality modeling, software metrics are collected during the software development cycle. However, not all metrics are relevant to the class attribute (software quality). Metric (feature) selection has become the cornerstone of many software quality classification problems. Selecting the software metrics that are important for software quality classification is a necessary and critical step before the model training process. Recently, the robustness (e.g., stability) of feature selection techniques has been studied to examine the sensitivity of these techniques to changes in the data (adding program modules to, or removing them from, the dataset). This work provides an empirical study of the stability of feature selection techniques across six software metrics datasets with varying levels of class balance. Eighteen feature selection techniques are evaluated. Three factors are considered in evaluating their stability: feature subset size, degree of perturbation, and class balance of the datasets. Experimental results show that these factors affect the stability of feature selection techniques, as one might expect. We found that, with few exceptions, feature rankings based on highly imbalanced datasets are less stable than those based on slightly imbalanced data. Results also show that making smaller changes to the datasets has less impact on the stability of feature ranking techniques. Overall, we conclude that a careful understanding of one's dataset (and certain choices of metric selection technique) can help practitioners build more reliable software quality models.
Keywords
pattern classification; software metrics; software quality; feature ranking; feature selection stability; imbalanced software engineering data; model training process; software development cycle; software quality classification problems; software quality modeling; Indexes; Measurement; Radio frequency; Stability criteria; imbalanced data; stability; subsample
fLanguage
English
Publisher
IEEE
Conference_Titel
Machine Learning and Applications (ICMLA), 2012 11th International Conference on
Conference_Location
Boca Raton, FL
Print_ISBN
978-1-4673-4651-1
Type
conf
DOI
10.1109/ICMLA.2012.60
Filename
6406682