DocumentCode :
1842912
Title :
A novel feature selection technique for highly imbalanced data
Author :
Khoshgoftaar, Taghi M. ; Gao, Kehan ; Van Hulse, Jason
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
fYear :
2010
fDate :
4-6 Aug. 2010
Firstpage :
80
Lastpage :
85
Abstract :
Two challenges often encountered in data mining are the presence of excessive features in a data set and unequal numbers of examples in the two classes in a binary classification problem. In this paper, we propose a novel approach to feature selection for imbalanced data in the context of software quality engineering. This technique consists of a repetitive process of data sampling followed by feature ranking and finally aggregating the results generated during the repetitive process. This repetitive feature selection method is compared with two other approaches: one uses a filter-based feature ranking technique alone on the original data, while the other uses the data sampling and feature ranking techniques together only once. The empirical validation is carried out on two groups of software data sets. The results demonstrate that our proposed repetitive feature selection method performs on average significantly better than the other two approaches, especially when the data set is highly imbalanced.
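The repetitive process outlined in the abstract (sample the data, rank the features, repeat, then aggregate) can be illustrated with a minimal sketch. The abstract does not name the sampling method, the ranking metric, or the aggregation rule, so everything specific here is an assumption made for illustration: random undersampling of the majority class, a class-mean-gap score as the filter, mean-rank aggregation, and all function and parameter names (`undersample`, `rank_features`, `repetitive_feature_selection`, `repetitions`, `top_k`). It should not be read as the authors' implementation.

```python
import random
from statistics import mean


def score_feature(values, labels):
    """Filter-style score (assumed): absolute gap between the two class means."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return abs(mean(pos) - mean(neg))


def rank_features(rows, labels):
    """Return the rank position (0 = best) of every feature index."""
    n_features = len(rows[0])
    scores = [score_feature([r[j] for r in rows], labels) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: -scores[j])
    ranks = [0] * n_features
    for position, j in enumerate(order):
        ranks[j] = position
    return ranks


def undersample(rows, labels, rng):
    """Assumed sampling step: keep all minority examples and an equal-sized
    random subset of the majority class."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = minority + rng.sample(majority, len(minority))
    return [rows[i] for i in keep], [labels[i] for i in keep]


def repetitive_feature_selection(rows, labels, repetitions=10, top_k=1, seed=0):
    """Repeat sampling + ranking, then aggregate by summed (i.e. mean) rank."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    totals = [0] * n_features
    for _ in range(repetitions):
        s_rows, s_labels = undersample(rows, labels, rng)
        for j, r in enumerate(rank_features(s_rows, s_labels)):
            totals[j] += r
    # Lowest summed rank first; ties fall back to feature index order.
    return sorted(range(n_features), key=lambda j: totals[j])[:top_k]
```

Setting `repetitions=1` corresponds to the "sampling and ranking only once" baseline mentioned in the abstract, and calling `rank_features` directly on the unsampled data corresponds to the "feature ranking alone" baseline, again under the assumptions stated above.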
Keywords :
data mining; pattern classification; software quality; binary classification problem; data sampling; feature selection technique; filter-based feature ranking technique; highly imbalanced data; repetitive feature selection method; repetitive process; software data sets; software quality engineering; Analysis of variance; Measurement; Niobium; Radio frequency; Software quality; Support vector machines; Training data
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Information Reuse and Integration (IRI), 2010 IEEE International Conference on
Conference_Location :
Las Vegas, NV
Print_ISBN :
978-1-4244-8097-5
Type :
conf
DOI :
10.1109/IRI.2010.5558961
Filename :
5558961