Title :
Comparison of approaches to alleviate problems with high-dimensional and class-imbalanced data
Author :
Shanab, Ahmad Abu ; Khoshgoftaar, Taghi M. ; Wald, Randall ; Van Hulse, Jason
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
Abstract :
Two of the most challenging problems in data mining are working with imbalanced datasets and with datasets that have a large number of attributes. In this study we compare three different approaches for handling class imbalance and high dimensionality simultaneously. The first approach consists of sampling followed by feature selection, with the training data being built using the selected features and the original (unsampled) data. The second approach is similar, except that it uses the sampled data (and selected features) to build the training data. In the third approach, feature selection takes place before sampling, and the training data is based on the sampled data. To compare these three approaches, we use seven groups of datasets covering different application domains, employ nine feature rankers from three different families, and generate artificial class noise to better simulate real-world datasets. The results differ from those of an earlier work and show that the first and third approaches perform, on average, better than the second approach.
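The ordering of the three approaches can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the sampling technique or the nine rankers, so random undersampling and a mean-difference score are assumed stand-ins, and all function names (`undersample`, `rank_features`, `project`, `approach_1` to `approach_3`) are hypothetical.

```python
import random


def undersample(rows, labels, rng):
    """Assumed sampler: randomly drop majority-class rows until classes balance."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in keep], [labels[i] for i in keep]


def rank_features(rows, labels, k):
    """Assumed ranker: score each feature by |mean(class 1) - mean(class 0)|.

    Requires at least one row of each class. Returns the k best indices, sorted.
    """
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        ones = [r[j] for r, y in zip(rows, labels) if y == 1]
        zeros = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append(abs(sum(ones) / len(ones) - sum(zeros) / len(zeros)))
    top = sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


def project(rows, features):
    """Keep only the selected feature columns."""
    return [[r[j] for j in features] for r in rows]


def approach_1(rows, labels, k, rng):
    # Sample, select features on the sampled data, train on ORIGINAL data.
    s_rows, s_labels = undersample(rows, labels, rng)
    feats = rank_features(s_rows, s_labels, k)
    return project(rows, feats), labels, feats


def approach_2(rows, labels, k, rng):
    # Sample, select features on the sampled data, train on SAMPLED data.
    s_rows, s_labels = undersample(rows, labels, rng)
    feats = rank_features(s_rows, s_labels, k)
    return project(s_rows, feats), s_labels, feats


def approach_3(rows, labels, k, rng):
    # Select features on the original data, then sample; train on SAMPLED data.
    feats = rank_features(rows, labels, k)
    s_rows, s_labels = undersample(project(rows, feats), labels, rng)
    return s_rows, s_labels, feats
```

Each function returns the training rows, their labels, and the selected feature indices. The only differences are which data the ranker sees and which data ends up in the training set, which is the distinction the abstract draws.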
Keywords :
data mining; artificial class noise; data sample; dataset group; feature selection; imbalanced datasets; real-world datasets; training data; Cancer; Entropy; Lungs; Measurement; Noise; Support vector machines; Training
Conference_Title :
Information Reuse and Integration (IRI), 2011 IEEE International Conference on
Conference_Location :
Las Vegas, NV
Print_ISBN :
978-1-4577-0964-7
Electronic_ISBN :
978-1-4577-0965-4
DOI :
10.1109/IRI.2011.6009552