First Order Statistics Based Feature Selection: A Diverse and Powerful Family of Feature Selection Techniques

Author

Khoshgoftaar, Taghi ; Dittman, D. ; Wald, Randall ; Fazelpour, Alireza

Author_Institution

Florida Atlantic Univ., Boca Raton, FL, USA

Volume

2

fYear

2012

fDate

12-15 Dec. 2012

Firstpage

151

Lastpage

157

Abstract

Dimensionality reduction techniques have become a required step when working with bioinformatics datasets. Techniques such as feature selection are known not only to reduce computation time but also to improve experimental results by removing redundant and irrelevant features (genes) from subsequent analysis. Univariate feature selection techniques in particular are well suited to the high dimensionality inherent in bioinformatics datasets (for example, DNA microarray datasets) because of their intuitive output (a ranked list of features or genes) and their relatively low computational cost compared to other techniques. This paper presents seven univariate feature selection techniques and collects them into a single family entitled First Order Statistics (FOS) based feature selection. All seven use first order statistical measures such as mean and standard deviation, yet this is the first work to relate them to one another and compare their performance. To examine the properties of these seven techniques, we performed a series of similarity and classification experiments on eleven DNA microarray datasets. Our results show that, in general, each feature selection technique produces a feature subset that differs from those of the other members of the family. However, in classification we find that, with one exception, the techniques produce good results and perform similarly to one another. We recommend the rankers Signal-to-Noise and SAM for the best classification results, and we recommend avoiding Fold Change Ratio, as it is consistently the worst performer of the seven rankers.

Keywords

bioinformatics; statistical analysis; DNA microarray datasets; FOS based feature selection; SAM; bioinformatics datasets; dimensionality reduction techniques; first order statistical measures; first order statistics based feature selection; fold change ratio; mean deviation; signal-to-noise; standard deviation; subsequent analysis; univariate feature selection techniques; Bioinformatics; DNA; Logistics; Measurement; Standards; Support vector machines; Vegetation; Classification; DNA Microarray; Feature Selection;

fLanguage

English

Publisher

IEEE

Conference_Titel

Machine Learning and Applications (ICMLA), 2012 11th International Conference on

Conference_Location

Boca Raton, FL

Print_ISBN

978-1-4673-4651-1

Type

conf

DOI

10.1109/ICMLA.2012.192

Filename

6406743