DocumentCode :
589309
Title :
First Order Statistics Based Feature Selection: A Diverse and Powerful Family of Feature Selection Techniques
Author :
Khoshgoftaar, Taghi ; Dittman, D. ; Wald, Randall ; Fazelpour, Alireza
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
Volume :
2
fYear :
2012
fDate :
12-15 Dec. 2012
Firstpage :
151
Lastpage :
157
Abstract :
Dimensionality reduction techniques have become a required step when working with bioinformatics datasets. Techniques such as feature selection are known not only to reduce computation time but also to improve experimental results by removing redundant and irrelevant features (genes) from subsequent analysis. Univariate feature selection techniques in particular are well suited to the high dimensionality inherent in bioinformatics datasets (for example, DNA microarray datasets) because of their intuitive output (a ranked list of features or genes) and their relatively small computational cost compared with other techniques. This paper presents seven univariate feature selection techniques and collects them into a single family entitled First Order Statistics (FOS) based feature selection. All seven use first order statistical measures such as the mean and standard deviation, but this is the first work to relate them to one another and to compare their performance. To examine the properties of these seven techniques, we performed a series of similarity and classification experiments on eleven DNA microarray datasets. Our results show that, in general, each feature selection technique creates feature subsets that differ from those of the other members of the family. However, when we look at classification, we find that, with one exception, the techniques produce good classification results and perform similarly to one another. Our recommendation is to use the rankers Signal-to-Noise and SAM for the best classification results and to avoid Fold Change Ratio, as it is consistently the worst performer of the seven rankers.
Keywords :
bioinformatics; statistical analysis; DNA microarray datasets; FOS based feature selection; SAM; bioinformatics datasets; dimensionality reduction techniques; first order statistical measures; first order statistics based feature selection; fold change ratio; mean deviation; signal-to-noise; standard deviation; subsequent analysis; univariate feature selection techniques; Bioinformatics; DNA; Logistics; Measurement; Standards; Support vector machines; Vegetation; Classification; DNA Microarray; Feature Selection;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Machine Learning and Applications (ICMLA), 2012 11th International Conference on
Conference_Location :
Boca Raton, FL
Print_ISBN :
978-1-4673-4651-1
Type :
conf
DOI :
10.1109/ICMLA.2012.192
Filename :
6406743