Title :
A New Fixed-Overlap Partitioning Algorithm for Determining Stability of Bioinformatics Gene Rankers
Author :
Wald, Randall ; Khoshgoftaar, Taghi ; Dittman, D.
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
Abstract :
Feature (gene) selection has become an important and necessary step for combating high dimensionality, a problem found in bioinformatics datasets. Many studies have focused on gene selection, examining both the design of these techniques and the classification performance of prediction models built using them. However, only recently has any work focused on the robustness, or stability, of these gene selection techniques. Robustness is important because techniques that do not give reliable gene lists cannot be trusted to give useful genes. Previous papers studying stability typically generate multiple random subsamples of the original dataset and either compare the genes chosen from these subsamples with one another or compare them directly with the genes chosen from the original data. Both methods have known problems: the first compares two randomly generated datasets with an unknown level of overlap, and the second compares two datasets of different sizes. This paper introduces a new algorithm for generating subsample datasets, called fixed-overlap partitions, which generates subsamples that have exactly the desired level of overlap and number of instances. Using this method, we evaluate nineteen feature selection techniques on twenty-six real-world DNA microarray datasets. Our results show that three rankers (Deviance, Receiver Operating Characteristic Curve, and Precision-Recall Curve) are consistently the most stable. However, the level of overlap, the quality of the data, and the number of genes selected all affect which ranker will be the most stable in a given situation. In particular, the fixed-overlap partitions algorithm shows how varying the level of overlap can make datasets of different difficulty levels resemble one another (for example, moderate-difficulty datasets behave like easy-difficulty datasets at low levels of overlap, but diverge as the overlap increases).
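The abstract does not describe the fixed-overlap partitions algorithm itself, so the following is only a minimal sketch of the general idea of drawing two equally sized subsamples that share an exact, chosen fraction of instances. The function name `fixed_overlap_pair`, its parameters, and the sampling strategy are illustrative assumptions, not the authors' method.

```python
import random


def fixed_overlap_pair(n_instances, subsample_size, overlap, seed=None):
    """Draw two index subsamples of equal size with an exact shared count.

    Hypothetical illustration only: `overlap` is the fraction of each
    subsample that also appears in the other one. This is NOT the
    algorithm from the paper, whose details the abstract does not give.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")

    shared_count = round(overlap * subsample_size)
    unique_count = subsample_size - shared_count
    # Distinct instances required: the shared block plus two disjoint
    # blocks of instances that belong to only one subsample each.
    needed = shared_count + 2 * unique_count
    if needed > n_instances:
        raise ValueError("dataset too small for this size and overlap")

    rng = random.Random(seed)
    pool = rng.sample(range(n_instances), needed)  # sampled without replacement

    shared = pool[:shared_count]
    first_only = pool[shared_count:shared_count + unique_count]
    second_only = pool[shared_count + unique_count:]

    return sorted(shared + first_only), sorted(shared + second_only)


if __name__ == "__main__":
    first, second = fixed_overlap_pair(100, 40, 0.5, seed=0)
    print(len(first), len(second), len(set(first) & set(second)))
```

Because the shared and unshared blocks are cut from a single sample drawn without replacement, both the subsample size and the number of shared instances are fixed by construction rather than left to chance, which is the property the abstract attributes to fixed-overlap partitions.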
Keywords :
bioinformatics; data handling; DNA microarray datasets; bioinformatics datasets; bioinformatics gene rankers; determining stability; feature gene selection; fixed overlap partitions; gene selection techniques; new fixed overlap partitioning algorithm; Bioinformatics; DNA; Market research; Measurement; Partitioning algorithms; Stability criteria; DNA Microarray; Fixed-Overlap Partitions; Stability;
Conference_Title :
Machine Learning and Applications (ICMLA), 2012 11th International Conference on
Conference_Location :
Boca Raton, FL
Print_ISBN :
978-1-4673-4651-1
DOI :
10.1109/ICMLA.2012.149