Title :
Similarity analysis of feature ranking techniques on imbalanced DNA microarray datasets
Author :
Dittman, David ; Khoshgoftaar, Taghi ; Wald, Randall ; Napolitano, Amri
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
Abstract :
DNA microarrays are a modern advancement in the analysis of genetic data. This technology allows a researcher to test samples for thousands of genes simultaneously. However, once the samples have been tested, the researcher must search through the collected data and identify the genes important to the problem at hand. A possible solution is the data mining pre-processing technique called feature selection. Feature (gene) selection takes the original set of features (in the case of DNA microarrays, gene probes) and chooses an optimal subset on which to perform analysis. Ideally, the reduced subset contains only the most important features as determined by the feature selection technique (or set of feature selection techniques), which allows further research to focus on the discovered genes. However, when multiple feature selection techniques are used, the set of techniques must be diverse in order to reduce redundancy among the chosen features. Another benefit of diversity is that features chosen across a diverse set of feature selection techniques will have more importance than those chosen by a single technique or by a set of related ones. Therefore, it is useful to know how similar the feature selection techniques are to each other. In this study we analyze eighteen feature selection techniques across nine imbalanced DNA microarray datasets, using four feature subset sizes. Our results show that, to minimize redundancy, one should not use Gini Index and Probability Ratio together, or the Kolmogorov-Smirnov statistic and Geometric Mean together, at any feature subset size, and that the members of the first of these pairs (along with the pair of ReliefF and ReliefF-W) are very dissimilar to all rankers outside their own cluster. We also found that Chi-Squared, Information Gain, and Symmetric Uncertainty form a cluster of similarity, as do Chi-Squared, Deviance, F-Measure, and Mutual Information.
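The kind of comparison the abstract describes can be illustrated with a minimal sketch. The abstract does not state which similarity measure the study uses, so this sketch assumes one simple choice: the fraction of features shared by the top-k subsets of two rankers. The ranker names, gene names, scores, and function names are all hypothetical and serve only as an illustration.

```python
from itertools import combinations


def top_k(scores, k):
    """Return the set of the k highest-scoring feature names."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def overlap(scores_a, scores_b, k):
    """Fraction of features shared by the top-k subsets of two rankers.

    Assumption: overlap / k is used as the similarity measure here;
    the abstract does not specify the measure used in the study.
    """
    return len(top_k(scores_a, k) & top_k(scores_b, k)) / k


def pairwise_similarity(rankers, k):
    """Similarity of every pair of rankers at feature subset size k."""
    return {
        (m, n): overlap(rankers[m], rankers[n], k)
        for m, n in combinations(sorted(rankers), 2)
    }


# Hypothetical toy scores: each ranker assigns a score to every gene probe.
rankers = {
    "ranker_a": {"g1": 0.9, "g2": 0.8, "g3": 0.1, "g4": 0.2},
    "ranker_b": {"g1": 0.7, "g2": 0.9, "g3": 0.3, "g4": 0.1},
    "ranker_c": {"g1": 0.1, "g2": 0.2, "g3": 0.9, "g4": 0.8},
}

similarity = pairwise_similarity(rankers, k=2)
for pair, value in similarity.items():
    print(pair, value)
```

Under this assumed measure, a pair with high overlap at every subset size would be redundant to use together, while a ranker with low overlap against all others would add diversity to a set of techniques.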
Keywords :
DNA; bioinformatics; biosensors; data mining; genetics; lab-on-a-chip; molecular biophysics; molecular configurations; probability; redundancy; F-measurement; Kolmogorov-Smirnov statistic; chi-square; data mining preprocessing technique; feature gene selection; feature ranking techniques; feature subset size; gene probes; genetic data analysis; geometric mean; gini index; imbalanced DNA microarray datasets; information gain; multiple feature selection techniques; mutual information; optimal subset; probability ratio; similarity analysis; symmetric uncertainty; Indexes; Stability criteria; DNA microarray; Similarity; feature selection
Conference_Title :
Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on
Conference_Location :
Philadelphia, PA
Print_ISBN :
978-1-4673-2559-2
Electronic_ISBN :
978-1-4673-2558-5
DOI :
10.1109/BIBM.2012.6392708