DocumentCode :
260301
Title :
Using Correlation-Based Feature Selection for a Diverse Collection of Bioinformatics Datasets
Author :
Wald, Randall ; Khoshgoftaar, Taghi M. ; Napolitano, Amri
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
fYear :
2014
fDate :
10-12 Nov. 2014
Firstpage :
156
Lastpage :
162
Abstract :
The large number of genes found in most gene microarray datasets demands the use of feature selection techniques to alleviate the problem of high dimensionality. However, the computational cost of filter-based subset evaluation techniques such as Correlation-Based Feature Selection (CFS) has generally limited the use of these techniques to smaller datasets, or at least smaller collections of gene microarray datasets. No previous work has applied CFS to a large and diverse range of bioinformatics datasets. To address this deficit, we employ nine different microarray datasets exhibiting a wide range of characteristics in terms of dataset balance (fraction of instances found in the minority class) and dataset difficulty of learning (overall difficulty of building effective classification models on raw, pre-feature-selection datasets). We also use five classification learners to discover how these perform in conjunction with CFS, along with five performance metrics to give a broad perspective on our results. The results show that CFS can be used to help build effective models, in particular when used with the 5-Nearest Neighbors learner on data that is Easy or Moderate (in terms of difficulty of learning) or Balanced (in terms of class distribution). For other types of data, the optimal learner varies, although in most cases the Logistic Regression learner performs worst in conjunction with CFS.
Keywords :
bioinformatics; biological techniques; cellular biophysics; correlation methods; feature selection; genetics; 5-Nearest Neighbors; bioinformatics datasets; correlation-based feature selection; data set dimensionality; dataset balance; dataset difficulty of learning; diverse collection; feature selection techniques; filter-based subset evaluation techniques; five classification learners; gene microarray datasets; logistic regression learner; optimal learner variations; performance metrics; pre-feature-selection datasets; Bioinformatics; Buildings; Cancer; Correlation; Measurement; Niobium; Support vector machines; Balance; Bioinformatics; Correlation-Based Feature Selection; Difficulty of Learning;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Bioinformatics and Bioengineering (BIBE), 2014 IEEE International Conference on
Conference_Location :
Boca Raton, FL
Type :
conf
DOI :
10.1109/BIBE.2014.63
Filename :
7033574