DocumentCode :
2010057
Title :
Feature Selection using a Random Forests Classifier for the Integrated Analysis of Multiple Data Types
Author :
Reif, David M. ; Motsinger, Alison A. ; McKinney, Brett A. ; Crowe, James E., Jr. ; Moore, Jason H.
Author_Institution :
Dept. of Molecular Physiol. & Biophys., Vanderbilt Univ., Nashville, TN
fYear :
2006
fDate :
28-29 Sept. 2006
Firstpage :
1
Lastpage :
8
Abstract :
Complex clinical phenotypes arise from the concerted interactions among the myriad components of a biological system. Therefore, comprehensive models can only be developed through the integrated study of multiple types of experimental data gathered from the system in question. The Random Forests™ (RF) method is adept at identifying relevant features having only slight main effects in high-dimensional data. This method is well-suited to integrated analysis, as relevant attributes may be selected from categorical or continuous data, and there may be interactions across data types. RF is a natural approach for studying gene-gene, gene-protein, or protein-protein interactions because importance scores for particular attributes take interactions into account. Thus, Random Forests is a promising solution to the analysis challenge posed by high-dimensional datasets including interactions among attributes of different types. In this study, we characterize the performance of RF on a range of simulated genetic and/or proteomic datasets. We compare the performance of RF in identifying relevant attributes when given genetic data alone, proteomic data alone, or a combined dataset of genetic plus proteomic data. Our results indicate that utilizing multiple data types is beneficial when the disease model is complex and the phenotypic outcome-associated data type is unknown. The results of this study also show that RF is adept at identifying relevant features in high-dimensional data with small main effects and low heritability.
Keywords :
biology computing; data integrity; feature extraction; genetics; molecular biophysics; pattern classification; proteins; Random Forests classifier; biological system; clinical phenotypes; data integration; feature selection; gene-gene interaction; gene-protein interaction; genetic dataset; integrated analysis; multiple data types; protein-protein interaction; proteomic dataset; Biological systems; Data analysis; Drugs; Genetics; Information analysis; Proteins; Proteomics; Radio frequency; Radiofrequency identification; Vaccines; Random Forests™; data integration; feature selection; gene-gene interactions; multiple data types
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Intelligence and Bioinformatics and Computational Biology, 2006. CIBCB '06. 2006 IEEE Symposium on
Conference_Location :
Toronto, Ont.
Print_ISBN :
1-4244-0624-2
Electronic_ISBN :
1-4244-0624-2
Type :
conf
DOI :
10.1109/CIBCB.2006.330987
Filename :
4133169