Title :
Comparative study of ensemble learning approaches in the identification of disease mutations
Author :
Wu, Jiaxin ; Zhang, Wangshu ; Jiang, Rui
Author_Institution :
Dept. of Autom., Tsinghua Univ., Beijing, China
Abstract :
With the accelerating advancement of biomedical research, it has been widely accepted that genetic variation plays a critical role in the pathogenesis of human inherited diseases. As an important type of genetic variation, nonsynonymous single nucleotide polymorphisms (nsSNPs) occur in protein-coding regions and lead to amino acid substitutions that can alter the structure and function of proteins and potentially cause human diseases. Hence, distinguishing disease-associated nsSNPs from neutral ones by machine learning approaches plays an important role in understanding the genetic bases of human diseases and, in turn, in promoting the prevention, diagnosis, and treatment of these diseases. In this paper, we formulate the task of identifying disease-associated nsSNPs as a binary classification problem. Based on a set of 26 numeric features derived from protein sequence information, we compare the performance of five popular ensemble learning approaches (AdaBoost, LogitBoost, random forests, L2 boosting, and stochastic gradient regression) with two traditional classification methods (decision trees and support vector machines) on this classification problem. Systematic validation demonstrates that ensemble learning approaches are in general more effective in identifying disease-associated nsSNPs, and that LogitBoost achieves the highest performance among all the methods compared.
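The kind of comparison the abstract describes, an ensemble method against a single traditional classifier on a binary task with 26 numeric features, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the data are synthetic Gaussian features rather than real nsSNP features, the labeling rule is invented, the weak learner is a one-level decision stump over a fixed threshold grid, and all names (`make_data`, `fit_stump`, `fit_adaboost`, `THRESHOLDS`) are hypothetical. Only AdaBoost is sketched; the other methods named in the abstract are not.

```python
import math
import random

THRESHOLDS = [-1.0, -0.5, 0.0, 0.5, 1.0]  # assumed coarse grid of split points


def make_data(n, n_features=26, seed=0):
    """Synthetic stand-in data: 26 Gaussian features, label in {-1, +1}.

    The label depends on the first three features only (an invented rule).
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        X.append(row)
        y.append(1 if row[0] + row[1] + row[2] > 0 else -1)
    return X, y


def fit_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump.

    Assumes the weights w sum to 1, so flipping polarity gives error 1 - err.
    """
    best = None
    for j in range(len(X[0])):
        for t in THRESHOLDS:
            err = sum(wi for row, yi, wi in zip(X, y, w)
                      if (1 if row[j] > t else -1) != yi)
            for polarity, e in ((1, err), (-1, 1.0 - err)):
                if best is None or e < best[0]:
                    best = (e, j, t, polarity)
    return best


def stump_predict(stump, row):
    _, j, t, polarity = stump
    return polarity if row[j] > t else -polarity


def fit_adaboost(X, y, rounds=30):
    """Discrete AdaBoost: reweight samples so later stumps focus on mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1.0 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def ensemble_predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score > 0 else -1


def accuracy(predict, X, y):
    return sum(predict(row) == yi for row, yi in zip(X, y)) / len(y)


X_train, y_train = make_data(400, seed=0)
X_test, y_test = make_data(400, seed=1)

single = fit_stump(X_train, y_train, [1.0 / len(X_train)] * len(X_train))
ensemble = fit_adaboost(X_train, y_train)

acc_single = accuracy(lambda r: stump_predict(single, r), X_test, y_test)
acc_ens = accuracy(lambda r: ensemble_predict(ensemble, r), X_test, y_test)
print(f"single stump accuracy:      {acc_single:.3f}")
print(f"AdaBoost (30 stumps) accuracy: {acc_ens:.3f}")
```

On this toy problem the boosted ensemble should score higher on held-out data than the single stump, because the label depends on several features at once. That says nothing about the relative ranking of methods on real nsSNP data, which is what the paper's systematic validation addresses.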
Keywords :
DNA; decision trees; diseases; genetics; gradient methods; learning (artificial intelligence); medical computing; molecular biophysics; molecular configurations; pattern classification; proteins; regression analysis; support vector machines; AdaBoost; L2 boosting; LogitBoost; binary classification problem; disease associated nsSNP; disease mutation identification; ensemble learning approach; genetic disease diagnosis; genetic disease prevention; genetic disease treatment; genetic variation; human disease genetic basis; human inherited disease pathogenesis; machine learning; nonsynonymous single nucleotide polymorphisms; protein amino acid substitution; protein coding region; protein function; protein structure; random forests; stochastic gradient regression; binary classification; ensemble learning
Conference_Title :
Biomedical Engineering and Informatics (BMEI), 2010 3rd International Conference on
Conference_Location :
Yantai
Print_ISBN :
978-1-4244-6495-1
DOI :
10.1109/BMEI.2010.5639753