DocumentCode
3146827
Title
Comparative study of ensemble learning approaches in the identification of disease mutations
Author
Wu, Jiaxin ; Zhang, Wangshu ; Jiang, Rui
Author_Institution
Dept. of Autom., Tsinghua Univ., Beijing, China
Volume
6
fYear
2010
fDate
16-18 Oct. 2010
Firstpage
2306
Lastpage
2310
Abstract
With the accelerating advancement of biomedical research, it has become widely accepted that genetic variation plays a critical role in the pathogenesis of human inherited diseases. As an important type of genetic variation, nonsynonymous single nucleotide polymorphisms (nsSNPs) occur in protein-coding regions and lead to amino acid substitutions that can affect protein structure and function and potentially cause human diseases. Hence, distinguishing disease-associated nsSNPs from neutral ones by machine learning approaches plays an important role in understanding the genetic bases of human diseases and in promoting the prevention, diagnosis, and treatment of these diseases. In this paper, we formulate the task of identifying disease-associated nsSNPs as a binary classification problem. Based on a set of 26 numeric features derived from protein sequence information, we compare the performance of five popular ensemble learning approaches (AdaBoost, LogitBoost, random forests, L2 boosting, and stochastic gradient regression) with two traditional classification methods (decision trees and support vector machines) on this classification problem. Systematic validation demonstrates that ensemble learning approaches are in general more effective in identifying disease-associated nsSNPs, with LogitBoost achieving the highest performance among all the methods compared.
Keywords
DNA; decision trees; diseases; genetics; gradient methods; learning (artificial intelligence); medical computing; molecular biophysics; molecular configurations; pattern classification; proteins; regression analysis; support vector machines; AdaBoost; L2 boosting; LogitBoost; binary classification problem; disease associated nsSNP; disease mutation identification; ensemble learning approach; genetic disease diagnosis; genetic disease prevention; genetic disease treatment; genetic variation; human disease genetic basis; human inherited disease pathogenesis; machine learning; nonsynonymous single nucleotide polymorphisms; protein amino acid substitution; protein coding region; protein function; protein structure; random forests; stochastic gradient regression; binary classification; ensemble learning
fLanguage
English
Publisher
IEEE
Conference_Titel
Biomedical Engineering and Informatics (BMEI), 2010 3rd International Conference on
Conference_Location
Yantai
Print_ISBN
978-1-4244-6495-1
Type
conf
DOI
10.1109/BMEI.2010.5639753
Filename
5639753