  • DocumentCode
    3146827
  • Title
    Comparative study of ensemble learning approaches in the identification of disease mutations
  • Author
    Wu, Jiaxin ; Zhang, Wangshu ; Jiang, Rui
  • Author_Institution
    Dept. of Autom., Tsinghua Univ., Beijing, China
  • Volume
    6
  • fYear
    2010
  • fDate
    16-18 Oct. 2010
  • Firstpage
    2306
  • Lastpage
    2310
  • Abstract
    With the accelerating advancement of biomedical research, it has been widely accepted that genetic variation plays a critical role in the pathogenesis of human inherited diseases. As an important type of genetic variation, nonsynonymous single nucleotide polymorphisms (nsSNPs) occur in protein coding regions and lead to amino acid substitutions, which affect the structures and functions of proteins and can potentially cause human diseases. Hence, distinguishing disease-associated nsSNPs from neutral ones by machine learning approaches is important for understanding the genetic basis of human diseases and for promoting the prevention, diagnosis, and treatment of these diseases. In this paper, we formulate the task of identifying disease-associated nsSNPs as a binary classification problem. Based on a set of 26 numeric features derived from protein sequence information, we compare the performance of five popular ensemble learning approaches (AdaBoost, LogitBoost, random forests, L2 boosting, and stochastic gradient regression) with two traditional classification methods (decision trees and support vector machines) on this classification problem. Systematic validation demonstrates that ensemble learning approaches are in general more effective in identifying disease-associated nsSNPs, and that LogitBoost achieves the highest performance among all the methods compared.
  • Keywords
    DNA; decision trees; diseases; genetics; gradient methods; learning (artificial intelligence); medical computing; molecular biophysics; molecular configurations; pattern classification; proteins; regression analysis; support vector machines; AdaBoost; L2 boosting; LogitBoost; binary classification problem; disease associated nsSNP; disease mutation identification; ensemble learning approach; genetic disease diagnosis; genetic disease prevention; genetic disease treatment; genetic variation; human disease genetic basis; human inherited disease pathogenesis; machine learning; nonsynonymous single nucleotide polymorphisms; protein amino acid substitution; protein coding region; protein function; protein structure; random forests; stochastic gradient regression; binary classification; ensemble learning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Biomedical Engineering and Informatics (BMEI), 2010 3rd International Conference on
  • Conference_Location
    Yantai
  • Print_ISBN
    978-1-4244-6495-1
  • Type
    conf

  • DOI
    10.1109/BMEI.2010.5639753
  • Filename
    5639753
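
The binary classification setup described in the abstract can be illustrated with a minimal sketch of one of the ensemble methods it names, AdaBoost, here built on decision stumps in plain Python. This is not the authors' implementation: the synthetic 26-feature data, the labeling rule, and all function names (`make_data`, `fit_stump`, `adaboost_fit`, `adaboost_predict`) are assumptions made purely for illustration, with +1 standing in for "disease-associated" and -1 for "neutral".

```python
import math
import random


def make_data(n=80, n_features=26, seed=0):
    """Hypothetical stand-in for the 26 numeric features (not the paper's data)."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(n_features)] for _ in range(n)]
    # Assumed toy rule: the label depends on the first two features only.
    y = [1 if row[0] + row[1] > 1.0 else -1 for row in X]
    return X, y


def stump_predict(stump, row):
    """Predict +1/-1 with a one-feature threshold rule."""
    feature, threshold, polarity, _ = stump
    return polarity if row[feature] > threshold else -polarity


def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (pol if row[j] > thr else -pol) != yi)
                if best is None or err < best[3]:
                    best = (j, thr, pol, err)
    return best


def adaboost_fit(X, y, rounds=5):
    """Discrete AdaBoost: reweight samples so later stumps focus on mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        # Clamp the error so the log below stays finite for a perfect stump.
        err = min(max(stump[3], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        if stump[3] <= 1e-10:
            break  # a perfect stump leaves nothing to reweight
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


def accuracy(preds, y):
    return sum(p == t for p, t in zip(preds, y)) / len(y)


if __name__ == "__main__":
    X, y = make_data()
    model = adaboost_fit(X, y, rounds=5)
    preds = [adaboost_predict(model, row) for row in X]
    print("training accuracy:", accuracy(preds, y))
```

The sketch reports training accuracy only; a comparison like the one in the abstract would additionally need held-out validation and the other classifiers, which are omitted here.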