• DocumentCode
    3490872
  • Title
    Predicting functional impact of single amino acid polymorphisms by integrating sequence and structural features
  • Author
    Wang, Mingjun ; Shen, Hong-Bin ; Akutsu, Tatsuya ; Song, Jiangning
  • Author_Institution
    State Eng. Lab. for Ind. Enzymes, Chinese Acad. of Sci., Tianjin, China
  • fYear
    2011
  • fDate
    2-4 Sept. 2011
  • Firstpage
    18
  • Lastpage
    26
  • Abstract
    Single amino acid polymorphisms (SAPs) are the most abundant form of known genetic variations associated with human diseases. It is of great interest to study the sequence-structure-function relationship underlying SAPs. In this work, we collected human variant data from three databases and divided them into three categories, i.e. cancer somatic mutations (CSM), Mendelian disease-related variants (SVD) and neutral polymorphisms (SVP). We built support vector machine (SVM) classifiers to predict these three classes of SAPs, using the optimal features selected by a random forest algorithm. Specifically, 280 sequence-derived and structural features were initially extracted from the curated datasets, from which 18 optimal candidate features were further selected by random forest. Furthermore, we performed a stepwise feature selection to select characteristic sequence and structural features that are important for predicting each SAP class. As a result, our predictors achieved a prediction accuracy (ACC) of 84.97, 96.93, 86.98 and 88.24%, for the three classes, CSM, SVD and SVP, respectively. Performance comparison with previously developed tools such as SIFT, SNAP and PolyPhen-2 indicates that our method performs favorably, with higher sensitivity scores and Matthews correlation coefficients (MCC). These results indicate that the prediction performance of SAP classifiers can be effectively improved by feature selection. Moreover, dividing SAPs into three categories and constructing accurate SVM-based classifiers for each class provides a practically useful way of investigating the difference between Mendelian disease-related variants and cancer somatic mutations.
  • Keywords
    biological techniques; biology computing; cancer; genetics; molecular biophysics; molecular configurations; organic compounds; support vector machines; Mendelian disease-related variant; SAPs class; SVM-based classifiers; cancer somatic mutations; genetic variations; human variant data; neutral polymorphisms; random forest algorithm; single amino acid polymorphisms; stepwise feature selection; support vector machine; Accuracy; Amino acids; Databases; Feature extraction; Humans; Proteins; Support vector machines; feature selection; non-synonymous SNPs; random forest; single amino acid polymorphisms (SAPs); support vector machine;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Systems Biology (ISB), 2011 IEEE International Conference on
  • Conference_Location
    Zhuhai
  • Print_ISBN
    978-1-4577-1661-4
  • Electronic_ISBN
    978-1-4577-1665-2
  • Type
    conf
  • DOI
    10.1109/ISB.2011.6033115
  • Filename
    6033115