DocumentCode
3490872
Title
Predicting functional impact of single amino acid polymorphisms by integrating sequence and structural features
Author
Wang, Mingjun ; Shen, Hong-Bin ; Akutsu, Tatsuya ; Song, Jiangning
Author_Institution
State Eng. Lab. for Ind. Enzymes, Chinese Acad. of Sci., Tianjin, China
fYear
2011
fDate
2-4 Sept. 2011
Firstpage
18
Lastpage
26
Abstract
Single amino acid polymorphisms (SAPs) are the most abundant form of known genetic variations associated with human diseases. It is of great interest to study the sequence-structure-function relationship underlying SAPs. In this work, we collected human variant data from three databases and divided them into three categories, i.e. cancer somatic mutations (CSM), Mendelian disease-related variants (SVD) and neutral polymorphisms (SVP). We built support vector machine (SVM) classifiers to predict these three classes of SAPs, using the optimal features selected by a random forest algorithm. In total, 280 sequence-derived and structural features were initially extracted from the curated datasets, from which 18 optimal candidate features were selected by random forest. Furthermore, we performed a stepwise feature selection to identify the characteristic sequence and structural features that are important for predicting each SAP class. As a result, our predictors achieved a prediction accuracy (ACC) of 84.97, 96.93, 86.98 and 88.24%, for the three classes, CSM, SVD and SVP, respectively. Performance comparison with previously developed tools such as SIFT, SNAP and PolyPhen-2 indicates that our method performs favorably, with higher sensitivity scores and Matthews correlation coefficients (MCC). These results indicate that the prediction performance of SAP classifiers can be effectively improved by feature selection. Moreover, dividing SAPs into three categories and constructing an accurate SVM-based classifier for each class provide a practically useful way of investigating the difference between Mendelian disease-related variants and cancer somatic mutations.
Keywords
biological techniques; biology computing; cancer; genetics; molecular biophysics; molecular configurations; organic compounds; support vector machines; Mendelian disease-related variant; SAPs class; SVM-based classifiers; cancer somatic mutations; genetic variations; human variant data; neutral polymorphisms; random forest algorithm; single amino acid polymorphisms; stepwise feature selection; support vector machine; Accuracy; Amino acids; Databases; Feature extraction; Humans; Proteins; Support vector machines; feature selection; non-synonymous SNPs; random forest; single amino acid polymorphisms (SAPs); support vector machine
fLanguage
English
Publisher
IEEE
Conference_Titel
Systems Biology (ISB), 2011 IEEE International Conference on
Conference_Location
Zhuhai
Print_ISBN
978-1-4577-1661-4
Electronic_ISBN
978-1-4577-1665-2
Type
conf
DOI
10.1109/ISB.2011.6033115
Filename
6033115