DocumentCode :
3318439
Title :
Predicting Thermophilic Nucleotide Sequences Based on Chaos Game Representation Features and Support Vector Machine
Author :
Lu, JinLong ; Hu, XueHai ; Liu, Xiaolei ; Shi, Feng
Author_Institution :
Coll. of Sci., Huazhong Agric. Univ., Wuhan, China
fYear :
2011
fDate :
10-12 May 2011
Firstpage :
1
Lastpage :
4
Abstract :
Understanding the thermophilic mechanisms of organisms whose optimum growth temperature (OGT) ranges from 50 to 80 °C plays a major role in designing stable proteins. Predicting whether a DNA sequence is thermophilic is a long-standing problem that has not been satisfactorily resolved. We downloaded 10586 thermophilic and 14261 mesophilic bacterial nucleotide sequences from the NCBI database and removed sequences at the 95% similarity threshold with CD-HIT, leaving 1638 thermophilic and 2996 mesophilic sequences. Chaos game representation (CGR) can expose patterns hidden in a DNA sequence, visually revealing previously unknown structure. In this paper, we convert every DNA sequence into a high-dimensional vector using the CGR algorithm and predict the thermostability of the sequence from these CGR features with a support vector machine (SVM). Three groups of experiments are run, using 16-dimensional, 64-dimensional and 256-dimensional vectors, respectively. Each group is evaluated by a resubstitution test and a 10-fold cross-validation test. In the resubstitution test, all three groups perform very well, with accuracy reaching 0.9989 and the Matthews correlation coefficient (MCC) reaching 0.9978. In the 10-fold cross-validation test, the 256-dimensional vector gives the best result: the average accuracy is 0.9088 and the average MCC is 0.8169. The results show the effectiveness of the new algorithm.
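The abstract does not spell out how the CGR vectors are built, so the following is only a minimal sketch under stated assumptions: the unit square is divided into a 2^k by 2^k grid (k = 2, 3, 4 gives 16, 64 and 256 cells, matching the three vector sizes), each CGR point is counted in its cell, and the counts are normalised to frequencies. The corner assignment, the function names `cgr_features` and `mcc`, and the handling of ambiguous bases are illustrative choices, not taken from the paper. The `mcc` helper computes the standard Matthews correlation coefficient from confusion-matrix counts.

```python
from math import sqrt

# Assumed corner layout of the unit square; the paper may use a different one.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_features(seq, k):
    """Return a 4**k-dimensional frequency vector from a CGR grid of side 2**k."""
    n = 2 ** k
    counts = [0] * (n * n)
    x, y = 0.5, 0.5          # chaos game starts at the centre of the square
    run = 0                  # consecutive valid bases since the last reset
    total = 0
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:   # assumption: ambiguous bases (e.g. N) restart the walk
            x, y, run = 0.5, 0.5, 0
            continue
        # Move halfway from the current point towards the base's corner.
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        run += 1
        if run >= k:         # the cell is fully determined by the last k bases
            # Clamp in case floating-point rounding pushes a coordinate to 1.0.
            col = min(int(x * n), n - 1)
            row = min(int(y * n), n - 1)
            counts[row * n + col] += 1
            total += 1
    if total == 0:
        return [0.0] * (n * n)
    return [c / total for c in counts]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    for k in (2, 3, 4):
        vec = cgr_features("ATGCGTACGTTAGCNNACGT", k)
        print(k, len(vec))
```

Vectors produced this way could be passed to any SVM implementation as fixed-length feature rows; the kernel and parameters used in the paper are not stated in the abstract.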
Keywords :
DNA; biology computing; chaos; game theory; support vector machines; DNA sequence; Matthews correlation coefficient; NCBI database; chaos game representation features; mesophilic bacteria nucleotide sequences; optimum growth temperature; support vector machine; thermophilic nucleotide sequences prediction; Accuracy; DNA; Feature extraction; Genomics; Microorganisms; Proteins; Support vector machines;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Bioinformatics and Biomedical Engineering, (iCBBE) 2011 5th International Conference on
Conference_Location :
Wuhan
ISSN :
2151-7614
Print_ISBN :
978-1-4244-5088-6
Type :
conf
DOI :
10.1109/icbbe.2011.5780070
Filename :
5780070