DocumentCode
952238
Title
Gene Classification Using Codon Usage and Support Vector Machines
Author
Ma, Jianmin ; Nguyen, Minh N. ; Rajapakse, Jagath C.
Author_Institution
BioInf. Res. Center, Nanyang Technol. Univ., Singapore
Volume
6
Issue
1
fYear
2009
Firstpage
134
Lastpage
143
Abstract
A novel approach to gene classification is proposed that uses codon usage bias as the input feature vector for classification by support vector machines (SVMs). The DNA sequence is first converted to a 59-dimensional feature vector in which each element corresponds to the relative synonymous usage frequency of a codon. Because the input to the classifier is independent of sequence length and variance, our approach is useful when the sequences to be classified are of different lengths, a condition under which homology-based methods tend to fail. The method is demonstrated on 1,841 Human Leukocyte Antigen (HLA) sequences, which are classified into two major classes, HLA-I and HLA-II; each major class is further subdivided into subgroups of HLA-I and HLA-II molecules. Using codon usage frequencies, a binary SVM achieved an accuracy of 99.3% for HLA major-class classification, and multi-class SVMs achieved accuracies of 99.73% and 98.38% for subclass classification of HLA-I and HLA-II molecules, respectively. The results show that gene classification based on codon usage bias is consistent with the molecular structures and biological functions of HLA molecules.
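The 59-dimensional feature vector described in the abstract can be illustrated with a minimal sketch. It assumes the standard genetic code, a reading frame starting at the first base, and the common RSCU definition (observed codon count divided by the mean count across that amino acid's synonymous codons). The 59 codons are the 61 sense codons minus ATG and TGG, which have no synonyms. The names `rscu_vector` and `FEATURE_CODONS`, and the choice to emit 0.0 for an amino acid that does not occur, are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group sense codons by the amino acid they encode (stop codons excluded).
FAMILIES = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES.setdefault(aa, []).append(codon)

# Keep only codons that have at least one synonym: 61 - 2 (ATG, TGG) = 59.
FEATURE_CODONS = sorted(
    c for codons in FAMILIES.values() if len(codons) > 1 for c in codons
)


def rscu_vector(seq):
    """Return the RSCU values of a coding sequence, ordered as FEATURE_CODONS."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    rscu = {}
    for codons in FAMILIES.values():
        if len(codons) < 2:
            continue
        total = sum(counts[c] for c in codons)
        for c in codons:
            # Assumption: 0.0 when the amino acid never occurs in the sequence.
            rscu[c] = counts[c] * len(codons) / total if total else 0.0
    return [rscu[c] for c in FEATURE_CODONS]
```

Because each value is a ratio within a synonymous family, the vector has the same length for any input sequence, which is what makes it usable as a fixed-size classifier input.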
Keywords
DNA; biology computing; genetics; molecular biophysics; pattern classification; support vector machines; DNA sequence; HLA-I molecules; HLA-II molecules; biological function; codon usage bias; codon usage frequencies; gene classification; human leukocyte antigen sequences; input feature vector; molecular structures; support vector machines; Cluster analysis; Human Leukocyte Antigen (HLA); Major Histocompatibility Complex (MHC); Relative Synonymous Codon Use (RSCU) frequency; codon usage bias; gene classification; Algorithms; Artificial Intelligence; Codon; Databases, Genetic; Discriminant Analysis; Genes; Genes, MHC Class I; Genes, MHC Class II; Genetic Code; HLA Antigens; Humans; Major Histocompatibility Complex; Normal Distribution; Pattern Recognition, Automated; Reproducibility of Results; Sequence Analysis, DNA;
fLanguage
English
Journal_Title
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
Publisher
IEEE
ISSN
1545-5963
Type
jour
DOI
10.1109/TCBB.2007.70240
Filename
4359889