Title :
Gene Classification using Codon Usage and SVMs
Author :
Ma, Jianmin ; Nguyen, Minh N. ; Pang, Gavyn W L ; Rajapakse, Jagath C.
Author_Institution :
BioInformatics Research Center, School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore, jmma@ntu.edu.sg
Abstract :
A novel approach to gene classification is proposed that uses the codon usage bias pattern as the feature vector for subsequent classification with Support Vector Machines (SVMs). A given DNA sequence is first converted to a 59-dimensional feature vector, each element corresponding to the relative synonymous usage frequency of a codon. The input to the classifier is therefore independent of the length of the DNA sequence. Our approach is thus useful when the genes to be classified are of different lengths, a case in which homology-based methods are inapplicable because sequences of different lengths are difficult to align. The applicability of the method is demonstrated by classifying 1841 HLA (Human Leukocyte Antigen) coding sequences selected from the IMGT/HLA database. Using the codon usage frequencies, the binary SVM achieved an accuracy of up to 99.30% in classifying human MHC (Major Histocompatibility Complex) molecules into their major classes, MHC-I and MHC-II. Using a multi-class SVM approach, accuracy rates of 99.73% and 98.38% were achieved for subclass classification within the MHC-I and MHC-II classes, respectively. The results show that the proposed method can accurately classify MHC molecules into their major classes as well as into the subclasses within each major class. Moreover, the gene classification obtained from the codon usage bias pattern is consistent with molecular structures and biological functions, further validating our approach.
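The conversion of a coding sequence to a fixed-length codon usage vector can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the standard genetic code and the usual Relative Synonymous Codon Usage (RSCU) definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The 59 dimensions are taken to be the 64 codons minus the three stop codons and the single-codon amino acids Met (ATG) and Trp (TGG). The names `rscu_vector`, `FEATURE_CODONS`, and `FAMILIES`, and the choice to return 0.0 when an amino acid does not occur in the sequence, are assumptions made for illustration.

```python
from collections import Counter

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODE = dict(zip(CODONS, AMINO))

# Drop stop codons (*) and the single-codon amino acids Met (M) and Trp (W).
FEATURE_CODONS = [c for c in CODONS if CODE[c] not in "*MW"]

# Group the remaining codons into synonymous families, keyed by amino acid.
FAMILIES = {}
for codon in FEATURE_CODONS:
    FAMILIES.setdefault(CODE[codon], []).append(codon)


def rscu_vector(seq):
    """Return the 59 RSCU values of an in-frame coding sequence.

    Assumption: if an amino acid never occurs, its codons get 0.0.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    vector = []
    for codon in FEATURE_CODONS:
        family = FAMILIES[CODE[codon]]
        total = sum(counts[c] for c in family)
        # observed count / (family total / family size)
        vector.append(counts[codon] * len(family) / total if total else 0.0)
    return vector


if __name__ == "__main__":
    short = rscu_vector("GCTGCTGCCGCA")           # four alanine codons
    longer = rscu_vector("GCTGCTGCCGCA" * 25)     # same usage, longer gene
    print(len(short), len(longer))                # both vectors have 59 entries
    print(short[FEATURE_CODONS.index("GCT")])     # GCT used twice as often as expected
```

Because each value is normalized within its synonymous family, sequences of different lengths map to vectors of the same dimension, which is what allows them to be passed to a single SVM classifier without alignment.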
Keywords :
Codon usage bias; Human Leukocyte Antigen (HLA); Major Histocompatibility Complex (MHC); Relative Synonymous Codon Usage (RSCU); Support Vector Machines (SVM); gene classification; Bioinformatics; DNA; Frequency; Genetic mutations; Humans; Proteins; Sequences; Support vector machine classification; Support vector machines; White blood cells
Conference_Title :
Computational Intelligence in Bioinformatics and Computational Biology, 2005. CIBCB '05. Proceedings of the 2005 IEEE Symposium on
Print_ISBN :
0-7803-9387-2
DOI :
10.1109/CIBCB.2005.1594951