Title :
Comprehensive autoregressive modeling for classification of genomic sequences
Author :
Akhtar, Mahmood ; Ambikairajah, Eliathamby ; Epps, Julien
Author_Institution :
New South Wales Univ., Sydney
Abstract :
In this paper, we propose the novel use of an autoregressive (AR) model to produce a multi-dimensional feature for distinguishing between protein coding and non-coding regions of genomic sequences at the nucleotide level. In contrast to previous research, in which AR models were used to estimate a single frequency, here the AR model parameters characterizing the entire short-term sequence spectrum are employed as a feature, in conjunction with Gaussian mixture model-based classification. The optimized AR-based features are then combined with other signal processing-based time-domain and frequency-domain features to improve detection accuracy for the coding/non-coding region classification problem. The system described herein is shown to produce identification accuracies of more than 78.9% and 81.6% for protein coding and non-coding nucleotides, respectively, when evaluated on the GENSCAN test set.
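As a rough illustration of the kind of feature the abstract describes, the sketch below computes AR coefficients over short sliding windows of a DNA string. It is not the authors' implementation. The binary indicator mapping of the four bases, the Yule-Walker estimator, the function names `ar_coefficients` and `ar_features`, and the window, step, and order values are all assumptions made for illustration; the abstract does not specify them. The Gaussian mixture model classification stage is not shown.

```python
import numpy as np


def ar_coefficients(x, order):
    """Estimate AR(order) coefficients with the Yule-Walker method (assumed estimator)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Biased autocorrelation estimates r[0], ..., r[order].
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz autocorrelation matrix for the Yule-Walker equations R a = r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    # Least squares keeps the solve well defined if a window is constant.
    return np.linalg.lstsq(R, r[1:], rcond=None)[0]


def ar_features(seq, window=120, step=3, order=4):
    """Return one feature vector of 4 * order AR coefficients per sliding window.

    Each base (A, C, G, T) is mapped to a binary indicator sequence; this
    mapping and the default parameter values are hypothetical choices.
    """
    seq = seq.upper()
    feats = []
    for start in range(0, len(seq) - window + 1, step):
        frame = seq[start:start + window]
        vec = []
        for base in "ACGT":
            indicator = [1.0 if ch == base else 0.0 for ch in frame]
            vec.extend(ar_coefficients(indicator, order))
        feats.append(vec)
    return np.array(feats)
```

Each row of the returned array could then serve as the input vector for a separate classifier trained on coding and non-coding examples.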
Keywords :
Gaussian processes; autoregressive processes; biology computing; pattern classification; Gaussian mixture model-based classification; comprehensive autoregressive modeling; genomic protein coding; genomic sequences classification; nucleotide level; signal processing; Bioinformatics; DNA; Discrete Fourier transforms; Feature extraction; Frequency estimation; Genomics; Multidimensional signal processing; Proteins; Sequences; Time domain analysis; Gaussian mixture models; autoregressive models; discrete cosine transforms
Conference_Title :
Information, Communications & Signal Processing, 2007 6th International Conference on
Conference_Location :
Singapore
Print_ISBN :
978-1-4244-0982-2
Electronic_ISBN :
978-1-4244-0983-9
DOI :
10.1109/ICICS.2007.4449750