DocumentCode
3383660
Title
DNA sequence compression using the normalized maximum likelihood model for discrete regression
Author
Tabus, Ioan ; Korodi, Gergely ; Rissanen, Jorma
Author_Institution
Inst. of Signal Process., Tampere Univ. of Technol., Finland
fYear
2003
fDate
25-27 March 2003
Firstpage
253
Lastpage
262
Abstract
The use of the normalized maximum likelihood (NML) model for encoding sequences known to have regularities in the form of approximate repetitions was discussed. A particular version of the NML model was presented for discrete regression, which was shown to provide a very powerful yet simple model for encoding the approximate repeats in DNA sequences. Combining the model of repeats with a simple first-order Markov model yielded a fast lossless compression method that compares favorably with existing DNA compression programs. It is remarkable that a simple model, which recursively updates a small number of parameters, is able to reach the state-of-the-art compression ratio obtained for DNA sequences with much more complex models. Being a minimum description length (MDL) model, the NML model may later prove useful in studying global and local features of DNA or possibly of other biological sequences.
Keywords
DNA; Markov processes; biology computing; data compression; encoding; maximum likelihood estimation; sequences; DNA compression programs; DNA sequence compression; MDL model; Markov model; NML model; approximate repetitions; biological sequences; compression ratio; deoxyribonucleic acids; discrete regression; fast lossless compression method; global DNA feature; local DNA features; minimum description length; normalized maximum likelihood; parameter updating; Biological information theory; Biological system modeling; Biomedical signal processing; DNA; Data compression; Dictionaries; Encoding; Entropy; History; Sequences;
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Compression Conference, 2003. Proceedings. DCC 2003
ISSN
1068-0314
Print_ISBN
0-7695-1896-6
Type
conf
DOI
10.1109/DCC.2003.1194016
Filename
1194016