• DocumentCode
    3383660
  • Title
    DNA sequence compression using the normalized maximum likelihood model for discrete regression
  • Author
    Tabus, Ioan ; Korodi, Gergely ; Rissanen, Jorma
  • Author_Institution
    Inst. of Signal Process., Tampere Univ. of Technol., Finland
  • fYear
    2003
  • fDate
    25-27 March 2003
  • Firstpage
    253
  • Lastpage
    262
  • Abstract
    The use of the normalized maximum likelihood (NML) model for encoding sequences known to have regularities in the form of approximate repetitions was discussed. A particular version of the NML model was presented for discrete regression, which was shown to provide a very powerful yet simple model for encoding the approximate repeats in DNA sequences. Combining the model of repeats with a simple first-order Markov model yielded a fast lossless compression method that compares favorably with the existing DNA compression programs. It is remarkable that a simple model, which recursively updates a small number of parameters, is able to reach the state-of-the-art compression ratio obtained for DNA sequences by much more complex models. Being a minimum description length (MDL) model, the NML model may later prove useful in studying global and local features of DNA or possibly of other biological sequences.
  • Keywords
    DNA; Markov processes; biology computing; data compression; encoding; maximum likelihood estimation; sequences; DNA compression programs; DNA sequence compression; MDL model; Markov model; NML model; approximate repetitions; biological sequences; compression ratio; deoxyribonucleic acids; discrete regression; fast lossless compression method; global DNA features; local DNA features; minimum description length; normalized maximum likelihood; parameter updating; Biological information theory; Biological system modeling; Biomedical signal processing; DNA; Data compression; Dictionaries; Encoding; Entropy; History; Sequences
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Compression Conference, 2003. Proceedings. DCC 2003
  • ISSN
    1068-0314
  • Print_ISBN
    0-7695-1896-6
  • Type
    conf
  • DOI
    10.1109/DCC.2003.1194016
  • Filename
    1194016