  • DocumentCode
    2298121
  • Title
    A Simple Statistical Algorithm for Biological Sequence Compression
  • Author
    Cao, Minh Duc; Dix, Trevor I.; Allison, Lloyd; Mears, Chris
  • Author_Institution
    Faculty of Information Technology, Monash University, Clayton, Victoria, Australia
  • fYear
    2007
  • fDate
    27-29 March 2007
  • Firstpage
    43
  • Lastpage
    52
  • Abstract
    This paper introduces a novel algorithm for biological sequence compression that makes use of both statistical properties and repetition within sequences. A panel of experts is maintained to estimate the probability distribution of the next symbol in the sequence to be encoded. Expert probabilities are combined to obtain the final distribution. The resulting information sequence provides insight for further study of the biological sequence. Each symbol is then encoded by arithmetic coding. Experiments show that our algorithm outperforms existing compressors on typical DNA and protein sequence datasets while maintaining a practical running time.
  • Keywords
    DNA; arithmetic codes; biology computing; data compression; proteins; statistical distributions; DNA sequence datasets; arithmetic coding; biological sequence compression; expert probabilities; information sequence; probability distribution; protein sequence datasets; statistical algorithm; Bioinformatics; Biological information theory; Biological system modeling; Compression algorithms; Compressors; Genomics; Organisms; Probability distribution; Sequences
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Compression Conference, 2007. DCC '07
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Print_ISBN
    0-7695-2791-4
  • Type
    conf
  • DOI
    10.1109/DCC.2007.7
  • Filename
    4148743
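
The abstract describes a panel of experts whose next-symbol predictions are combined into one distribution, from which an information sequence is derived before arithmetic coding. The following is a minimal sketch of that general idea, not the authors' implementation: the order-k context experts, the Bayesian reweighting rule, the add-one prior, and all names (`ContextExpert`, `information_sequence`, `orders`) are assumptions made for illustration. The arithmetic coder itself is omitted; the sketch only reports the cost in bits that an ideal coder would approach.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumed DNA alphabet; proteins would need 20 symbols


class ContextExpert:
    """Hypothetical order-k expert: predicts the next symbol from counts
    of what followed the previous k symbols so far."""

    def __init__(self, order):
        self.order = order
        # Add-one prior so that no symbol ever gets probability zero.
        self.counts = defaultdict(lambda: {s: 1.0 for s in ALPHABET})

    def _context(self, history):
        return history[-self.order:] if self.order else ""

    def predict(self, history):
        c = self.counts[self._context(history)]
        total = sum(c.values())
        return {s: c[s] / total for s in ALPHABET}

    def update(self, history, symbol):
        self.counts[self._context(history)][symbol] += 1.0


def information_sequence(seq, orders=(0, 1, 2)):
    """Return the per-symbol information content (bits) under the mixture."""
    experts = [ContextExpert(k) for k in orders]
    weights = [1.0 / len(experts)] * len(experts)
    max_order = max(orders)
    info = []
    for i, symbol in enumerate(seq):
        history = seq[max(0, i - max_order):i]
        preds = [e.predict(history) for e in experts]
        # Combine expert probabilities into the final distribution.
        mixed = {s: sum(w * p[s] for w, p in zip(weights, preds))
                 for s in ALPHABET}
        info.append(-math.log2(mixed[symbol]))
        # Assumed Bayesian reweighting: reward experts that predicted well.
        weights = [w * p[symbol] for w, p in zip(weights, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
        for e in experts:
            e.update(history, symbol)
    return info


if __name__ == "__main__":
    bits = information_sequence("ACGT" * 50)
    print(f"average bits per symbol: {sum(bits) / len(bits):.3f}")
```

On a highly repetitive input the per-symbol cost drops well below the 2 bits of a uniform code, and regions where the cost stays high mark parts of the sequence that the experts find hard to predict.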