Title :
Protein is incompressible
Author :
Nevill-Manning, Craig G. ; Witten, Ian H.
Author_Institution :
Rutgers Univ., Piscataway, NJ, USA
Abstract :
Life is based on two polymers, DNA and protein, whose properties can be described in a simple text file. It is natural to expect that standard text compression techniques would work on biological sequences as they do on English text. But biological sequences have a fundamentally different structure from linguistic ones, and standard compression schemes exhibit disappointing performance on them. We describe a new approach to compression that takes account of the underlying biochemical principles. This gives rise to a generalization of blending for statistical compressors where every context is used, weighted by its similarity to the current context. Results support what research in bioinformatics has shown, that there is little Markov dependency in protein. This cripples data compression schemes and reduces them to order-zero models.
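
The claim that little Markov dependency reduces compressors to order-zero models can be illustrated with a minimal sketch. It is not the authors' method: the function names `order0_entropy` and `order_k_entropy` are hypothetical, and the estimates are plain empirical entropies computed from symbol counts. If conditioning on the previous k residues barely lowers the estimate below the order-zero figure, context modelling has little to exploit. For a 20-letter amino acid alphabet the order-zero figure is at most log2(20), about 4.32 bits per symbol.

```python
import math
from collections import Counter


def order0_entropy(seq):
    """Empirical order-0 entropy of seq, in bits per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def order_k_entropy(seq, k):
    """Empirical entropy of a symbol given its k predecessors, in bits per symbol.

    Assumption: probabilities are estimated from the same sequence they are
    evaluated on, so the result is optimistic for short inputs or large k.
    """
    ctx_counts = Counter()
    pair_counts = Counter()
    for i in range(k, len(seq)):
        ctx = seq[i - k:i]
        ctx_counts[ctx] += 1
        pair_counts[(ctx, seq[i])] += 1
    n = len(seq) - k
    return -sum(
        c / n * math.log2(c / ctx_counts[ctx])
        for (ctx, _), c in pair_counts.items()
    )


if __name__ == "__main__":
    # Toy string for illustration only, not real protein data.
    toy = "MKVLAAGIVGLLLAQPSAHA" * 3
    print(order0_entropy(toy), order_k_entropy(toy, 1))
```

On real protein data the comparison is only meaningful with enough sequence to populate the contexts; with sparse counts the order-k estimate drops for reasons unrelated to genuine dependency.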
Keywords :
DNA; biology computing; data compression; polymers; proteins; statistical analysis; biochemical principles; biological sequences; blending; context weighting; order zero models; protein; similarity weighting; statistical compressors; Bioinformatics; Compressors; Computer science; Databases; Genetic mutations; Organisms; Sampling methods; Sequences
Conference_Title :
Data Compression Conference, 1999. Proceedings. DCC '99
Conference_Location :
Snowbird, UT
Print_ISBN :
0-7695-0096-X
DOI :
10.1109/DCC.1999.755675