Title :
Compression of biological sequences by greedy off-line textual substitution
Author :
Apostolico, Alberto ; Lonardi, Stefano
Author_Institution :
Dept. of Comput. Sci., Purdue Univ., West Lafayette, IN, USA
Abstract :
We follow one of the simplest possible steepest descent paradigms. It consists of repeated stages, in each of which we identify a substring of the current version of the text that yields the maximum compression, and then replace all of its occurrences except one with a pair of pointers to the untouched occurrence. This is somewhat dual to the bottom-up vocabulary buildup scheme considered by Rubin. Even this simple scheme poses some interesting algorithmic problems. In terms of performance, the method outperforms current Lempel-Ziv implementations in most cases. Here we show that, on biological sequences, it beats all other generic compression methods and approaches the performance of methods built specifically around some peculiar regularities of DNA sequences, such as tandem repeats and palindromes, that are neither distinguished nor treated selectively here. The most interesting results, however, are obtained in the compression of entire groups of genetic sequences forming families with similar characteristics. Grouping sequences in this way is becoming standard and useful in a growing number of important specialized databases. On such inputs, the approach presented here yields scores that are not only better than those of any other method, but also improve steadily as the input size grows. This is to be attributed to a certain ability to capture distant relationships among the sequences in a family.
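The staged procedure described in the abstract can be illustrated with a small sketch. This is a toy under stated assumptions, not the authors' implementation: the gain formula `(f - 1) * (length - pointer_cost)`, the `pointer_cost` of 2 symbols, the brute-force substring search, and the use of Unicode private-use characters as stand-ins for the pointer pair are all choices made here for illustration. The sketch also records each replaced substring in a separate `rules` list rather than encoding pointers to the untouched occurrence.

```python
PRIVATE_USE_START = 0xE000  # assumption: the input never contains these code points
PRIVATE_USE_END = 0xF8FF


def occurrences(text, w):
    """Count left-to-right non-overlapping occurrences of w in text."""
    count, i = 0, text.find(w)
    while i != -1:
        count += 1
        i = text.find(w, i + len(w))
    return count


def best_substring(text, pointer_cost=2):
    """Brute-force search for the substring with the largest estimated gain.

    Assumed gain model: replacing f - 1 of the f occurrences saves
    (length - pointer_cost) symbols each. Returns None if nothing gains.
    """
    best, best_gain = None, 0
    seen = set()
    for length in range(pointer_cost + 1, len(text) // 2 + 1):
        for start in range(len(text) - length + 1):
            w = text[start:start + length]
            if w in seen:
                continue
            seen.add(w)
            gain = (occurrences(text, w) - 1) * (length - pointer_cost)
            if gain > best_gain:
                best, best_gain = w, gain
    return best


def compress(text, pointer_cost=2):
    """Repeat greedy stages until no substring yields a positive gain."""
    rules = []
    while PRIVATE_USE_START + len(rules) <= PRIVATE_USE_END:
        w = best_substring(text, pointer_cost)
        if w is None:
            break
        marker = chr(PRIVATE_USE_START + len(rules))
        keep = text.find(w) + len(w)  # end of the untouched first occurrence
        text = text[:keep] + text[keep:].replace(w, marker)
        rules.append(w)
    return text, rules


def decompress(text, rules):
    """Undo the stages in reverse order, latest marker first."""
    for k in reversed(range(len(rules))):
        text = text.replace(chr(PRIVATE_USE_START + k), rules[k])
    return text


sample = "ACGTACGT" * 4 + "TTGACA"
packed, rules = compress(sample)
restored = decompress(packed, rules)
```

The search here is cubic at best and only suitable for short strings; a practical version would need an index over the text to find the best substring at each stage.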
Keywords :
DNA; biology computing; data compression; sequences; string matching; DNA sequences; biological sequence compression; genetic sequences; greedy off-line textual substitution; performance; specialized databases; steepest descent; substring; Application software; Biological information theory; Databases; Encoding; Genetics; Laboratories; Organisms; Production
Conference_Title :
Data Compression Conference, 2000. Proceedings. DCC 2000
Conference_Location :
Snowbird, UT
Print_ISBN :
0-7695-0592-9
DOI :
10.1109/DCC.2000.838154