DocumentCode :
3225552
Title :
Word-Based Statistical Compressors as Natural Language Compression Boosters
Author :
Fariña, A. ; Navarro, Gonzalo ; Paramá, José R.
Author_Institution :
Univ. of A Coruña, A Coruña
fYear :
2008
fDate :
25-27 March 2008
Firstpage :
162
Lastpage :
171
Abstract :
Semistatic word-based byte-oriented compression codes are known to be an attractive alternative for compressing natural language texts. With compression ratios around 30%, they allow direct pattern searching on the compressed text up to 8 times faster than on its uncompressed version. In this paper we reveal that these compressors have even more benefits. We show that most state-of-the-art compressors, such as the block-wise bzip2, those from the Ziv-Lempel family, and the predictive ppm-based ones, can benefit from compressing not the original text but its compressed representation obtained by a word-based byte-oriented statistical compressor. In particular, our experimental results show that using Dense-Code-based compression as a preprocessing step to classical compressors like bzip2, gzip, or ppmdi yields several important benefits. For example, the ppm family is known for achieving the best compression ratios. With Dense coding as a preprocessing step, ppmdi achieves even better compression ratios (the best we know of on natural language) and much faster compression/decompression than ppmdi alone. Text indexing also profits from our preprocessing step: a compressed self-index achieves much better space and time performance when preceded by a semistatic word-based compression step. We show, for example, that the AF-FMindex coupled with Tagged Huffman coding is an attractive alternative index for natural language texts.
Keywords :
data compression; indexing; natural language processing; text analysis; word processing; Dense coding preprocessing; Dense-code-based compression; compression ratio; direct pattern searching; natural language compression booster; natural language text; semistatic word-based byte-oriented compression code; semistatic word-based compression; tagged Huffman coding; text indexing; word-based byte-oriented statistical compressor; word-based statistical compressor; Compressors; Computer science; Data compression; Databases; Frequency; Huffman coding; Indexing; Natural languages; Predictive models; Statistics; Text compression; compression boosting; indexing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Compression Conference, 2008. DCC 2008
Conference_Location :
Snowbird, UT
ISSN :
1068-0314
Print_ISBN :
978-0-7695-3121-2
Type :
conf
DOI :
10.1109/DCC.2008.14
Filename :
4483294