Language-independent word-based text compression with fast decompression

Author

Grabowski, Szymon ; Swacha, Jakub

Author_Institution

Comput. Eng. Dept., Tech. Univ. of Lodz, Lodz, Poland

fYear

2010

fDate

20-23 April 2010

Firstpage

158

Lastpage

162

Abstract

A classic idea for improving text compression is to replace words with references to a text dictionary, which is either external or stored together with the archive. We advocate the second option, since even for a single language (e.g., English) it is practically impossible to have one dictionary that fits the different sorts of modern texts well. There are basically two problems to solve: how to assign codewords to individual words from the parsed text, and how to represent the dictionary compactly. The resulting data are the input for a backend compressor. Since in many scenarios texts are decompressed (read) more often than compressed (written), we focus on LZ77 backend compression algorithms, in particular Deflate, used in the zip/gzip standards, whose well-known asset is very fast decompression.
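
The general pipeline described in the abstract (parse text into words, replace words with codewords, store the dictionary with the archive, pass the result to a Deflate backend) can be illustrated with a minimal sketch. This is not the authors' scheme: the abstract does not specify their codeword assignment or dictionary representation, so everything below is an assumption made for illustration. In particular, the one-or-two-byte codewords, the NUL-separated dictionary, the 4-byte length header, and the function names `encode_index`, `compress`, and `decompress` are all hypothetical. Python's `zlib` module is used as the Deflate backend, and the input is assumed to contain no NUL characters.

```python
import re
import zlib
from collections import Counter

# Split text into alternating word and non-word runs. No language-specific
# rules are used; every character lands in exactly one token.
TOKEN_RE = re.compile(r"\w+|\W+")


def encode_index(i):
    """Hypothetical byte codeword: 1 byte for ranks 0..127, else 2 bytes."""
    if i < 128:
        return bytes([i])
    i -= 128
    if i >= 128 * 256:
        raise ValueError("dictionary too large for this sketch")
    # High bit of the first byte marks a two-byte codeword.
    return bytes([0x80 | (i >> 8), i & 0xFF])


def compress(text):
    if "\0" in text:
        raise ValueError("this sketch assumes the text contains no NUL")
    tokens = TOKEN_RE.findall(text)
    # More frequent tokens get lower ranks and therefore shorter codewords.
    vocab = [tok for tok, _ in Counter(tokens).most_common()]
    index = {tok: i for i, tok in enumerate(vocab)}
    body = b"".join(encode_index(index[t]) for t in tokens)
    # The dictionary travels with the archive (assumed NUL-separated layout).
    dictionary = "\0".join(vocab).encode("utf-8")
    header = len(dictionary).to_bytes(4, "big")
    # zlib applies Deflate (LZ77 + Huffman coding) as the backend compressor.
    return zlib.compress(header + dictionary + body, 9)


def decompress(blob):
    raw = zlib.decompress(blob)
    dict_len = int.from_bytes(raw[:4], "big")
    vocab = raw[4:4 + dict_len].decode("utf-8").split("\0")
    body = raw[4 + dict_len:]
    out = []
    pos = 0
    while pos < len(body):
        first = body[pos]
        if first < 0x80:
            out.append(vocab[first])
            pos += 1
        else:
            rank = 128 + (((first & 0x7F) << 8) | body[pos + 1])
            out.append(vocab[rank])
            pos += 2
    return "".join(out)


if __name__ == "__main__":
    sample = "the cat and the dog and the bird. " * 100
    packed = compress(sample)
    assert decompress(packed) == sample
    print(len(sample.encode("utf-8")), "bytes ->", len(packed), "bytes")
```

Decoding here is a single pass of table lookups after Deflate decompression, which is in line with the abstract's emphasis on fast decompression. Whether such preprocessing actually improves the compression ratio over plain Deflate depends on the input and on the codeword and dictionary design, which this sketch does not attempt to optimize.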

Keywords

data compression; text analysis; word processing; Deflate; LZ77 backend compression algorithms; codewords; fast decompression; language independent word based text compression; parsed text; text dictionary; zip-gzip standards; Cascading style sheets; Compression algorithms; DNA; Dictionaries; HTML; Natural languages; Postal services; Protein sequence; Spatial databases; XML; byte codes; dictionary compression; text compression

fLanguage

English

Publisher

IEEE

Conference_Titel

Perspective Technologies and Methods in MEMS Design (MEMSTECH), 2010 Proceedings of VIth International Conference on

Conference_Location

Lviv

Print_ISBN

978-1-4244-7325-0

Electronic_ISBN

978-966-2191-11-0

Type

conf

Filename

5499297