DocumentCode
3618016
Title
Two-level dictionary based compression
Author
P. Skibinski
Author_Institution
Inst. of Comput. Sci., Wroclaw Univ., Poland
fYear
2005
fDate
2005
Firstpage
481
Abstract
Summary form only given. The basic idea of preprocessing is to transform text into an intermediate form that any existing general-purpose compressor can take as input and compress more efficiently. Dictionary-based preprocessing replaces whole words with shorter codes. We present a dictionary-based preprocessing technique and its implementation, TWRT (Two-level Word Replacing Transformation). The preprocessor classifies input files by kind and uses several dictionaries. The first-level (small) dictionaries are specific to a kind of data (e.g., a programming language, references). The second-level (large) dictionaries are specific to a natural language (e.g., English, Russian, French). On the Calgary corpus, TWRT improves the compression performance of bzip2 by over 7% and of PPMonstr by about 6% on average. Even for PAQ6, currently the top compressor, the gain is a significant 5%. On multilingual text files, TWRT improves the compression performance of bzip2, PPMonstr, and PAQ6 by about 8%. Moreover, TWRT improves compression speed with PAQ6 and, on larger files, with PPMonstr.
Keywords
"Dictionaries","Computer science","Data preprocessing","Computer languages","Natural languages","Filters","Data compression"
Publisher
IEEE
Conference_Titel
Data Compression Conference, 2005. Proceedings. DCC 2005
ISSN
1068-0314
Print_ISBN
0-7695-2309-9
Type
conf
DOI
10.1109/DCC.2005.91
Filename
1402238
Link To Document