Regional Information Center for Science and Technology

Abstract:

Summary form only given. The basic idea of preprocessing is to transform the text into an intermediate form that can be fed to any existing general-purpose compressor and compressed more efficiently. Dictionary-based preprocessing replaces whole words with shorter codes. We present a dictionary-based preprocessing technique and its implementation, called TWRT (Two-level Word Replacing Transformation). Our preprocessor uses several dictionaries and classifies files into various kinds. The first-level dictionaries (small dictionaries) are specific to a kind of data (e.g., programming language, references). The second-level dictionaries (large dictionaries) are specific to natural languages (e.g., English, Russian, French). On the Calgary corpus, TWRT improves the compression performance of bzip2 by over 7% and of PPMonstr by about 6% on average. Even for PAQ6, currently the top compressor, the gain is a significant 5%. On multilingual text files, TWRT improves the compression performance of bzip2, PPMonstr, and PAQ6 by about 8%. Moreover, TWRT improves compression speed with PAQ6 and, on larger files, with PPMonstr.
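
To make the word-replacement idea concrete, the following Python sketch shows one way a two-level dictionary transform could sit in front of a general-purpose compressor. It is an illustration only, not the TWRT implementation: the toy dictionaries, the byte-code layout (one byte for small-dictionary words, two bytes for large-dictionary words), the 7-bit ASCII restriction, and the names `build_codes`, `encode`, and `decode` are all assumptions made for this example.

```python
import bz2
import re

# Hypothetical toy dictionaries; real ones would be far larger.
SMALL_DICT = ["int", "return", "void", "while"]        # level 1: specific to a kind of data
LARGE_DICT = ["the", "and", "compression", "dictionary", "words"]  # level 2: natural language

WORD = re.compile(rb"[A-Za-z]+")  # maximal runs of ASCII letters


def build_codes(small, large):
    """Map each dictionary word to a short code made of bytes >= 0x80.

    Assumed layout: small-dictionary words get one byte (0x80..0xBF),
    large-dictionary words get two bytes (0xC0..0xFF, then 0x80..0xBF).
    """
    if len(small) > 64 or len(large) > 4096:
        raise ValueError("dictionary too large for this code layout")
    codes = {}
    for i, word in enumerate(large):
        codes[word.encode()] = bytes([0xC0 | (i >> 6), 0x80 | (i & 0x3F)])
    for i, word in enumerate(small):  # level 1 takes priority over level 2
        codes[word.encode()] = bytes([0x80 | i])
    return codes


def encode(text: bytes, codes) -> bytes:
    """Replace every dictionary word with its code; leave other bytes as-is."""
    if any(b >= 0x80 for b in text):
        raise ValueError("this sketch handles 7-bit ASCII input only")
    return WORD.sub(lambda m: codes.get(m.group(), m.group()), text)


def decode(data: bytes, small, large) -> bytes:
    """Invert encode() by expanding each code back into its word."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:            # plain ASCII byte, copied through
            out.append(b)
            i += 1
        elif b < 0xC0:          # one-byte code: small dictionary
            out += small[b & 0x3F].encode()
            i += 1
        else:                   # two-byte code: large dictionary
            index = ((b & 0x3F) << 6) | (data[i + 1] & 0x3F)
            out += large[index].encode()
            i += 2
    return bytes(out)


codes = build_codes(SMALL_DICT, LARGE_DICT)
sample = b"while the words and the dictionary return"
transformed = encode(sample, codes)
compressed = bz2.compress(transformed)  # any general-purpose compressor could go here
restored = decode(bz2.decompress(compressed), SMALL_DICT, LARGE_DICT)
```

Reserving bytes with the high bit set for codes keeps the transform reversible without escape sequences, at the cost of rejecting non-ASCII input. On an input this small the compressor's fixed overhead dominates, so the sketch demonstrates the round trip rather than any compression gain.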