• DocumentCode
    3618016
  • Title

    Two-level dictionary based compression

  • Author

    P. Skibinski

  • Author_Institution
    Inst. of Comput. Sci., Wroclaw Univ., Poland
  • fYear
    2005
  • fDate
    2005
  • Firstpage
    481
  • Abstract
    Summary form only given. The basic idea of preprocessing is to transform the text into an intermediate form that any existing general-purpose compressor can take as input and compress more efficiently. Dictionary-based preprocessing replaces whole words with shorter codes. We present a dictionary-based preprocessing technique and its implementation, TWRT (two-level word replacing transformation). Our preprocessor uses several dictionaries and classifies files into various kinds. The first-level dictionaries (small dictionaries) are specific to a kind of data (e.g., programming language, references). The second-level dictionaries (large dictionaries) are specific to natural languages (e.g., English, Russian, French). On the Calgary corpus, TWRT improves the compression performance of bzip2 by over 7% and of PPMonstr by about 6% on average. Even for PAQ6, the top compressor at the time of writing, the gain is a significant 5%. On multilingual text files, TWRT improves the compression performance of bzip2, PPMonstr, and PAQ6 by about 8%. Moreover, TWRT improves compression speed with PAQ6 and, on larger files, with PPMonstr.
  • Keywords
    "Dictionaries","Computer science","Data preprocessing","Computer languages","Natural languages","Filters","Data compression"
  • Publisher
    IEEE
  • Conference_Titel
    Data Compression Conference, 2005. Proceedings. DCC 2005
  • ISSN
    1068-0314
  • Print_ISBN
    0-7695-2309-9
  • Type

    conf

  • DOI
    10.1109/DCC.2005.91
  • Filename
    1402238
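
The word-replacing idea summarized in the abstract can be illustrated with a minimal sketch. It is not TWRT's actual code format, which the abstract does not specify. It assumes that each dictionary word maps to a single Unicode private-use code point, that first-level (small) entries take priority over second-level (large) entries, and that the input contains no private-use characters. All names here (`build_tables`, `transform`, `restore`, the base constants) are hypothetical.

```python
import re

# Assumed code layout for this sketch only (not taken from TWRT):
SMALL_BASE = 0xE000  # codes for first-level (small, data-kind-specific) words
LARGE_BASE = 0xE800  # codes for second-level (large, natural-language) words
LEVEL_SIZE = 0x800   # maximum number of entries per level

# Split text into runs of ASCII letters (candidate words) and everything else.
TOKEN = re.compile(r"[A-Za-z]+|[^A-Za-z]+")


def build_tables(small_words, large_words):
    """Return (encode, decode) maps; small-dictionary entries win on overlap."""
    if len(small_words) > LEVEL_SIZE or len(large_words) > LEVEL_SIZE:
        raise ValueError("dictionary too large for this sketch")
    encode = {}
    # Insert large-dictionary words first so small-dictionary words override them.
    for i, word in enumerate(large_words):
        encode[word] = chr(LARGE_BASE + i)
    for i, word in enumerate(small_words):
        encode[word] = chr(SMALL_BASE + i)
    decode = {code: word for word, code in encode.items()}
    return encode, decode


def transform(text, encode):
    """Replace every dictionary word with its one-character code."""
    if any(SMALL_BASE <= ord(c) < LARGE_BASE + LEVEL_SIZE for c in text):
        raise ValueError("input already contains reserved code points")
    return "".join(encode.get(tok, tok) for tok in TOKEN.findall(text))


def restore(text, decode):
    """Invert transform(): expand each code back into its word."""
    return "".join(decode.get(c, c) for c in text)


if __name__ == "__main__":
    encode, decode = build_tables(["return", "while"], ["the", "compression"])
    original = "while the compression runs, return the result"
    coded = transform(original, encode)
    assert restore(coded, decode) == original
    print(len(original), "->", len(coded), "characters")
```

The transformed text would then be handed to a general-purpose compressor (for example, `bz2.compress(coded.encode("utf-8"))` from the Python standard library). Whether this yields a gain depends on the dictionaries and the data, and this sketch makes no claim about reproducing the figures reported in the abstract.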