• DocumentCode
    3384588
  • Title

    A dictionary-based multi-corpora text compression system

  • Author

    Sun, Weifeng ; Zhang, Nan ; Mukherjee, Amar

  • Author_Institution
    Dept. of Comput. Sci., Central Florida Univ., Orlando, FL, USA
  • fYear
    2003
  • fDate
    25-27 March 2003
  • Firstpage
    448
  • Abstract
    Summary form only given. StarZip, a multi-corpora text compression system, was introduced together with its transform engine StarNT. A key feature of the StarZip compression system is its support for domain-specific dictionaries, together with tools for developing such dictionaries. StarNT was utilized because it achieves a better compression ratio than almost all the other recent efforts based on BWT and PPM. StarNT is a fast, dictionary-based lossless text transform. The main idea is to represent each English word with a codeword of no more than three symbols. This transform maintains most of the original context information at the word level and provides an "artificial" strong context. It ultimately reduces the size of the transformed text that, in turn, is provided to a backend compressor. This data structure provides a very fast transform encoding with a low storage overhead. StarNT also treats the transformed codewords as an offset of words in the transform dictionary. The time complexity for searching a word in the dictionary is achieved in the transform decoder. Experimental results have shown that the average compression time has improved by orders of magnitude compared with LIPT, a previous dictionary-based transform. The complexity and compression performance of bzip2, in conjunction with this transform, are better than those of both gzip and PPMD. Results from five corpora have shown that StarZip achieved an average improvement in compression performance (in terms of BPC) of 13% over bzip2-9, 19% over gzip-9, and 10% over PPMD.
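    The two ideas named in the abstract, codewords of at most three symbols and codewords that act as offsets into the transform dictionary, can be illustrated with a minimal sketch. It is not the StarNT implementation: the 52-letter code alphabet, the `*` escape for out-of-dictionary words, the space-only tokenization, and all function names are assumptions made for illustration.

```python
import string

# Assumed code alphabet (not taken from the paper): 52 ASCII letters.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)
# Number of dictionary entries addressable with codewords of length 1 to 3.
MAX_WORDS = BASE + BASE**2 + BASE**3
ESCAPE = "*"  # hypothetical marker for words that are not in the dictionary


def index_to_codeword(index):
    """Map a dictionary offset to a codeword of one to three symbols."""
    if not 0 <= index < MAX_WORDS:
        raise ValueError("offset not representable in three symbols")
    length, block = 1, BASE
    while index >= block:          # skip past all shorter codewords
        index -= block
        length += 1
        block *= BASE
    symbols = []
    for _ in range(length):
        index, digit = divmod(index, BASE)
        symbols.append(ALPHABET[digit])
    return "".join(reversed(symbols))


def codeword_to_index(codeword):
    """Recover the dictionary offset directly from the codeword."""
    if not 1 <= len(codeword) <= 3:
        raise ValueError("codeword must have one to three symbols")
    value = 0
    for ch in codeword:
        value = value * BASE + ALPHABET.index(ch)
    # Offsets 0..BASE-1 use one symbol, the next BASE**2 use two, and so on.
    return sum(BASE**k for k in range(1, len(codeword))) + value


def encode(text, dictionary):
    """Replace known words by codewords; escape everything else."""
    offsets = {word: i for i, word in enumerate(dictionary)}
    out = []
    for token in text.split(" "):
        if token in offsets:
            out.append(index_to_codeword(offsets[token]))
        else:
            out.append(ESCAPE + token)
    return " ".join(out)


def decode(text, dictionary):
    """Invert encode(): codewords index straight into the dictionary list."""
    out = []
    for token in text.split(" "):
        if token.startswith(ESCAPE):
            out.append(token[len(ESCAPE):])
        else:
            out.append(dictionary[codeword_to_index(token)])
    return " ".join(out)
```

    In this sketch the decoder never searches the dictionary: it converts the codeword to an integer and indexes a list. Placing frequent words at small offsets would give them the shortest codewords, which is one plausible way to order such a dictionary, though the abstract does not state how StarNT orders its entries.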
  • Keywords
    data compression; data structures; dictionaries; text analysis; transform coding; BWT; English word representation; LIPT transform; PPM; PPMD; StarNT transform engine; StarZip text compression system; artificial context; backend compressor; bzip2; codewords; compression ratio; compression time; context information; dictionary-based multi-corpora text compression system; gzip; lossless text transform; storage overhead; time complexity; transform decoder; transform dictionary; transform encoding; word offset; Computer science; Data compression; Data structures; Decoding; Dictionaries; Encoding; Engines; Frequency; Sun;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Compression Conference, 2003. Proceedings. DCC 2003
  • ISSN
    1068-0314
  • Print_ISBN
    0-7695-1896-6
  • Type

    conf

  • DOI
    10.1109/DCC.2003.1194067
  • Filename
    1194067