  • DocumentCode
    3430242
  • Title
    A corpus for the evaluation of lossless compression algorithms
  • Author
    Arnold, Ross; Bell, Tim
  • Author_Institution
    Dept. of Comput. Sci., Canterbury Univ., Christchurch, New Zealand
  • fYear
    1997
  • fDate
    25-27 Mar 1997
  • Firstpage
    201
  • Lastpage
    210
  • Abstract
    A number of authors have used the Calgary corpus of texts to provide empirical results for lossless compression algorithms. This corpus was collected in 1987, although it was not published until 1990. Advances in compression algorithms have been achieving relatively small improvements in compression, as measured on the Calgary corpus. There is a concern that algorithms are being fine-tuned to this corpus, and that small improvements measured in this way may not apply to other files. Furthermore, the corpus is almost ten years old, and over this period there have been changes in the kinds of files that are compressed, particularly with the development of the Internet and the rapid growth of high-capacity secondary storage for personal computers. We explore the issues raised above, and develop a principled technique for collecting a corpus of test data for compression methods. A corpus, called the Canterbury corpus, is developed using this technique, and we report the performance of a collection of compression methods using the new corpus.
  • Keywords
    data compression; decoding; digital storage; encoding; Calgary corpus; Canterbury corpus; Internet; compression methods performance; compression methods testing; high capacity secondary storage; lossless compression algorithms; personal computers; Compression algorithms; Computer science; Convergence; Entropy; Microcomputers; Testing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Compression Conference, 1997. DCC '97. Proceedings
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Print_ISBN
    0-8186-7761-9
  • Type
    conf
  • DOI
    10.1109/DCC.1997.582019
  • Filename
    582019