• DocumentCode
    3028848
  • Title

    Phrase hierarchy inference and compression in bounded space

  • Author

    Manning, Graig G Nelvill ; Witten, Ian H.

  • Author_Institution
    Dept. of Biochem., Stanford Univ., CA, USA
  • fYear
    1998
  • fDate
    30 Mar-1 Apr 1998
  • Firstpage
    179
  • Lastpage
    188
  • Abstract
    Text compression by inferring a phrase hierarchy from the input is a technique that shows promise as a compression scheme and as a machine learning method that extracts some comprehensible account of the structure of the input text. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences have been put to such eclectic uses as phrase browsing in digital libraries, music analysis, and inferring rules for fractal images. We focus attention on the memory requirements of the method. Since the algorithm operates in linear time, the space it consumes is at most linear with input size. The space consumed does in fact grow linearly with the size of the inferred hierarchy, and this makes operation on very large files infeasible. We describe two elegant ways of curtailing the space complexity of hierarchy inference, one of which yields a bounded space algorithm. We begin with a review of the hierarchy inference procedure that is embodied in the SEQUITUR program. Then we consider its performance on quite large files, and show how the compression performance improves as the file size increases
  • Keywords
    data compression; grammars; inference mechanisms; word processing; SEQUITUR program; bounded space algorithm; compression performance; data compression; dictionary schemes; digital libraries; file size; fractal images; input text structure; large files; linear time algorithm; machine learning method; memory requirements; music analysis; phrase browsing; phrase hierarchy inference; rules; space complexity; text compression; Data compression; Data mining; Dictionaries; Fractals; Image analysis; Image sequence analysis; Inference algorithms; Learning systems; Performance analysis; Software libraries;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Compression Conference, 1998. DCC '98. Proceedings
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Print_ISBN
    0-8186-8406-2
  • Type

    conf

  • DOI
    10.1109/DCC.1998.672146
  • Filename
    672146