• DocumentCode
    2399602
  • Title

    Modeling word occurrences for the compression of concordances

  • Author

    Bookstein, A. ; Klein, S.T. ; Raita, T. ; Lankinen, P. ; Sze, C.W.

  • Author_Institution
    Center for Inf. & Language Studies, Chicago Univ., IL, USA
  • fYear
    1995
  • fDate
    28-30 Mar 1995
  • Firstpage
    462
  • Abstract
    Summary form only given. Effective compression of a text-based information retrieval system involves compressing not only the text itself, but also the concordance by which one accesses that text and which occupies an amount of storage comparable to the text itself. The concordance can be a rather complicated data structure, especially if it permits hierarchical access to the database. But one or more components of the hierarchy can usually be conceptualized as a bit-map. We conceptualize our bit-map as being generated as follows. At any bit-map site we are in one of two states: a cluster state (C), or a between-cluster state (B). In a given state, we generate a bit-map value of zero or one and, governed by the transition probabilities of the model, enter a new state as we move to the next bit-map site. Such a model has been referred to as a hidden Markov model in the literature. Unfortunately, this model is analytically difficult to use. To approximate it, we introduce several traditional Markov models with four states each: B and C as above, and two transitional states. We present the models, show how they are connected, and state the formal compression algorithm based on these models. We also include some experimental results.
  • Keywords
    data compression; hidden Markov models; information retrieval; information retrieval systems; probability; word processing; between-cluster state; bit-map; bitmap; cluster state; concordances; data structure; database; experimental results; formal compression algorithm; hidden Markov model; hierarchical access; text-based information retrieval system; transition probabilities; transitional states; word occurrences modelling; Compression algorithms; Computer science; Data structures; Databases; Hidden Markov models; Information retrieval
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Compression Conference, 1995. DCC '95. Proceedings
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Print_ISBN
    0-8186-7012-6
  • Type

    conf

  • DOI
    10.1109/DCC.1995.515572
  • Filename
    515572
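
The generative process described in the abstract (two hidden states, C and B, each emitting a zero or one and then moving to a new state according to transition probabilities) can be illustrated with a minimal Python sketch. This is not the authors' algorithm and does not cover the four-state approximations or the compression step. The function name `generate_bitmap`, the parameters `p_one` and `p_stay`, the starting state, and every numeric value are assumptions chosen for illustration only.

```python
import random


def generate_bitmap(n, p_one, p_stay, seed=0):
    """Generate n bit-map values from a two-state hidden Markov model.

    p_one[s]  -- probability of emitting a 1 while in state s ('C' or 'B')
    p_stay[s] -- probability of remaining in state s at the next site
    Both dictionaries hold illustrative values, not figures from the paper.
    """
    rng = random.Random(seed)
    state = 'B'  # assumption: start between clusters
    bits, states = [], []
    for _ in range(n):
        states.append(state)
        # Emit a bit according to the current hidden state.
        bits.append(1 if rng.random() < p_one[state] else 0)
        # Move to the other state with probability 1 - p_stay[state].
        if rng.random() >= p_stay[state]:
            state = 'C' if state == 'B' else 'B'
    return bits, states


if __name__ == "__main__":
    # Hypothetical parameters: ones are dense inside clusters, rare outside.
    bits, states = generate_bitmap(
        60,
        p_one={'C': 0.8, 'B': 0.05},
        p_stay={'C': 0.9, 'B': 0.95},
    )
    print("".join(map(str, bits)))
    print("".join(states))
```

With parameters of this kind, the ones in the output tend to appear in runs that line up with the C states, which is the clustering behaviour the model is meant to capture.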