• DocumentCode
    3061515
  • Title

    Lexical attraction for text compression

  • Author

    Bach, Joscha ; Witten, Ian H.

  • Author_Institution
    Dept. of Comput. Sci., Humboldt-Univ., Berlin, Germany
  • fYear
    1999
  • fDate
    29-31 Mar 1999
  • Firstpage
    516
  • Abstract
    [Summary form only given]. The best methods of text compression work by conditioning each symbol's probability on its predecessors. Prior symbols establish a context that governs the probability distribution for the next one, and the actual next symbol is encoded with respect to this distribution. However, the best predictors for words in natural language are not necessarily their immediate predecessors. Verbs may depend on nouns, pronouns on names, closing brackets on opening ones, question marks on “wh”-words. To establish a more appropriate dependency structure, the lexical attraction of a pair of words is defined as the likelihood that they will appear (in that order) within a sentence, regardless of how far apart they are. This is estimated by counting the co-occurrences of words in the sentences of a large corpus. Then, for each sentence, an undirected (planar, acyclic) graph is found that maximizes the lexical attraction between linked items, effectively reorganizing the text in the form of a low-entropy model. We encode a series of linked sentences and transmit them in the same manner as order-1 word-level PPM. To prime the lexical attraction linker, the whole document is processed to acquire the co-occurrence counts, and again to re-link the sentences. Pairs that occur twice or less are excluded from the statistics, which significantly reduces the size of the model. The encoding stage utilizes an adaptive PPM-style method. Encouraging results have been obtained using this method.
  • Keywords
    data compression; graph theory; natural languages; probability; text analysis; acyclic graph; adaptive PPM-style method; co-occurrence counts; encoding stage; large corpus; lexical attraction; linked sentences; low-entropy model; natural language; order-1 word-level PPM; planar graph; prior symbols; probability distribution; sentences; text compression; undirected graph; Adaptive coding; Computer science; Costs; Couplings; Decoding; Encoding; Mutual information; Natural languages; Probability distribution; Statistical distributions
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Compression Conference, 1999. Proceedings. DCC '99
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Print_ISBN
    0-7695-0096-X
  • Type

    conf

  • DOI
    10.1109/DCC.1999.785673
  • Filename
    785673
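
The co-occurrence counting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`count_ordered_pairs`, `prune_rare_pairs`), whitespace tokenization, and counting every ordered occurrence of a pair within a sentence are all assumptions made for illustration. The only details taken from the abstract are that pairs are ordered, that distance within the sentence is ignored, and that pairs occurring twice or less are dropped.

```python
from collections import Counter
from itertools import combinations


def count_ordered_pairs(sentences):
    """Count ordered word pairs that co-occur within a sentence.

    Hypothetical helper: tokenization by whitespace is an assumption.
    """
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        # combinations() keeps positional order, so each pair is
        # (earlier word, later word), at any distance in the sentence.
        counts.update(combinations(words, 2))
    return counts


def prune_rare_pairs(counts, min_count=3):
    """Drop pairs seen twice or less (i.e. keep counts >= 3)."""
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Toy corpus, invented for this example only.
sentences = [
    "the dog barks",
    "the dog sleeps",
    "the dog eats",
    "a cat sleeps",
]
pair_counts = count_ordered_pairs(sentences)
kept = prune_rare_pairs(pair_counts)
print(kept)
```

In this toy corpus only the pair ("the", "dog") survives pruning, which shows how the threshold shrinks the model. Turning the surviving counts into attraction scores and finding the planar, acyclic link graph per sentence are separate steps that this sketch does not attempt.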