Lexical attraction for text compression

Author

Bach, Joscha ; Witten, Ian H.

Author_Institution

Dept. of Comput. Sci., Humboldt-Univ., Berlin, Germany

fYear

1999

fDate

29-31 Mar 1999

Firstpage

516

Abstract

[Summary form only given]. The best methods of text compression work by conditioning each symbol's probability on its predecessors. Prior symbols establish a context that governs the probability distribution for the next one, and the actual next symbol is encoded with respect to this distribution. However, the best predictors for words in natural language are not necessarily their immediate predecessors. Verbs may depend on nouns, pronouns on names, closing brackets on opening ones, question marks on “wh”-words. To establish a more appropriate dependency structure, the lexical attraction of a pair of words is defined as the likelihood that they will appear (in that order) within a sentence, regardless of how far apart they are. This is estimated by counting the co-occurrences of words in the sentences of a large corpus. Then, for each sentence, an undirected (planar, acyclic) graph is found that maximizes the lexical attraction between linked items, effectively reorganizing the text in the form of a low-entropy model. We encode a series of linked sentences and transmit them in the same manner as order-1 word-level PPM. To prime the lexical attraction linker, the whole document is processed once to acquire the co-occurrence counts, and again to re-link the sentences. Pairs that occur twice or less are excluded from the statistics, which significantly reduces the size of the model. The encoding stage utilizes an adaptive PPM-style method. Encouraging results have been obtained using this method.
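
The counting and linking steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (count_pairs, make_scorer, link_sentence) are hypothetical, the score is a pointwise-mutual-information-style stand-in for lexical attraction, and the linker is a simple greedy heuristic that keeps the graph acyclic and free of crossing links rather than guaranteeing a maximum. The PPM-style encoding stage is not shown.

```python
import math
from collections import Counter
from itertools import combinations


def count_pairs(sentences, min_count=3):
    """Count ordered within-sentence word pairs.

    Pairs seen fewer than min_count times are dropped, mirroring the
    abstract's "twice or less are excluded" rule.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for words in sentences:
        word_counts.update(words)
        # combinations() preserves order, so (a, b) means a precedes b.
        pair_counts.update(combinations(words, 2))
    kept = {p: c for p, c in pair_counts.items() if c >= min_count}
    return word_counts, kept


def make_scorer(word_counts, pair_counts):
    """Return a PMI-style score function (an assumed stand-in for attraction)."""
    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    def score(pair):
        c = pair_counts.get(pair, 0)
        if c == 0:
            return 0.0  # unseen or pruned pairs carry no attraction
        a, b = pair
        p_pair = c / total_pairs
        p_a = word_counts[a] / total_words
        p_b = word_counts[b] / total_words
        return math.log(p_pair / (p_a * p_b))

    return score


def crosses(link_a, link_b):
    """True if two links (i, j) with i < j cross when drawn above the sentence."""
    (i, j), (k, l) = link_a, link_b
    return i < k < j < l or k < i < l < j


def link_sentence(words, score):
    """Greedily link word positions: highest score first, no cycles, no crossings."""
    candidates = []
    for i, j in combinations(range(len(words)), 2):
        s = score((words[i], words[j]))
        if s > 0:
            candidates.append((s, i, j))
    candidates.sort(key=lambda t: (-t[0], t[1], t[2]))

    parent = list(range(len(words)))  # union-find forest for cycle detection

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    links = []
    for _, i, j in candidates:
        if find(i) == find(j):
            continue  # would close a cycle
        if any(crosses((i, j), other) for other in links):
            continue  # would break planarity
        parent[find(i)] = find(j)
        links.append((i, j))
    return sorted(links)
```

In this sketch a document would be passed through count_pairs once to gather statistics, and each sentence would then be passed through link_sentence, matching the two passes mentioned in the abstract.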

Keywords

data compression; graph theory; natural languages; probability; text analysis; acyclic graph; adaptive PPM-style method; co-occurrence counts; encoding stage; large corpus; lexical attraction; linked sentences; low-entropy model; natural language; order-1 word-level PPM; planar graph; prior symbols; probability distribution; sentences; text compression; undirected graph; Adaptive coding; Computer science; Costs; Couplings; Decoding; Encoding; Mutual information; Natural languages; Probability distribution; Statistical distributions

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Compression Conference, 1999. Proceedings. DCC '99

Conference_Location

Snowbird, UT

ISSN

1068-0314

Print_ISBN

0-7695-0096-X

Type

conf

DOI

10.1109/DCC.1999.785673

Filename

785673