DocumentCode :
3025140
Title :
Fast convergence with a greedy tag-phrase dictionary
Author :
Smith, Tony C. ; Peeters, Ross
Author_Institution :
Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand
fYear :
1998
fDate :
30 Mar-1 Apr 1998
Firstpage :
33
Lastpage :
42
Abstract :
Lexical categories have been shown to assist in giving good compression results when incorporated into context models. This paper describes a greedy dictionary-based model that maintains a dictionary of tag-phrases, along with separate lexicons for each unique tag. The text is tagged with part-of-speech (POS) labels and then given to the encoder, which uses the tags to construct the phrase dictionary in a manner similar to LZ78. The output is a sequence of arithmetically encoded phrase numbers, coupled with the information needed to match the correct word with each tag in the phrase. Each unique word (defined as each novel word/tag pair) is transmitted once when it is first encountered, then retained in the appropriate dictionary and thereafter arithmetically encoded according to the empirical distribution for that dictionary whenever the word is encountered. We present results from some empirical tests showing that this “tag-phrase dictionary” technique achieves compression nearly identical to that obtainable using PPM, an explicit-context model. This goes against the widely held view that greedy dictionary schemes require much larger samples of text before they can compete with statistical context methods. Some interesting theoretical issues pertaining to text compression in general are implied, and these are also discussed.
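The two mechanisms named in the abstract, an LZ78-like phrase dictionary built over POS tags and a separate lexicon per tag, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `lz78_tag_phrases` and `lexicon_events`, the tag labels, and the output representation are assumptions made for illustration, and the arithmetic coding stage is omitted entirely.

```python
from collections import Counter, defaultdict


def lz78_tag_phrases(tags):
    """Greedy LZ78-style parse of a tag sequence (illustrative only).

    Returns (output, dictionary). Each output item is a pair
    (index of longest known phrase, next tag); index 0 is the empty phrase.
    """
    dictionary = {(): 0}  # maps tag-phrase (tuple of tags) to phrase number
    output = []
    phrase = ()
    for tag in tags:
        candidate = phrase + (tag,)
        if candidate in dictionary:
            # Keep extending while the phrase is already known.
            phrase = candidate
        else:
            # Emit the known prefix plus the new tag, then learn the phrase.
            output.append((dictionary[phrase], tag))
            dictionary[candidate] = len(dictionary)
            phrase = ()
    if phrase:
        # Input ended inside a known phrase; emit it with no extension tag.
        output.append((dictionary[phrase], None))
    return output, dictionary


def lexicon_events(tagged_words):
    """Track one lexicon per tag (hypothetical helper).

    A word/tag pair is marked "novel" the first time it is seen (it would be
    transmitted explicitly) and "known" afterwards (it would be coded from
    the empirical counts held in that tag's lexicon).
    """
    lexicons = defaultdict(Counter)
    events = []
    for word, tag in tagged_words:
        kind = "known" if word in lexicons[tag] else "novel"
        events.append((kind, tag, word))
        lexicons[tag][word] += 1
    return events, lexicons


if __name__ == "__main__":
    tags = ["DT", "NN", "VB", "DT", "NN", "VB", "DT", "JJ"]
    parsed, phrases = lz78_tag_phrases(tags)
    print(parsed)
    events, lexicons = lexicon_events(
        [("the", "DT"), ("dog", "NN"), ("the", "DT")]
    )
    print(events)
```

In a real coder, the phrase numbers and the per-tag word choices would each be passed to an arithmetic coder driven by the corresponding empirical distribution; here they are simply returned as Python values.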
Keywords :
convergence of numerical methods; data compression; encoding; glossaries; word processing; PPM; arithmetically encoded phrase number; context models; empirical distribution; empirical tests; encoder; explicit-context model; fast convergence; greedy tag-phrase dictionary; lexical categories; part-of-speech labels; statistical context methods; text compression; unique word; word/tag pair; Compression algorithms; Computer science; Context modeling; Convergence; Dictionaries; Entropy; Natural languages; Predictive models; Probability distribution; Speech;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Compression Conference, 1998. DCC '98. Proceedings
Conference_Location :
Snowbird, UT
ISSN :
1068-0314
Print_ISBN :
0-8186-8406-2
Type :
conf
DOI :
10.1109/DCC.1998.672128
Filename :
672128