DocumentCode
3025573
Title
Tag based models of English text
Author
Teahan, W.J. ; Cleary, John G.
Author_Institution
Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand
fYear
1998
fDate
30 Mar-1 Apr 1998
Firstpage
43
Lastpage
52
Abstract
The problem of compressing English text is important both because of the ubiquity of English as a target for compression and because of the light that compression can shed on the structure of English. English text is examined in conjunction with additional information about the parts of speech of each word in the text (these are referred to as “tags”). It is shown that the tags plus the text can be compressed more than the text alone. Essentially the tags can be compressed for nothing or even a small net saving in size. A comparison is made of a number of different ways of integrating compression of tags and text using an escape mechanism similar to PPM. These are also compared with standard word based and character based compression programs. The result is that the tag and word based schemes always outperform the character based schemes. Overall, the tag based schemes outperform the word based schemes. We conclude by conjecturing that tags chosen for compression rather than linguistic purposes would perform even better.
Keywords
data compression; linguistics; word processing; English text compression; PPM; character based compression programs; escape mechanism; linguistics; performance; tag based models; tag compression; word based compression program; Compressors; Computer graphics; Computer science; Decoding; Encoding; Geophysics computing; Natural languages; Speech recognition; Technical Activities Guide -TAG;
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Compression Conference, 1998. DCC '98. Proceedings
Conference_Location
Snowbird, UT
ISSN
1068-0314
Print_ISBN
0-8186-8406-2
Type
conf
DOI
10.1109/DCC.1998.672130
Filename
672130