DocumentCode :
1938443
Title :
Prediction by grammatical match
Author :
Lake, J. Michael
Author_Institution :
Dept. of Comput. Sci., North Carolina State Univ., Raleigh, NC, USA
fYear :
2000
fDate :
2000
Firstpage :
153
Lastpage :
162
Abstract :
We present prediction by grammatical match (PGM), a new general purpose adaptive text compression framework successfully blending finite-context and general context-free models. A PGM compressor operates incrementally by parsing a prefix of the input text, generating a set of analyses; these analyses are scored according to encoding cost, the cheapest is selected, and sent through an order k PPM encoder. PGM´s primary innovations include the use of a generalized PPM in selection and coding; the simultaneous use of multiple context-free grammars; the use of lexical left-corner derivations (LLCD); and an aggressive algorithm for constructing an LR(O) parsable metalanguage for LLCD. LLCD are a hybrid of bottom-up and top-down descriptions that represent grammatical information implicitly with each lexeme. The constructed metalanguage extends this to include explicit top-down steps to resolve local ambiguities in at most one strictly grammatical symbol. These properties combine to deliver excellent compression. On a test corpus of about 1 MB of Scheme program text, PGM with a generic Scheme grammar required about 26% fewer bits than PPM to represent the entire corpus, with reductions on individual files reaching as high as 55%. In addition, PGM enriches the time-compression-memory tradeoff options, since a low order PGM can achieve bpc rates comparable to high order PPM at considerable savings in space. PGM compression operates in expected linear time and space for many kinds of grammars. PGM decompression operates in guaranteed linear time and space
Keywords :
Scheme; computational complexity; context-free grammars; data compression; string matching; text analysis; adaptive text compression; bottom-up descriptions; context-free models; encoding cost; finite-context models; generic Scheme grammar; grammatical match; lexical left-corner derivations; linear time; multiple context-free grammars; parsable metalanguage; prediction; prefix parsing; text analyses; top-down descriptions; Context modeling; Costs; Encoding; Predictive models; Technological innovation; Testing;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Data Compression Conference, 2000. Proceedings. DCC 2000
Conference_Location :
Snowbird, UT
ISSN :
1068-0314
Print_ISBN :
0-7695-0592-9
Type :
conf
DOI :
10.1109/DCC.2000.838155
Filename :
838155
Link To Document :
بازگشت