Abstract :
Summary form only given. We present a new implementation of the context sorting data compression method (Yokoo, 1997), an on-line adaptive algorithm for text compression. Our key idea is to utilize a new data structure called a prefix list (Yokoo, 1999). The original context sorting compression method, which uses neither explicit modeling nor arithmetic coding, can be viewed as a symbol ranking text compressor. In the method presented here, in contrast, we form a context model with a frequency distribution to predict the current symbol. Our context model can exploit contexts of unlimited length, and it is combined with arithmetic coding. In these respects, the proposed method can also be viewed as an implementation of PPM* (Cleary and Teahan, 1997). Our space requirement is linear in the string length and does not depend on the context order. The prefix list is a dynamic data structure that was proposed primarily to maintain a set of contexts in reverse lexicographic order. With it, we can easily gather previous contexts according to their similarity to the current context. Predicted symbols can then be enumerated as the symbols that follow those contexts. While enumerating those predicted symbols, we can completely simulate PPM*. If a set of contexts whose similarities to the current context are d or larger gives only one prediction, then their common d-symbol suffix is said to be a deterministic context. In our method, we begin the PPM mechanism with the shortest deterministic context if any, or with the contexts most similar to the current one otherwise. An escape symbol is emitted each time the similarity between the current context and the existing context pointed to by a pointer in the prefix list decreases.
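The notions of reverse lexicographic order, similarity, and deterministic context can be illustrated with a small sketch. This is not the prefix list itself, nor the authors' implementation: it is a naive quadratic-time illustration that assumes "similarity" means the length of the longest common suffix of two contexts (consistent with the "common d-symbol suffix" above). All function names are hypothetical.

```python
from typing import Dict, List, Optional


def similarity(a: str, b: str) -> int:
    """Assumed definition: length of the longest common suffix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def sorted_contexts(text: str) -> List[int]:
    """Positions i, ordered so that the contexts text[:i] are in reverse
    lexicographic order (compared from their last symbol backwards)."""
    return sorted(range(1, len(text)), key=lambda i: text[:i][::-1])


def predictions(text: str, d: int) -> Dict[str, int]:
    """Frequency of the symbols that follow earlier contexts whose
    similarity to the current context (all of text) is at least d."""
    counts: Dict[str, int] = {}
    for i in range(1, len(text)):
        if similarity(text[:i], text) >= d:
            counts[text[i]] = counts.get(text[i], 0) + 1
    return counts


def shortest_deterministic_order(text: str) -> Optional[int]:
    """Smallest d >= 1 for which the gathered contexts give one prediction."""
    for d in range(1, len(text)):
        found = predictions(text, d)
        if not found:
            return None  # no earlier context is this similar
        if len(found) == 1:
            return d
    return None


if __name__ == "__main__":
    seen = "abracadabra"
    print(sorted_contexts(seen))
    print(predictions(seen, 1))
    print(shortest_deterministic_order(seen))
```

For the input "abracadabra", contexts ending in "a" predict b, c, and d, while contexts ending in "ra" predict only c, so under the assumed definition the shortest deterministic context is the 2-symbol suffix "ra". Sorting by reversed context places similar contexts next to each other, which is the property the prefix list is said to maintain dynamically.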
Keywords :
arithmetic codes; data compression; data structures; sorting; string matching; text analysis; PPM*; arithmetic coding; context sorting; d-symbol suffix; deterministic context; dynamic data structure; escape symbol; frequency distribution; on-line adaptive algorithm; prediction; prefix list; reverse lexicographic order; simulation; string length; text compression