Regional Information Center for Science and Technology

Abstract:

Compression techniques in the LZ77 family operate by repeatedly searching for strings in a dictionary and then outputting a series of tokens which unambiguously define the chosen sequence of strings. The dictionary is composed of the most-recently matched N symbols, for some implementation-dependent N. The strings to be matched are the prefixes of the remaining input symbols. When a particular prefix has been matched, those symbols are moved from the beginning of the remaining symbols to the end of the dictionary; in general this will cause some symbols to be deleted from the beginning of the dictionary, in order to limit its size to N. Compression algorithms in the LZ77 family perform a greedy choice when looking for the next string of input symbols to match. That is, the longest string of symbols which is found in the current dictionary is chosen as the next match. Many variations of LZ77 have been proposed; some of these attempt to improve compression by sometimes choosing a non-maximal string, if it appears that such a choice might improve the overall compression ratio. In this paper we present an algorithm which computes a set of matches designed to minimize the number of bits output, not necessarily the number of strings matched. In some variants of LZ77, the token stream is itself compressed using a statistical technique, which means the length of a token is not known a priori. However, other LZ77 variants code the tokens using a scheme for which the length of a given token can be computed in advance. In such a case it is computationally feasible to compute the globally optimum set of matches (we refer to this as the optimum parsing of the input). The basic idea is as follows. At each step of the compression process, the number of bits required by an optimum parsing of the input ending at the current position is known. 
If the longest match available at this point has length m, then candidate optimum parsings for each of the next m positions can be computed by adding the number of bits required for the current position to the token lengths for each of the m possible prefixes of the longest match. These m values are compared pairwise to the current values for the next m locations, and for each improved bit count, the new value and a pointer to the current location are stored. When the end of the input is reached, the pointers are traced backwards from the final input symbol to compute the optimum parsing. The Calgary Corpus was used as the test data. An implementation of LZSS which has a maximum match length of 16, a dictionary of 4K symbols and token sizes known a priori was used as the base algorithm. Our algorithm reduced the average compression ratio from 45.28% to 42.64%, a (relative) improvement of better than 5.8%.
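The procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The window size (4096) and maximum match length (16) come from the abstract; the token costs (`LITERAL_BITS = 9`, `MATCH_BITS = 17`), the minimum match length of 2, the brute-force `longest_match` search, and all function names are assumptions chosen only for illustration. A greedy parser is included for comparison.

```python
from typing import List, Tuple

WINDOW = 4096      # dictionary size N (LZSS baseline in the abstract)
MAX_MATCH = 16     # maximum match length (from the abstract)
LITERAL_BITS = 9   # assumed: 1 flag bit + 8-bit symbol
MATCH_BITS = 17    # assumed: 1 flag bit + 12-bit offset + 4-bit length
MIN_MATCH = 2      # assumed shortest string coded as a match token


def longest_match(data: bytes, pos: int) -> int:
    """Length of the longest window match for data[pos:] (brute force)."""
    best = 0
    limit = min(MAX_MATCH, len(data) - pos)
    for cand in range(max(0, pos - WINDOW), pos):
        length = 0
        while length < limit and data[cand + length] == data[pos + length]:
            length += 1
        best = max(best, length)
    return best


def token_bits(length: int) -> int:
    """Bit cost of a token, known in advance (assumed fixed-size coding)."""
    return LITERAL_BITS if length == 1 else MATCH_BITS


def optimal_parse(data: bytes) -> Tuple[int, List[Tuple[int, int]]]:
    """Return (total bits, [(start, length), ...]) of a minimum-bit parsing."""
    n = len(data)
    cost = [float("inf")] * (n + 1)  # cost[i]: fewest bits to code data[:i]
    prev = [0] * (n + 1)             # prev[i]: start of the token ending at i
    cost[0] = 0
    for pos in range(n):
        m = max(longest_match(data, pos), 1)  # a literal is always possible
        # Every prefix of the longest match is a candidate token.
        for length in range(1, m + 1):
            candidate = cost[pos] + token_bits(length)
            if candidate < cost[pos + length]:
                cost[pos + length] = candidate
                prev[pos + length] = pos
    # Trace the stored pointers backwards from the end of the input.
    tokens = []
    pos = n
    while pos > 0:
        tokens.append((prev[pos], pos - prev[pos]))
        pos = prev[pos]
    tokens.reverse()
    return cost[n], tokens


def greedy_parse(data: bytes) -> Tuple[int, List[Tuple[int, int]]]:
    """Baseline: always take the longest available match."""
    pos, bits, tokens = 0, 0, []
    while pos < len(data):
        m = longest_match(data, pos)
        length = m if m >= MIN_MATCH else 1
        bits += token_bits(length)
        tokens.append((pos, length))
        pos += length
    return bits, tokens


if __name__ == "__main__":
    sample = b"ab-bcdefgh-abcdefgh"
    print("greedy bits: ", greedy_parse(sample)[0])
    print("optimal bits:", optimal_parse(sample)[0])
```

On the sample string, the greedy parser codes the final "ab" as a two-symbol match and then needs a second match for "cdefgh", whereas the minimum-bit parsing emits "a" as a literal and covers "bcdefgh" with a single match. Under the assumed token costs this gives 133 bits for greedy versus 125 bits for the optimum parsing.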