DocumentCode
3061833
Title
Improving text compression ratios with the Burrows-Wheeler transform
Author
Kruse, Holger ; Mukherjee, Amar
Author_Institution
Dept. of Comput. Sci., Central Florida Univ., Orlando, FL, USA
fYear
1999
fDate
29-31 Mar 1999
Firstpage
536
Abstract
[Summary form only given]. In this paper we describe several methods that can be used to improve the compression ratio of compression algorithms based an the Burrows-Wheeler transform, as implemented in `bzip2´, when used in combination with English language text files. We first briefly describe the Burrows-Wheeler transform and some of its strengths and weaknesses. We then describe our implementations of enhancements and test results. All of our enhancements are implemented in the form of pre- and postprocessing algorithms that are used in addition to the transform. The use of our algorithms has resulted in improvements in compression ratios of up to 20% over bzip2. The first method involves changing the order of characters in the character set encoding of the input file. We describe different ways to improve the encoding order. A more elaborate method involves minimizing the contextual differences between subsequent codes, using an optimized encoding order obtained by solving a traveling salesman problem. In the second method we replaced frequently occurring bigrams by single unused character codes, in order to better utilize the available character encoding space. In most cases this method did not result in improvements in the compression ratio. The final method is a refinement of the method we presented in “Preprocessing Text to Improve Compression Ratios”, DCC 98. A common English language dictionary is used to preprocess input text and replace English words with encodings that are more easily compressible by bzip2. In this paper we describe optimizations to our original method to further improve the compression ratio. We also show practical ways to use our method on wide-area networks like the Internet, and easy ways to solve the problem of dictionary distribution and caching
Keywords
Internet; cache storage; data compression; optimisation; text analysis; transform coding; transforms; wide area networks; Burrows-Wheeler transform; English language dictionary; English language text files; Internet; bigrams; bzip; character set encoding; compression algorithms; compression program; consonants; dictionary caching; dictionary distribution; input file; optimized encoding order; single unused character codes; text compression ratios; traveling salesman problem; vowels; wide-area networks; Compression algorithms; Data preprocessing; Dictionaries; Encoding; IP networks; Natural languages; Optimization methods; Space exploration; Testing; Traveling salesman problems;
fLanguage
English
Publisher
ieee
Conference_Titel
Data Compression Conference, 1999. Proceedings. DCC '99
Conference_Location
Snowbird, UT
ISSN
1068-0314
Print_ISBN
0-7695-0096-X
Type
conf
DOI
10.1109/DCC.1999.785693
Filename
785693
Link To Document