Title :
Multi-lingual cascading text compressors for WWW
Author_Institution :
Sch. of Comput., Nat. Univ. of Singapore, Singapore
Abstract :
Global sharing and distribution of information on the Internet result in a great demand for efficient multilingual text compression in Web server and proxy implementations. Current text compressors, such as Huffman coding, Lempel-Ziv (LZ) variants, and LZ-Huffman cascading, perform inefficiently on multilingual text because their character sampling size is mismatched to the text and because multilingual character sets are large. Our previous research has shown that a better compression ratio can be obtained by readjusting the character sampling rate. We investigate the cascading of LZ variants with Huffman coding for multilingual documents. Two basic approaches are proposed, one using static dictionaries and one using dynamic dictionaries. Techniques for reducing the dictionary overhead are also suggested. On our multilingual corpus, our adaptive cascading scheme performs better than the well-known cascading compressor gzip by an average of about 20%.
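The effect of the character sampling size can be illustrated with a minimal sketch. It is not the scheme proposed in the paper: it assumes a made-up sample string, assumes UTF-16-BE as the encoding, and uses zero-order entropy as a stand-in for the code length a Huffman coder would approach. The helper name `entropy_bits` is hypothetical. The last step runs zlib's DEFLATE, the LZ77-plus-Huffman cascade that gzip uses, on the same bytes for comparison.

```python
import math
import zlib
from collections import Counter


def entropy_bits(data: bytes, width: int) -> float:
    """Total zero-order entropy, in bits, of data read as width-byte symbols."""
    symbols = [data[i:i + width] for i in range(0, len(data) - width + 1, width)]
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c * math.log2(c / total) for c in counts.values())


# Hypothetical mixed-language sample; every character here is 2 bytes in UTF-16-BE.
sample = "壓縮多語言文本需要合適的取樣大小。" * 20 + "Text compression for the Web. " * 20
raw = sample.encode("utf-16-be")

bits8 = entropy_bits(raw, 1)    # symbols sampled one byte at a time
bits16 = entropy_bits(raw, 2)   # symbols sampled one 16-bit character at a time
deflated = len(zlib.compress(raw, 9))  # LZ77 + Huffman cascade (DEFLATE)

print(f"raw size:              {len(raw)} bytes")
print(f"8-bit symbol entropy:  {bits8 / 8:.0f} bytes")
print(f"16-bit symbol entropy: {bits16 / 8:.0f} bytes")
print(f"DEFLATE output:        {deflated} bytes")
```

Sampling at the character width never costs more bits than byte-wise sampling in this model, but the 16-bit symbol table is larger, which is the kind of dictionary overhead the abstract refers to.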
Keywords :
Huffman codes; Internet; data compression; document handling; text analysis; Huffman coding; LZ-Huffman cascading; Lempel-Ziv; WWW; Web servers; adaptive cascading; character sampling rate; compression ratio; dictionary overhead; dynamic dictionaries; gzip; large character set; mis-matched character sampling size; multi-lingual corpus; multilingual cascading text compressors; multilingual documents; static dictionaries; text compression; Compression algorithms; Compressors; Dictionaries; Natural languages; Sampling methods; World Wide Web;
Conference_Title :
Information Technology: Coding and Computing, 2000. Proceedings. International Conference on
Conference_Location :
Las Vegas, NV
Print_ISBN :
0-7695-0540-6
DOI :
10.1109/ITCC.2000.844279