DocumentCode
2866700
Title
Direct pattern matching on compressed text
Author
De Moura, Edleno Silva ; Navarro, Gonzalo ; Ziviani, Nivio ; Baeza-Yates, Ricardo
Author_Institution
Dept. de Ciencia da Comput., Univ. Fed. de Minas Gerais, Belo Horizonte, Brazil
fYear
1998
fDate
9-11 Sep 1998
Firstpage
90
Lastpage
95
Abstract
We present a fast compression and decompression technique for natural language texts. The novelty is that exact search can be done directly on the compressed text, using any known sequential pattern matching algorithm. Approximate search can also be done efficiently without any decoding. The compression scheme uses semi-static word-based modeling and Huffman coding in which the coding alphabet is byte-oriented rather than bit-oriented. We use the first bit of each byte to mark the beginning of a word, which allows the compressed pattern to be searched for directly in the compressed text. We achieve a compression ratio of about 33% for typical English texts. When searching for simple patterns, our experiments show that running our algorithm on a compressed text is almost twice as fast as running agrep on the uncompressed version of the same text. When searching for complex or approximate patterns, our algorithm is up to 8 times faster than agrep.
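The marking idea described in the abstract can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the authors' implementation: instead of a byte-oriented Huffman code, it assigns each word a base-128 code by frequency rank, and it tokenizes on whitespace only. The names build_codebook, compress and find_phrase are hypothetical. What it keeps from the abstract is the high bit of the first byte of every codeword acting as a word-start mark, so that a compressed pattern can be located with an ordinary byte-string search.

```python
from collections import Counter


def build_codebook(words):
    """Map each distinct word to a byte string; frequent words get shorter codes.

    Assumption: a rank-based base-128 code stands in for byte-oriented Huffman.
    """
    ranked = [w for w, _ in Counter(words).most_common()]
    codebook = {}
    for rank, word in enumerate(ranked):
        digits = []
        while True:
            digits.append(rank % 128)  # 7 payload bits per byte
            rank //= 128
            if rank == 0:
                break
        digits.reverse()
        digits[0] |= 0x80  # mark bit: this byte begins a word
        codebook[word] = bytes(digits)
    return codebook


def compress(words, codebook):
    """Concatenate the codewords of a word sequence."""
    return b"".join(codebook[w] for w in words)


def find_phrase(compressed, phrase, codebook):
    """Return byte offsets where the phrase occurs, without decompressing."""
    try:
        needle = compress(phrase.split(), codebook)
    except KeyError:
        return []  # a word outside the vocabulary cannot occur in the text
    if not needle:
        return []
    hits = []
    pos = compressed.find(needle)
    while pos != -1:
        end = pos + len(needle)
        # The needle starts with a marked byte, so every match starts on a
        # word boundary. This toy code is not a prefix code, so also require
        # that the match ends on a boundary (end of text or a marked byte).
        if end == len(compressed) or compressed[end] & 0x80:
            hits.append(pos)
        pos = compressed.find(needle, pos + 1)
    return hits


words = "the cat and the dog and the bird".split()
codebook = build_codebook(words)
compressed = compress(words, codebook)
print(find_phrase(compressed, "and the", codebook))
```

Because the search is a plain bytes.find over the compressed data, any sequential string matching routine could be substituted for it, which is the property the abstract emphasizes. The end-of-match check is needed here only because the rank-based code lets a one-byte codeword be the prefix of a two-byte one.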
Keywords
Huffman codes; data compression; natural languages; pattern matching; search problems; string matching; word processing; English texts; Huffman coding; agrep; approximate search; coding alphabet; compressed pattern; compressed text; compression ratio; compression scheme; decompression technique; direct pattern matching; exact search; fast compression; natural language texts; semi static word based modeling; sequential pattern matching algorithm; uncompressed version; Costs; Databases; Decoding; Electrical capacitance tomography; Pattern matching; Scholarships
fLanguage
English
Publisher
IEEE
Conference_Titel
String Processing and Information Retrieval: A South American Symposium, 1998. Proceedings
Conference_Location
Santa Cruz de La Sierra
Print_ISBN
0-8186-8664-2
Type
conf
DOI
10.1109/SPIRE.1998.712987
Filename
712987