Title :
Bit-parallel approach to approximate string matching in compressed texts
Author :
Matsumoto, Tetsuya ; Kida, Takuya ; Takeda, Masayuki ; Shinohara, Ayumi ; Arikawa, Setsuo
Author_Institution :
Dept. of Inf., Kyushu Univ., Fukuoka, Japan
Abstract :
Addresses the problem of approximate string matching on compressed text. We consider this problem for a text string described in terms of a collage system, which is a formal system proposed by T. Kida et al. (1999) that captures various dictionary-based compression methods. We present an algorithm that exploits bit-parallelism, assuming that our problem fits in a single machine word, e.g. (m-k)(k+1)⩽L, where m is the pattern length, k is the number of allowed errors and L is the length, in bits, of the machine word. For a class of simple collage systems, the algorithm runs in O(k2(||𝒟||+|𝒮|)+km) time using O(k2||𝒟||) space, where ||𝒟|| is the size of dictionary 𝒟 and |𝒮| is the number of tokens in the sequence 𝒮. The LZ78 (Lempel-Ziv, 1978) and the LZW (Lempel-Ziv-Welch, 1984) compression methods are covered by this class. Since we can regard n=||𝒟||+|𝒮| as the compressed length, the time and space complexities are O(k2n+km) and O(k 2n), respectively. For general k and m, they become O(k3 mn/L+km) and O(k3mn/L). Thus, our algorithm is competitive to the algorithm proposed by J. Kärkkäinen, et al. (2000), which runs in O(km) time using O(kmn) space, when k=O(√L)
Keywords :
computational complexity; data compression; string matching; Ziv-Lempel compression methods; allowed errors; approximate string matching; bit-parallelism; collage system; compressed length; compressed texts; dictionary size; dictionary-based compression methods; machine word; pattern length; space complexity; time complexity; token sequence; Automata; Dictionaries; Informatics; Pattern matching;
Conference_Titel :
String Processing and Information Retrieval, 2000. SPIRE 2000. Proceedings. Seventh International Symposium on
Conference_Location :
A Curuna
Print_ISBN :
0-7695-0746-8
DOI :
10.1109/SPIRE.2000.878198