DocumentCode :
2581133
Title :
Invited Paper: Pattern Matching Using n-Gram Sampling of Cumulative Algebraic Signatures: Preliminary Results
Author :
Litwin, Witold ; MOKADEM, Riad ; Rigaux, Philippe ; Thomas, Schwarz
Author_Institution :
Dauphine Univ., Paris
fYear :
0
fDate :
0-0 0
Firstpage :
251
Lastpage :
255
Abstract :
We propose a novel string (pattern) matching algorithm called n-gram search. We intend it for the records stored once and searched many times in a database or a file, especially those organized in a scalable distributed data structure, (SDDS), over a grid or a structured P2P net. We presume that the records are encoded into their cumulative algebraic signatures, providing incidental confidentiality of stored data. The search starts with pre-processing the pattern, calculating the logarithmic algebraic signature (LAS) of the pattern and the LASs of every n-gram in it. The value of n ges 1 is a parameter that one may tune. The search attempts to match the LASs of n-grams in the pattern towards dynamically calculated LASs, sampled over n-grams in the records. A mismatch generates a shift of up to K - n symbols towards next sample, where K is the pattern length. The whole process is parallel over the SDDS servers and does not require any local decoding. For an M-symbol long record, the unsuccessful search, measured as number of match attempts, costs O ((M - K)/(K - n+1)). The 2-grams should typically suffice, leading to O ((M - K)/(K - 1)). We show that the algorithm particularly efficient for larger strings and records, i.e., with e-documents or DNA data. Preliminary results show then the n-gram search about (K - n + 1) faster than our previous algorithms and among the fastest known, e.g., probably often faster than Boyer-Moore
Keywords :
data structures; distributed databases; grid computing; parallel processing; peer-to-peer computing; search problems; string matching; P2P net; cumulative algebraic signatures; logarithmic algebraic signature; n-gram sampling; n-gram search; pattern matching; scalable distributed data structure; string matching algorithm; Content addressable storage; Costs; DNA; Data structures; Decoding; Distributed databases; Galois fields; Pattern matching; Sampling methods; Special issues and sections; SDDS; algebraic signatures.; distributed pattern matching; grid; scalable; structured P2P;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Database and Expert Systems Applications, 2006. DEXA '06. 17th International Workshop on
Conference_Location :
Krakow
ISSN :
1529-4188
Print_ISBN :
0-7695-2641-1
Type :
conf
DOI :
10.1109/DEXA.2006.80
Filename :
1698345
Link To Document :
بازگشت