DocumentCode :
2548189
Title :
Effective Indices for Efficient Approximate String Search and Similarity Join
Author :
Liu, Xuhui ; Li, Guoliang ; Feng, Jianhua ; Zhou, Lizhu
Author_Institution :
Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing
fYear :
2008
fDate :
20-22 July 2008
Firstpage :
127
Lastpage :
134
Abstract :
Data collections often contain inconsistencies that arise for a variety of reasons, and it is desirable to identify and resolve them efficiently. Similarity queries are commonly used in data cleaning to match similar data. In this work we concentrate on the following problem of approximate string matching based on edit distance: given a collection of strings, how can we find the strings similar to a given string, or the strings in another collection whose similarity exceeds some threshold? We propose an NFA-based (nondeterministic finite-state automaton) method for effective approximate string search. We model the strings as a trie and construct an NFA on top of the trie. We identify the similar strings by running the NFA based on tree automata theory. Moreover, we propose a grouped trie that further improves the performance of similarity search by incorporating effective pruning techniques. We have implemented our method, and the experimental results show that our approach achieves high performance and outperforms the existing state-of-the-art methods by orders of magnitude.
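To make the idea of edit-distance search over a trie concrete, the following is a minimal sketch and not the authors' NFA construction or grouped trie. It swaps in the textbook technique of carrying one row of the Levenshtein dynamic-programming table down each trie path, and prunes a subtree once every entry in the row exceeds the threshold. The names `TrieNode`, `build_trie`, `search`, and `_walk` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One trie node; `word` is set when a stored string ends here."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None


def build_trie(words: List[str]) -> TrieNode:
    """Insert every string into a trie so shared prefixes are stored once."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def search(root: TrieNode, query: str, max_dist: int) -> List[Tuple[str, int]]:
    """Return (string, edit distance) for stored strings within max_dist."""
    results: List[Tuple[str, int]] = []
    # Row for the empty prefix: distance to query[:j] is j insertions.
    first_row = list(range(len(query) + 1))
    if root.word is not None and first_row[-1] <= max_dist:
        results.append((root.word, first_row[-1]))
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, max_dist, results)
    return sorted(results)


def _walk(node: TrieNode, ch: str, query: str, prev_row: List[int],
          max_dist: int, results: List[Tuple[str, int]]) -> None:
    """Extend the DP table by one row for the trie edge labelled `ch`."""
    row = [prev_row[0] + 1]
    for j in range(1, len(query) + 1):
        cost = 0 if query[j - 1] == ch else 1
        row.append(min(row[j - 1] + 1,          # insertion
                       prev_row[j] + 1,         # deletion
                       prev_row[j - 1] + cost)) # match or substitution
    if node.word is not None and row[-1] <= max_dist:
        results.append((node.word, row[-1]))
    # Prune: the row minimum never decreases deeper in the trie, so no
    # descendant can be within the threshold once it is exceeded.
    if min(row) <= max_dist:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, max_dist, results)
```

For example, `search(build_trie(["data", "date", "dart"]), "data", 1)` returns the stored strings within edit distance 1 of the query. A similarity join between two collections could, under the same assumptions, be sketched by indexing one collection and issuing one such search per string of the other.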
Keywords :
finite automata; query processing; string matching; tree searching; data collection; edit distance; nondeterministic finite-state automation; pruning technique; similarity query; similarity search; string matching; string search; tree automata; Automata; Automation; Cleaning; Computer science; Databases; Dictionaries; Information management; Query processing; indexing; similarity join; similarity search;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Web-Age Information Management, 2008. WAIM '08. The Ninth International Conference on
Conference_Location :
Zhangjiajie, Hunan
Print_ISBN :
978-0-7695-3185-4
Electronic_ISBN :
978-0-7695-3185-4
Type :
conf
DOI :
10.1109/WAIM.2008.17
Filename :
4597005