DocumentCode :
3122345
Title :
Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
Author :
Behm, Alexander ; Ji, Shengyue ; Li, Chen ; Lu, Jiaheng
Author_Institution :
Dept. of Comput. Sci., Univ. of California, Irvine, CA
fYear :
2009
fDate :
March 29 2009-April 2 2009
Firstpage :
604
Lastpage :
615
Abstract :
Answering approximate queries on string collections is important in applications such as data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. Many existing algorithms use gram-based inverted-list indexing structures to answer approximate string queries. These indexing structures are "notoriously" large compared to the size of their original string collection. In this paper, we study how to reduce the size of such an indexing structure to a given amount of space, while retaining efficient query processing. We first study how to adopt existing inverted-list compression techniques to solve our problem. Then, we propose two novel approaches for achieving the goal: one is based on discarding gram lists, and one is based on combining correlated lists. They are both orthogonal to existing compression techniques, exploit a unique property of our setting, and offer new opportunities for improving query performance. For each approach we analyze its effect on query performance and develop algorithms for wisely choosing lists to discard or combine. Our extensive experiments on real data sets show that our approaches provide applications the flexibility in deciding the tradeoff between query performance and indexing size, and can outperform existing compression techniques. An interesting and surprising finding is that while we can reduce the index size significantly (up to 60% reduction) with tolerable performance penalties, for 20-40% reductions we can even improve query performance compared to original indexes.
Keywords :
data compression; indexing; query processing; answering approximate string queries; data cleaning; inverted-list compression techniques; query relaxation; space-constrained gram-based indexing; spell checking; Application software; Cleaning; Computer science; Data engineering; Delay; Indexing; Information systems; Intrusion detection; Knowledge engineering; Laboratories; Approximate String Search; Compression; Grams; Inverted Lists
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on
Conference_Location :
Shanghai
ISSN :
1084-4627
Print_ISBN :
978-1-4244-3422-0
Electronic_ISBN :
1084-4627
Type :
conf
DOI :
10.1109/ICDE.2009.32
Filename :
4812439