  • DocumentCode
    3122345
  • Title
    Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
  • Author
    Behm, Alexander; Ji, Shengyue; Li, Chen; Lu, Jiaheng
  • Author_Institution
    Dept. of Comput. Sci., Univ. of California, Irvine, CA
  • fYear
    2009
  • fDate
    March 29 2009-April 2 2009
  • Firstpage
    604
  • Lastpage
    615
  • Abstract
    Answering approximate queries on string collections is important in applications such as data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. Many existing algorithms use gram-based inverted-list indexing structures to answer approximate string queries. These indexing structures are "notoriously" large compared to the size of their original string collection. In this paper, we study how to reduce the size of such an indexing structure to a given amount of space, while retaining efficient query processing. We first study how to adopt existing inverted-list compression techniques to solve our problem. Then, we propose two novel approaches for achieving the goal: one is based on discarding gram lists, and one is based on combining correlated lists. They are both orthogonal to existing compression techniques, exploit a unique property of our setting, and offer new opportunities for improving query performance. For each approach we analyze its effect on query performance and develop algorithms for wisely choosing lists to discard or combine. Our extensive experiments on real data sets show that our approaches provide applications the flexibility in deciding the tradeoff between query performance and indexing size, and can outperform existing compression techniques. An interesting and surprising finding is that while we can reduce the index size significantly (up to 60% reduction) with tolerable performance penalties, for 20-40% reductions we can even improve query performance compared to original indexes.
  • Keywords
    data compression; indexing; query processing; answering approximate string queries; data cleaning; inverted-list compression techniques; query relaxation; space-constrained gram-based indexing; spell checking; Application software; Cleaning; Computer science; Data engineering; Delay; Indexing; Information systems; Intrusion detection; Knowledge engineering; Laboratories; Approximate String Search; Compression; Grams; Inverted Lists
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on
  • Conference_Location
    Shanghai
  • ISSN
    1084-4627
  • Print_ISBN
    978-1-4244-3422-0
  • Electronic_ISBN
    1084-4627
  • Type
    conf
  • DOI
    10.1109/ICDE.2009.32
  • Filename
    4812439
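
Illustrative note on the abstract: the "gram-based inverted-list indexing structures" it mentions can be pictured with a small Python sketch. This is not the authors' implementation, and it does not show their list-discarding or list-combining techniques. It only shows the generic baseline those techniques start from: an inverted index from q-grams to string ids, a count filter, and an edit-distance check. The function names (`grams`, `build_index`, `edit_distance`, `search`), the choice q = 2, and the use of edit distance as the similarity measure are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def grams(s, q=2):
    """Return the overlapping q-grams of s, in order (no padding; an assumption)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q=2):
    """Inverted index: gram -> {string id: occurrences of the gram in that string}."""
    index = defaultdict(lambda: defaultdict(int))
    for sid, s in enumerate(strings):
        for g in grams(s, q):
            index[g][sid] += 1
    return index

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def search(query, strings, index, k, q=2):
    """Return all strings within edit distance k of query, sorted."""
    qcounts = Counter(grams(query, q))
    # Count filter: one edit destroys at most q grams of the query, so an
    # answer must share at least this many grams (counted as a bag).
    threshold = sum(qcounts.values()) - k * q
    if threshold <= 0:
        # The filter cannot prune anything; fall back to checking every string.
        candidates = range(len(strings))
    else:
        shared = Counter()
        for g, qc in qcounts.items():
            for sid, sc in index.get(g, {}).items():
                shared[sid] += min(qc, sc)
        candidates = [sid for sid, c in shared.items() if c >= threshold]
    # Candidates are only potential answers; verify each one.
    return sorted(strings[sid] for sid in candidates
                  if edit_distance(query, strings[sid]) <= k)
```

With `strings = ["shanghai", "shanghia", "irvine", "indexing"]` and `index = build_index(strings)`, calling `search("shanghai", strings, index, k=2)` returns `["shanghai", "shanghia"]`, while `k=1` returns only `["shanghai"]`. The index stores one list per distinct gram, which is why such structures tend to be much larger than the string collection itself, the size problem the abstract sets out to address.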