• DocumentCode
    610381
  • Title
    Top-k string similarity search with edit-distance constraints

  • Author
    Dong Deng ; Guoliang Li ; Jianhua Feng ; Wen-Syan Li
  • Author_Institution
    Dept. of Comput. Sci., Tsinghua Univ., Beijing, China
  • fYear
    2013
  • fDate
    8-12 April 2013
  • Firstpage
    925
  • Lastpage
    936
  • Abstract
    String similarity search is a fundamental operation in many areas, such as data cleaning, information retrieval, and bioinformatics. In this paper, we study the problem of top-k string similarity search with edit-distance constraints: given a collection of strings and a query string, return the k strings with the smallest edit distances to the query string. Existing methods usually try different edit-distance thresholds and select an appropriate threshold to find the top-k answers. However, selecting an appropriate threshold is rather expensive. To address this problem, we propose a progressive framework that improves the traditional dynamic-programming algorithm for computing edit distance. We prune unnecessary entries in the dynamic-programming matrix and compute only the pivotal entries. We extend our techniques to support top-k similarity search. We develop a range-based method that groups the pivotal entries to avoid duplicated computations. Experimental results show that our method achieves high performance and significantly outperforms state-of-the-art approaches on real-world datasets.
  • Keywords
    dynamic programming; query processing; string matching; bioinformatics; data cleaning; dynamic programming algorithm; dynamic programming matrix; edit-distance constraint; edit-distance threshold; information retrieval; query string; range-based method; top-k answer; top-k string similarity search; Bioinformatics; Cleaning; Indexes; Search problems; Time complexity
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2013 IEEE 29th International Conference on
  • Conference_Location
    Brisbane, QLD
  • ISSN
    1063-6382
  • Print_ISBN
    978-1-4673-4909-3
  • Electronic_ISBN
    1063-6382
  • Type
    conf
  • DOI
    10.1109/ICDE.2013.6544886
  • Filename
    6544886
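
To make the problem statement in the abstract concrete, the following is a minimal baseline sketch of top-k string similarity search under edit distance. It is not the progressive, pivotal-entry method the paper proposes: it computes the full dynamic-programming matrix (one row at a time) for every string and keeps the k smallest distances. The function names `edit_distance` and `top_k_similar` and the sample strings are assumptions made for illustration only.

```python
import heapq
from typing import Iterable, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute).

    Only the previous row of the matrix is kept, so memory use is
    proportional to the length of the shorter string.
    """
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer string, columns over the shorter
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def top_k_similar(strings: Iterable[str], query: str, k: int) -> List[Tuple[int, str]]:
    """Return the k (distance, string) pairs closest to the query.

    Brute-force baseline: every string is compared against the query.
    Ties on distance are broken by lexicographic order of the string.
    """
    scored = ((edit_distance(s, query), s) for s in strings)
    return heapq.nsmallest(k, scored)


if __name__ == "__main__":
    # Hypothetical sample collection, not taken from the paper's datasets.
    collection = ["kitten", "sitting", "mitten", "bitten", "fitting"]
    for distance, s in top_k_similar(collection, "sitten", 2):
        print(distance, s)
```

This baseline needs no threshold guessing, but it pays for a full matrix per string; the abstract's contribution is precisely to avoid computing most of those entries.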