• DocumentCode
    37051
  • Title
    GFilter: A General Gram Filter for String Similarity Search
  • Author
    Haoji Hu ; Kai Zheng ; Xiaoling Wang ; Aoying Zhou
  • Author_Institution
    Shanghai Key Lab. of Trustworthy Comput., East China Normal Univ., Shanghai, China
  • Volume
    27
  • Issue
    4
  • fYear
    2015
  • fDate
    April 1 2015
  • Firstpage
    1005
  • Lastpage
    1018
  • Abstract
    Numerous applications such as data integration, protein detection, and article copy detection share a similar core problem: given a query string, how to efficiently find all similar answers in a large-scale string collection. Many existing methods adopt a prefix-filter-based framework to solve this problem, and a number of recent works aim to use advanced filters to improve the overall search performance. In this paper, we propose a gram-based framework to achieve near-maximum filter performance. The main idea is to judiciously choose high-quality grams as the prefix of the query according to their estimated ability to filter candidates. As this selection process is proved to be NP-hard, we give a cost model to measure the filter ability of grams and develop efficient heuristic algorithms to find high-quality grams. Extensive experiments on real datasets demonstrate the superiority of the proposed framework in comparison with state-of-the-art approaches.
  • Keywords
    computational complexity; data integration; query processing; GFilter; NP-hard problem; article copy detection; general gram filter; gram-based framework; heuristic algorithm; large scale string collection; prefix-filter-based framework; protein detection; query prefix; search performance; string similarity search; Educational institutions; Greedy algorithms; Indexes; Proteins; Radiation detectors; Search problems; similarity search
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2014.2349914
  • Filename
    6880793