  • DocumentCode
    1625096
  • Title
    A Primitive Operator for Similarity Joins in Data Cleaning
  • Author
    Chaudhuri, Surajit ; Ganti, Venkatesh ; Kaushik, Raghav
  • Author_Institution
    Microsoft Research
  • fYear
    2006
  • Firstpage
    5
  • Lastpage
    5
  • Abstract
    Data cleaning based on similarities involves identification of "close" tuples, where closeness is evaluated using a variety of similarity functions chosen to suit the domain and application. Current approaches for efficiently implementing such similarity joins are tightly tied to the chosen similarity function. In this paper, we propose a new primitive operator which can be used as a foundation to implement similarity joins according to a variety of popular string similarity functions, and notions of similarity which go beyond textual similarity. We then propose efficient implementations for this operator. In an experimental evaluation using real datasets, we show that the implementation of similarity joins using our operator is comparable to, and often substantially better than, previous customized implementations for particular similarity functions.
  • Keywords
    Cleaning; Couplings; Data engineering; Data warehouses; Database systems; Hamming distance; Marketing and sales; Performance evaluation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Proceedings of the 22nd International Conference on Data Engineering (ICDE '06), 2006
  • Print_ISBN
    0-7695-2570-9
  • Type
    conf
  • DOI
    10.1109/ICDE.2006.9
  • Filename
    1617373