• DocumentCode
    86344
  • Title
    Progressive Duplicate Detection
  • Author
    Papenbrock, Thorsten ; Heise, Arvid ; Naumann, Felix
  • Author_Institution
    Dept. of Inf. Syst., Hasso-Plattner-Inst., Potsdam, Germany
  • Volume
    27
  • Issue
    5
  • fYear
    2015
  • fDate
    May 1 2015
  • Firstpage
    1316
  • Lastpage
    1329
  • Abstract
    Duplicate detection is the process of identifying multiple representations of the same real-world entities. Today, duplicate detection methods need to process ever larger datasets in ever shorter time: maintaining the quality of a dataset becomes increasingly difficult. We present two novel, progressive duplicate detection algorithms that significantly increase the efficiency of finding duplicates if the execution time is limited: they maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Comprehensive experiments show that our progressive algorithms can double the efficiency over time of traditional duplicate detection and significantly improve upon related work.
  • Keywords
    data handling; dataset quality maintenance; efficiency improvement; execution time; overall process gain maximization; progressive duplicate detection methods; real world entities; Algorithm design and analysis; Clustering algorithms; Detection algorithms; Heuristic algorithms; Partitioning algorithms; Runtime; Sorting; Data cleaning; Duplicate detection; Entity resolution; Pay-as-you-go; Progressiveness
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2014.2359666
  • Filename
    6910276