• DocumentCode
    1514460
  • Title

    Scalable Iterative Graph Duplicate Detection

  • Author

    Herschel, Melanie ; Naumann, Felix ; Szott, Sascha ; Taubert, Maik

  • Author_Institution
    Wilhelm-Schickard Inst. für Inf., Univ. Tübingen, Tübingen, Germany
  • Volume
    24
  • Issue
    11
  • fYear
    2012
  • Firstpage
    2094
  • Lastpage
    2108
  • Abstract
    Duplicate detection determines different representations of real-world objects in a database. Recent research has considered the use of relationships among object representations to improve duplicate detection. In the general case where relationships form a graph, research has mainly focused on duplicate detection quality/effectiveness. Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks. We scale up duplicate detection in graph data (DDG) to large amounts of data and pairwise comparisons, using the support of a relational database management system. To this end, we first present a framework that generalizes the DDG process. We then present algorithms to scale DDG in space (amount of data processed with bounded main memory) and in time. Finally, we extend our framework to allow batched and parallel DDG, thus further improving efficiency. Experiments on data up to two orders of magnitude larger than the data considered so far in DDG show that our methods achieve the goal of scaling DDG to large volumes of data.
  • Keywords
    graph theory; iterative methods; relational databases; DDG; duplicate detection in graph data; object representations; real-world objects; relational database management system; scalable iterative graph duplicate detection; Classification algorithms; Databases; Image edge detection; Motion pictures; Runtime; Scalability; Sorting; Duplicate detection; data cleaning; data integration; entity resolution; parallelization; record linkage; scalability
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2011.99
  • Filename
    5765953