• DocumentCode
    2533842
  • Title

    An incremental clustering scheme for duplicate detection in large databases

  • Author

    Cesario, Eugenio ; Folino, Francesco ; Manco, Giuseppe ; Pontieri, Luigi

  • Author_Institution
    ICAR-CNR, Rende, Italy
  • fYear
    2005
  • fDate
    25-27 July 2005
  • Firstpage
    89
  • Lastpage
    95
  • Abstract
    We propose an incremental algorithm for clustering duplicate tuples in large databases. It assigns any new tuple t to the cluster containing the database tuples that are most similar to t (and hence are likely to refer to the same real-world entity as t). The core of the approach is a hash-based indexing technique that tends to assign highly similar objects to the same buckets. Empirical evaluation shows that the proposed method yields a considerable efficiency improvement over a state-of-the-art index structure for proximity searches in metric spaces.
  • Keywords
    database indexing; database tuples; duplicate detection; duplicate tuples; hash-based indexing; incremental clustering; index structure; large databases; metric spaces; proximity searches; Clustering algorithms; Clustering methods; Couplings; Data engineering; Delay; Extraterrestrial measurements; Indexing; Information retrieval; Scalability; Spatial databases;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database Engineering and Application Symposium, 2005. IDEAS 2005. 9th International
  • ISSN
    1098-8068
  • Print_ISBN
    0-7695-2404-4
  • Type

    conf

  • DOI
    10.1109/IDEAS.2005.10
  • Filename
    1540899
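
The Abstract describes assigning each new tuple to the cluster of its most similar stored tuples, using a hash index that tends to place similar objects in the same buckets. The following is a minimal sketch of that general idea, not the authors' method: the record does not specify the hash function, so the sketch assumes min-hashing over word tokens with Jaccard similarity. The names `IncrementalDeduper`, `minhash_signature`, `num_hashes`, and `threshold` are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import defaultdict


def tokens(tuple_text):
    """Split a record into a set of lowercase word tokens."""
    return set(tuple_text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(token_set, num_hashes=8):
    """One min-hash value per seeded hash function (assumed scheme)."""
    if not token_set:
        return [""] * num_hashes
    return [
        min(hashlib.md5(f"{seed}:{tok}".encode()).hexdigest()
            for tok in token_set)
        for seed in range(num_hashes)
    ]


class IncrementalDeduper:
    """Hypothetical incremental clusterer backed by min-hash buckets."""

    def __init__(self, num_hashes=8, threshold=0.5):
        self.num_hashes = num_hashes
        self.threshold = threshold        # minimum similarity to join a cluster
        self.buckets = defaultdict(set)   # (hash index, min-hash) -> record ids
        self.records = []                 # record id -> token set
        self.cluster_of = []              # record id -> cluster id
        self.next_cluster = 0

    def add(self, tuple_text):
        """Insert one tuple and return the id of the cluster it joins."""
        toks = tokens(tuple_text)
        keys = list(enumerate(minhash_signature(toks, self.num_hashes)))

        # Only records sharing at least one bucket are compared.
        candidates = set()
        for key in keys:
            candidates |= self.buckets.get(key, set())

        best_id, best_sim = None, 0.0
        for rid in candidates:
            sim = jaccard(toks, self.records[rid])
            if sim > best_sim:
                best_id, best_sim = rid, sim

        if best_id is not None and best_sim >= self.threshold:
            cluster = self.cluster_of[best_id]   # join the nearest cluster
        else:
            cluster = self.next_cluster          # start a new cluster
            self.next_cluster += 1

        rid = len(self.records)
        self.records.append(toks)
        self.cluster_of.append(cluster)
        for key in keys:
            self.buckets[key].add(rid)
        return cluster
```

In this sketch, tuples with identical token sets always share every bucket and therefore land in the same cluster, while near-duplicates share a bucket only with a probability that grows with their Jaccard similarity. Each insertion compares the new tuple against bucket-mates rather than the whole database, which is the efficiency argument the Abstract makes in general terms.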