• DocumentCode
    2973120
  • Title
    Effective Incremental Clustering for Duplicate Detection in Large Databases
  • Author
    Folino, Francesco ; Manco, Giuseppe ; Pontieri, Luigi
  • Author_Institution
    ICAR, CNR, Rende
  • fYear
    2006
  • fDate
    Dec. 2006
  • Firstpage
    45
  • Lastpage
    52
  • Abstract
    We propose an incremental algorithm for discovering clusters of duplicate tuples in large databases. The core of the approach is an indexing technique that, for any newly arrived tuple mu, efficiently retrieves the set of tuples in the database that are most similar to mu and are therefore likely to refer to the same real-world entity as mu. The proposed index is based on a hashing approach that tends to assign similar objects to the same buckets. Empirical and analytical evaluation shows that the proposed approach achieves satisfactory efficiency at the cost of a small loss in accuracy.
  • Keywords
    database indexing; pattern clustering; very large databases; duplicate detection; duplicate tuple; hashing approach; incremental clustering algorithm; indexing technique; large database; Clustering algorithms; Costs; Databases; Decision making; Demography; Indexing; Information analysis; Information retrieval; Scalability; Warehousing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database Engineering and Applications Symposium, 2006. IDEAS '06. 10th International
  • Conference_Location
    Delhi
  • ISSN
    1098-8068
  • Print_ISBN
    0-7695-2577-6
  • Type
    conf
  • DOI
    10.1109/IDEAS.2006.18
  • Filename
    4041602
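
The abstract describes an index that hashes similar tuples into the same buckets, so that the likely duplicates of a newly arrived tuple can be retrieved without scanning the whole database. The record does not specify the paper's actual hash functions or clustering procedure, so the sketch below swaps in a standard technique, MinHash signatures over character q-grams with banded bucketing, purely as an illustration of the general idea. All names (`qgrams`, `minhash_signature`, `IncrementalDedupIndex`) and parameters (`q`, `num_hashes`, `bands`) are hypothetical and are not taken from the paper.

```python
import hashlib
from collections import defaultdict


def qgrams(text, q=3):
    """Return the set of character q-grams of a case- and space-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def minhash_signature(tokens, num_hashes=32):
    """For each seeded hash function, keep the minimum hash value over all tokens."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(),
                "big",
            )
            for t in tokens
        ))
    return tuple(signature)


class IncrementalDedupIndex:
    """Illustrative bucket index: similar strings tend to share at least one bucket."""

    def __init__(self, num_hashes=32, bands=16):
        assert num_hashes % bands == 0
        self.num_hashes = num_hashes
        self.bands = bands
        self.rows = num_hashes // bands
        self.buckets = defaultdict(set)  # (band number, band values) -> tuple ids
        self.tuples = []                 # tuple id -> original text

    def _keys(self, text):
        sig = minhash_signature(qgrams(text), self.num_hashes)
        return [
            (b, sig[b * self.rows:(b + 1) * self.rows])
            for b in range(self.bands)
        ]

    def candidates(self, text):
        """Ids of stored tuples sharing at least one bucket with `text`."""
        found = set()
        for key in self._keys(text):
            found |= self.buckets.get(key, set())
        return found

    def add(self, text):
        """Look up candidate duplicates, then insert the new tuple incrementally."""
        cands = self.candidates(text)
        tid = len(self.tuples)
        self.tuples.append(text)
        for key in self._keys(text):
            self.buckets[key].add(tid)
        return tid, cands
```

In this sketch, `add` returns the new tuple's id together with the ids of earlier tuples that fall into a shared bucket. A full duplicate-detection pipeline would then compare the new tuple against those candidates with an exact similarity measure and assign it to an existing cluster or start a new one. Using fewer rows per band retrieves more candidates (higher recall, more comparisons), which mirrors the efficiency versus accuracy trade-off mentioned in the abstract.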