Effective Incremental Clustering for Duplicate Detection in Large Databases

Author

Folino, Francesco ; Manco, Giuseppe ; Pontieri, Luigi

Author_Institution

ICAR, CNR, Rende

fYear

2006

fDate

Dec. 2006

Firstpage

45

Lastpage

52

Abstract

We propose an incremental algorithm for discovering clusters of duplicate tuples in large databases. The core of the approach is an indexing technique which, for any newly arrived tuple mu, efficiently retrieves the set of tuples in the database that are most similar to mu and are therefore likely to refer to the same real-world entity as mu. The proposed index is based on a hashing approach that tends to assign similar objects to the same buckets. Empirical and analytical evaluation shows that the proposed approach achieves satisfactory efficiency at the cost of a small loss in accuracy.
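The abstract does not specify the index itself, so the following is only a minimal sketch of the general idea, assuming MinHash signatures with banded bucketing as a stand-in hashing scheme (not necessarily the one used in the paper). The class name IncrementalDeduper, the token-set representation of a tuple, and all parameter values are hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict


def tokens(tuple_text):
    """Lowercased word tokens of a record (assumed tuple representation)."""
    return set(tuple_text.lower().split())


def minhash_signature(token_set, num_hashes=8):
    """One minimum hash value per seeded hash function (MinHash)."""
    return [
        min(
            (int(hashlib.md5(f"{seed}:{tok}".encode()).hexdigest(), 16)
             for tok in token_set),
            default=0,  # empty records all get the same signature
        )
        for seed in range(num_hashes)
    ]


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


class IncrementalDeduper:
    """Hypothetical incremental clusterer backed by a hash-bucket index."""

    def __init__(self, num_hashes=8, band_size=2, threshold=0.5):
        self.num_hashes = num_hashes
        self.band_size = band_size
        self.threshold = threshold
        self.buckets = defaultdict(list)  # bucket key -> tuple ids
        self.token_sets = []              # tuple id -> token set
        self.cluster_of = []              # tuple id -> cluster id

    def _bucket_keys(self, token_set):
        # Split the signature into bands; each band is one bucket key,
        # so similar tuples tend to collide in at least one bucket.
        sig = minhash_signature(token_set, self.num_hashes)
        return [(i, tuple(sig[i:i + self.band_size]))
                for i in range(0, self.num_hashes, self.band_size)]

    def add(self, tuple_text):
        """Index a newly arrived tuple and return its cluster id."""
        toks = tokens(tuple_text)
        keys = self._bucket_keys(toks)
        # Only tuples sharing a bucket are compared, not the whole database.
        candidates = {tid for k in keys for tid in self.buckets[k]}
        best, best_sim = None, self.threshold
        for tid in candidates:
            sim = jaccard(toks, self.token_sets[tid])
            if sim >= best_sim:
                best, best_sim = tid, sim
        new_id = len(self.token_sets)
        self.token_sets.append(toks)
        # Join the cluster of the most similar candidate, else start a new one.
        self.cluster_of.append(
            self.cluster_of[best] if best is not None else new_id)
        for k in keys:
            self.buckets[k].append(new_id)
        return self.cluster_of[new_id]
```

In this sketch, tuples with identical token sets always land in the same buckets, while near-duplicates collide only with some probability that depends on num_hashes and band_size. That is one way to picture the trade-off the abstract describes between retrieval efficiency and a small loss in accuracy.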

Keywords

database indexing; pattern clustering; very large databases; duplicate detection; duplicate tuple; hashing approach; incremental clustering algorithm; indexing technique; large database; Clustering algorithms; Costs; Databases; Decision making; Demography; Indexing; Information analysis; Information retrieval; Scalability; Warehousing

fLanguage

English

Publisher

IEEE

Conference_Titel

Database Engineering and Applications Symposium, 2006. IDEAS '06. 10th International

Conference_Location

Delhi

ISSN

1098-8068

Print_ISBN

0-7695-2577-6

Type

conf

DOI

10.1109/IDEAS.2006.18

Filename

4041602