An incremental clustering scheme for duplicate detection in large databases

Author

Cesario, Eugenio ; Folino, Francesco ; Manco, Giuseppe ; Pontieri, Luigi

Author_Institution

ICAR-CNR, Rende, Italy

fYear

2005

fDate

25-27 July 2005

Firstpage

89

Lastpage

95

Abstract

We propose an incremental algorithm for clustering duplicate tuples in large databases. It assigns any new tuple t to the cluster containing the database tuples most similar to t, which are therefore likely to refer to the same real-world entity as t. The core of the approach is a hash-based indexing technique that tends to assign highly similar objects to the same buckets. Empirical evaluation shows that the proposed method yields a considerable efficiency improvement over a state-of-the-art index structure for proximity searches in metric spaces.
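
The abstract does not specify the hash functions or the similarity measure, so the following is only an illustrative sketch of the general idea: an index that tends to place similar tuples in the same buckets and then assigns each new tuple to the cluster of its most similar candidate. It substitutes MinHash signatures with banding (a standard locality-sensitive hashing scheme) and Jaccard similarity over whitespace tokens; these choices, along with the names IncrementalClusterer, NUM_HASHES, BANDS and THRESHOLD, are assumptions made for illustration and are not taken from the paper.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32   # signature length (assumed value)
BANDS = 8         # signature is split into BANDS bands of equal size
THRESHOLD = 0.5   # minimum Jaccard similarity to join a cluster (assumed value)


def tokens(tuple_text):
    """Represent a tuple as the set of its lowercase whitespace tokens."""
    return set(tuple_text.lower().split())


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(token_set):
    """MinHash signature: the minimum hash value for each seeded hash."""
    return tuple(
        min((_hash(seed, tok) for tok in token_set), default=0)
        for seed in range(NUM_HASHES)
    )


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class IncrementalClusterer:
    """Hypothetical incremental clusterer backed by a bucket index."""

    def __init__(self):
        self.buckets = defaultdict(set)  # (band index, band values) -> tuple ids
        self.token_sets = []             # tuple id -> token set
        self.cluster_of = []             # tuple id -> cluster id
        self.next_cluster = 0

    def _keys(self, signature):
        """One bucket key per band of the signature."""
        rows = NUM_HASHES // BANDS
        return [(b, signature[b * rows:(b + 1) * rows]) for b in range(BANDS)]

    def add(self, tuple_text):
        """Index a new tuple and return the id of the cluster it joins."""
        toks = tokens(tuple_text)
        keys = self._keys(minhash(toks))

        # Candidates are the tuples sharing at least one bucket with the new one.
        candidates = set()
        for key in keys:
            candidates |= self.buckets[key]

        # Pick the most similar candidate at or above the threshold, if any.
        best_id, best_sim = None, THRESHOLD
        for cid in candidates:
            sim = jaccard(toks, self.token_sets[cid])
            if sim >= best_sim:
                best_id, best_sim = cid, sim

        new_id = len(self.token_sets)
        self.token_sets.append(toks)
        if best_id is None:
            self.cluster_of.append(self.next_cluster)  # open a new cluster
            self.next_cluster += 1
        else:
            self.cluster_of.append(self.cluster_of[best_id])
        for key in keys:
            self.buckets[key].add(new_id)
        return self.cluster_of[new_id]
```

In this sketch, each call to add compares the new tuple only against tuples that share a bucket with it, rather than against the whole database, which is the kind of saving a bucket-based index is meant to provide. Tuples whose token sets are identical always collide in every band and therefore end up in the same cluster; near-duplicates collide only with a probability that grows with their similarity.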

Keywords

database indexing; database tuples; duplicate detection; duplicate tuples; hash-based indexing; incremental clustering; index structure; large databases; metric spaces; proximity searches; Clustering algorithms; Clustering methods; Couplings; Data engineering; Delay; Extraterrestrial measurements; Indexing; Information retrieval; Scalability; Spatial databases

fLanguage

English

Publisher

IEEE

Conference_Titel

Database Engineering and Application Symposium, 2005. IDEAS 2005. 9th International

ISSN

1098-8068

Print_ISBN

0-7695-2404-4

Type

conf

DOI

10.1109/IDEAS.2005.10

Filename

1540899