DocumentCode :
2730423
Title :
Group Linkage
Author :
Byung-Won On ; Koudas, N. ; Dongwon Lee ; Srivastava, Divesh
Author_Institution :
Penn State Univ., USA
fYear :
2007
fDate :
15-20 April 2007
Firstpage :
496
Lastpage :
505
Abstract :
Poor quality data is prevalent in databases due to a variety of reasons, including transcription errors, lack of standards for recording database fields, etc. To be able to query and integrate such data, considerable recent work has focused on the record linkage problem, i.e., determine if two entities represented as relational records are approximately the same. Often entities are represented as groups of relational records, rather than individual relational records, e.g., households in a census survey consist of a group of persons. We refer to the problem of determining if two entities represented as groups are approximately the same as group linkage. Intuitively, two groups can be linked to each other if (i) there is high enough similarity between "matching" pairs of individual records that constitute the two groups, and (ii) there is a large fraction of such matching record pairs. In this paper, we formalize this intuition and propose a group linkage measure based on bipartite graph matching. Given a data set consisting of a large number of groups, efficiently finding groups with a high group linkage similarity to an input query group requires quickly eliminating the many groups that are unlikely to be desired matches. To enable this task, we present simpler group similarity measures that can be used either during fast pre-processing steps or as approximations to our proposed group linkage measure. These measures can be easily instantiated using SQL, permitting our techniques to be implemented inside the database system itself. We experimentally validate the utility of our measures and techniques using a variety of real and synthetic data sets.
Keywords :
SQL; graph theory; relational databases; SQL; bipartite graph matching; database system; group linkage measure; record linkage problem; relational records; Bibliographies; Bipartite graph; Couplings; Database systems; Degradation; Error correction; Impedance; Relational databases; Software libraries;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on
Conference_Location :
Istanbul
Print_ISBN :
1-4244-0802-4
Type :
conf
DOI :
10.1109/ICDE.2007.367895
Filename :
4221698
Link To Document :
بازگشت