Regional Information Center for Science and Technology - Efficient and Effective Duplicate Detection in Hierarchical Data

DocumentCode :

268087

Title :

Efficient and Effective Duplicate Detection in Hierarchical Data

Author :

Leitão, L. ; Calado, Pavel ; Herschel, M.

Author_Institution :

Inst. Super. Tecnico, Porto Salvo, Portugal

Volume :

Issue :

fYear :

2013

fDate :

May 2013

Firstpage :

1028

Lastpage :

1041

Abstract :

Although there is a long line of work on identifying duplicates in relational data, only a few solutions focus on duplicate detection in more complex hierarchical structures, like XML data. In this paper, we present a novel method for XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information within the elements, but also the way that information is structured. In addition, to improve the efficiency of the network evaluation, a novel pruning strategy, capable of significant gains over the unoptimized version of the algorithm, is presented. Through experiments, we show that our algorithm is able to achieve high precision and recall scores in several data sets. XMLDup is also able to outperform another state-of-the-art duplicate detection solution, both in terms of efficiency and of effectiveness.

Keywords :

XML; belief networks; probability; Bayesian network; XML data; XML duplicate detection; XMLDup; complex hierarchical structures; hierarchical data; novel pruning strategy; relational data; Bayesian methods; Databases; Electronic mail; Random variables; Semantics; Bayesian networks; Duplicate detection; data cleaning; entity resolution; optimization; record linkage

fLanguage :

English

Journal_Title :

Knowledge and Data Engineering, IEEE Transactions on

Publisher :

IEEE

ISSN :

1041-4347

Type :

jour

DOI :

10.1109/TKDE.2012.60

Filename :

6171189

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=268087