DocumentCode :
3322939
Title :
Approximate Joins for Data-Centric XML
Author :
Augsten, Nikolaus ; Böhlen, Michael ; Dyreson, Curtis ; Gamper, Johann
Author_Institution :
Fac. of Comput. Sci., Free Univ. of Bozen-Bolzano, Bozen
fYear :
2008
fDate :
7-12 April 2008
Firstpage :
814
Lastpage :
823
Abstract :
In data integration applications, a join matches elements that are common to two data sources. Often, however, elements are represented slightly differently in each source, so an approximate join must be used. For XML data, most approximate join strategies are based on some ordered tree matching technique. But in data-centric XML the order is irrelevant: two elements should match even if their subelement order varies. In this paper we give a solution for the approximate join of unordered trees. Our solution is based on windowed pq-grams. We develop an efficient technique to systematically generate windowed pq-grams in a three-step process: sorting the unordered tree, extending the sorted tree with dummy nodes, and computing the windowed pq-grams on the extended tree. The windowed pq-gram distance between two sorted trees approximates the tree edit distance between the respective unordered trees. The approximate join algorithm based on windowed pq-grams is implemented as an equality join on strings, which avoids the costly computation of the distance between every pair of input trees. Our experiments with synthetic and real-world data confirm the analytic results and suggest that our technique is both useful and scalable.
Keywords :
XML; tree data structures; XML data; approximate join strategies; data integration applications; data sources; data-centric XML; join matches elements; ordered tree matching; windowed pq-grams; Application software; Computer science; Data analysis; Internet; Partitioning algorithms; Polynomials; Shape; Sorting
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on
Conference_Location :
Cancun
Print_ISBN :
978-1-4244-1836-7
Electronic_ISBN :
978-1-4244-1837-4
Type :
conf
DOI :
10.1109/ICDE.2008.4497490
Filename :
4497490