DocumentCode :
2082711
Title :
On supporting effective web extraction
Author :
Han, Wook-Shin ; Kwak, Wooseong ; Yu, Pan
Author_Institution :
Dept. of Comput. Eng., Kyungpook Nat. Univ., Daegu, South Korea
fYear :
2010
fDate :
1-6 March 2010
Firstpage :
773
Lastpage :
775
Abstract :
Commercial tuple extraction systems have had some success in extracting tuples by treating HTML pages as tree structures and using XPath queries to locate tuple attributes in those pages. However, such systems are vulnerable to small changes in the web pages. In this paper, we propose a robust tuple extraction system that uses spatial relationships among elements rather than the elements' XPath queries. Our system regards elements in the rendered page as spatial objects in 2-D space and executes spatial joins to extract target elements. Since humans also identify an element in a web page by its relative spatial location, a system that extracts elements by their spatial relationships could possibly be as robust as manual extraction and is far more robust than existing tuple extraction systems.
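The spatial-join idea in the abstract can be illustrated with a minimal sketch. The abstract does not describe the paper's actual operators, so everything here is an assumption made for illustration: the `Box` type, the `right_of` predicate, the `max_gap` threshold, and the sample coordinates are all hypothetical. The sketch pairs each label element with the value element rendered immediately to its right, using only bounding boxes and no XPath.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Box:
    """Bounding box of a rendered element (hypothetical layout data)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def right(self) -> float:
        return self.x + self.w

    @property
    def bottom(self) -> float:
        return self.y + self.h


def right_of(label: Box, value: Box, max_gap: float = 50.0) -> bool:
    """True if `value` starts within `max_gap` to the right of `label`
    and the two boxes overlap vertically (assumed predicate)."""
    gap = value.x - label.right
    overlaps_vertically = value.y < label.bottom and label.y < value.bottom
    return 0 <= gap <= max_gap and overlaps_vertically


def spatial_join(
    left: List[Box],
    right: List[Box],
    predicate: Callable[[Box, Box], bool],
) -> List[Tuple[Box, Box]]:
    """Naive nested-loop spatial join: keep every pair satisfying the predicate."""
    return [(l, r) for l in left for r in right if predicate(l, r)]


# Made-up rendered positions for a small page fragment.
labels = [Box("Price:", 10, 100, 60, 20), Box("City:", 10, 130, 60, 20)]
values = [
    Box("$25", 80, 100, 40, 20),
    Box("Daegu", 80, 130, 50, 20),
    Box("Ad banner", 300, 100, 200, 60),
]

pairs = [(l.text, v.text) for l, v in spatial_join(labels, values, right_of)]
print(pairs)
```

Because the join depends only on rendered positions, wrapping the value in an extra `div` or reordering sibling nodes would leave the result unchanged as long as the visual layout stays the same. A nested loop is used only for brevity; a spatial index would be the usual choice for larger pages.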
Keywords :
Internet; query processing; tree data structures; 2D space; HTML pages; XPath queries; effective Web extraction; robust tuple extraction system; spatial objects; spatial relationship; tree structures; tuple extraction systems; Cities and towns; Computer science; Data mining; HTML; Humans; Mashups; Robustness; Spatial resolution; Tree data structures; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Engineering (ICDE), 2010 IEEE 26th International Conference on
Conference_Location :
Long Beach, CA
Print_ISBN :
978-1-4244-5445-7
Electronic_ISBN :
978-1-4244-5444-0
Type :
conf
DOI :
10.1109/ICDE.2010.5447932
Filename :
5447932