DocumentCode :
2769348
Title :
Visual Content Structures for Wrapper Induction in Building Metasearch Systems
Author :
Tsay, Jyh-Jong ; Tsay, Chin-Wen ; Wang, Xin-Jie
Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Nat. Chung Cheng Univ., Chiayi, Taiwan
Volume :
1
fYear :
2010
fDate :
Aug. 31 2010-Sept. 3 2010
Firstpage :
180
Lastpage :
183
Abstract :
As more and more online sources become available on the Web, it becomes very time-consuming, if not impossible, to visit and search all web sites one by one. Many search engines have been developed to help users find the information they need. However, search engines work poorly for online sources whose data are often in the deep web, which is not part of the surface web indexed by standard search engines. Metasearch is a very popular mechanism for searching the deep web. It lets users search and access all of the information sources with a single query submission. One of the fundamental problems in building metasearch systems is learning wrappers that extract and integrate data records from the query result pages returned by online sources. In this paper, we develop an unsupervised approach for wrapper induction that combines visual, content and HTML tag information. Our approach first learns a visual content model that alleviates HTML tag differences among data records, and then finds a tag model from all data records that match the visual content model. Experiments show that our approach works well for data sets collected from well-known search engines and shopping websites.
Keywords :
Web sites; hypermedia markup languages; query processing; retail data processing; search engines; HTML tag information; metasearch systems; query submission; shopping Websites; unsupervised approach; visual content structures; wrapper induction; VCWI; data record extraction; web information extraction
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on
Conference_Location :
Toronto, ON
Print_ISBN :
978-1-4244-8482-9
Electronic_ISBN :
978-0-7695-4191-4
Type :
conf
DOI :
10.1109/WI-IAT.2010.40
Filename :
5616254