Title :
Extraction Rule Language for Web Information Extraction and Integration
Author :
Wu Wei ; Shengsheng Shi ; Yulong Liu ; Haitao Wang ; Chunfeng Yuan ; Yihua Huang
Author_Institution :
Dept. of Comput. Sci. & Technol., Nanjing Univ., Nanjing, China
Abstract :
The Web is the largest data source and contains a great deal of valuable information of interest to users and applications. How to automatically navigate web pages and extract useful data from them is therefore an important problem. A number of studies have addressed this area, but most do not adequately consider the complete process of web information extraction, and they lack the rule expressiveness needed to describe navigation, extraction, and integration rules. In this paper, we propose a new web information extraction rule language aimed at a general model for web information extraction and integration. We first introduce source data objects to extract different types of web data records. We then adopt XML to define the target data entity structure and use scripts to perform target data record integration. The results show that our extraction rule language provides a powerful and flexible way to describe extraction logic and achieves accurate extraction of web data records from complex web pages.
Keywords :
Internet; Web sites; XML; data integration; information retrieval; Web data records extraction; Web information extraction; Web information extraction rule language; Web information integration; Web pages; complex Web pages; data entity structure; data record integration; data source; rule expression ability; Data mining; Data models; Feature extraction; HTML; Navigation; Data record; Extraction model; Extraction rule language
Conference_Title :
Web Information System and Application Conference (WISA), 2013 10th
Conference_Location :
Yangzhou
Print_ISBN :
978-1-4799-3218-4
DOI :
10.1109/WISA.2013.21