DocumentCode :
179754
Title :
Bottom-up region extractor for semi-structured web pages
Author :
Thamviset, Wachirawut ; Wongthanavasu, Sartra
Author_Institution :
Dept. of Comput. Sci., Khon Kaen Univ., Khon Kaen, Thailand
fYear :
2014
fDate :
July 30 2014-Aug. 1 2014
Firstpage :
284
Lastpage :
289
Abstract :
Generally, database websites provide interfaces that give users access to their structured data. These data are usually presented as data records within a coherent region of a result page. However, the page usually contains not only the data region but also other, extraneous regions. Therefore, the key tasks in extracting data records from these semi-structured web pages are identifying the relevant data regions and ignoring the irrelevant ones. To address this problem, this paper proposes a region extractor that serves as a preprocessing tool, helping an information extractor locate and extract the relevant data records from web pages. Most existing works analyze the DOM tree of an input page in a top-down manner. In contrast, the proposed method traverses the DOM tree bottom-up: the similarity of the leaf nodes is analyzed first to find a set of data items. Their parent nodes are then recursively analyzed to identify data records and data regions, respectively. The proposed method is a completely unsupervised, maintenance-free wrapper. For performance evaluation, it is empirically tested on 15 real-world websites. Experiments show that the proposed method achieves 94.37% accuracy in data record extraction and outperforms the well-known top-down method DEPTA (55.39%).
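The bottom-up idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify the similarity measure or the stopping rule, so the code below assumes that two leaf elements are "similar" when their root-to-leaf tag paths are identical, and that a region is reached when all ancestors of a group collapse to a single node. The names `Node`, `TreeBuilder`, `iter_leaves`, `find_regions`, and `SAMPLE_PAGE` are hypothetical, and the parser assumes well-formed HTML in which every tag is closed.

```python
from collections import defaultdict
from html.parser import HTMLParser


class Node:
    """A minimal DOM element: tag name, parent link, and child elements."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def path(self):
        """Tag path from the root; used here as a crude similarity signature."""
        parts, node = [], self
        while node is not None:
            parts.append(node.tag)
            node = node.parent
        return "/".join(reversed(parts))


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes every opened tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent


def iter_leaves(node):
    """Yield every element that has no child elements."""
    if not node.children:
        yield node
    for child in node.children:
        yield from iter_leaves(child)


def find_regions(html):
    """Return (region tag path, record count) pairs found bottom-up."""
    builder = TreeBuilder()
    builder.feed(html)

    # Step 1: group leaves with identical tag paths (candidate data items).
    groups = defaultdict(list)
    for leaf in iter_leaves(builder.root):
        groups[leaf.path()].append(leaf)

    found = {}  # id(region node) -> (region node, record nodes)
    for items in groups.values():
        if len(items) < 2:
            continue  # a lone leaf cannot form a repeated pattern
        # Step 2: climb one level at a time until all nodes share one parent.
        level = items
        while True:
            parents = list(dict.fromkeys(node.parent for node in level))
            if len(parents) == 1:
                break
            level = parents
        region = parents[0]  # the shared ancestor is the candidate region
        best = found.get(id(region))
        if best is None or len(level) > len(best[1]):
            found[id(region)] = (region, level)  # nodes just below it are records
    return [(region.path(), len(records)) for region, records in found.values()]


SAMPLE_PAGE = (
    "<html><body>"
    "<div><h1>Results</h1></div>"
    "<ul>"
    "<li><b>A</b><span>1</span></li>"
    "<li><b>B</b><span>2</span></li>"
    "<li><b>C</b><span>3</span></li>"
    "</ul>"
    "<div><p>footer</p></div>"
    "</body></html>"
)
```

Calling `find_regions(SAMPLE_PAGE)` reports the `ul` element as one region holding three records, while the heading and footer leaves are skipped because nothing repeats alongside them. The sketch makes no attempt to rank regions, so a navigation bar with repeated links would also be reported; separating relevant from irrelevant regions is the part this simplification leaves out.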
Keywords :
Web sites; trees (mathematics); unsupervised learning; DOM tree; bottom-up region extractor; database Websites; information extractor; leaf nodes; maintenance-free wrapper; semi-structured Web pages; unsupervised wrapper; Accuracy; Data mining; Databases; HTML; Web pages; Bottom-Up approach; Object boundary identification algorithms; Region extractor; Semi-structured web documents; Web Mining; Web data extraction;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Science and Engineering Conference (ICSEC), 2014 International
Conference_Location :
Khon Kaen
Print_ISBN :
978-1-4799-4965-6
Type :
conf
DOI :
10.1109/ICSEC.2014.6978209
Filename :
6978209