DocumentCode
1580663
Title
Web information extraction based on IEBIDTech
Author
Ren, Xiaoyan ; Fu, Yunxia
Author_Institution
China Three Gorges University, Yichang, 443002, China
fYear
2012
Firstpage
239
Lastpage
241
Abstract
Based on a survey of contemporary web information extraction theory, this paper studies a frequently discussed but insufficiently solved problem: extracting data from web pages that contain several structured records. It proposes a new approach called IEBIDTech (Information Extraction Based on Improved DOM Tree), which consists of three main steps. In step 1, the given page is segmented into several blocks according to HTML delimiters after its transformation into a DOM tree, and the redundant blocks are then removed. Step 2 induces the extraction rules, and step 3 extracts the structured data. A large number of experiments on web pages from diverse domains show that both recall and precision are greater than 90%. That is, the approach is able to extract data accurately.
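The abstract gives no implementation detail, so the following is only a minimal sketch of what step 1 (segmentation into blocks, then removal of redundant blocks) might look like. Everything specific in it is an assumption made for illustration and is not taken from the paper: the set of delimiter tags in `BLOCK_TAGS`, the use of link density as the redundancy test, the `max_link_ratio` threshold, and the names `BlockSegmenter` and `segment_and_denoise`. Rule induction (step 2) and record extraction (step 3) are not attempted here.

```python
from html.parser import HTMLParser

# Assumed delimiter tags; the paper's actual delimiter set is not given in the abstract.
BLOCK_TAGS = {"div", "table", "ul", "ol", "section"}


class BlockSegmenter(HTMLParser):
    """Collect the text under each outermost delimiter element as one block."""

    def __init__(self):
        super().__init__()
        self.depth = 0         # nesting depth inside delimiter tags
        self.in_link = False   # True while inside an <a> element
        self.blocks = []       # one dict per outermost delimiter element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            if self.depth == 0:
                self.blocks.append({"text": [], "link_chars": 0})
            self.depth += 1
        elif tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.depth > 0:
            self.depth -= 1
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            block = self.blocks[-1]
            block["text"].append(text)
            if self.in_link:
                block["link_chars"] += len(text)


def segment_and_denoise(html, max_link_ratio=0.5):
    """Return the text of each block whose link density is at most max_link_ratio.

    Link density as a noise signal is an assumption of this sketch,
    not the redundancy criterion used in the paper.
    """
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    kept = []
    for block in parser.blocks:
        total = sum(len(t) for t in block["text"])
        if total == 0:
            continue  # empty block, nothing to extract
        if block["link_chars"] / total <= max_link_ratio:
            kept.append(" ".join(block["text"]))
    return kept
```

Under these assumptions, a navigation bar made up only of links would be discarded as redundant, while a list of product records would be kept as a candidate block for the later steps.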
Keywords
DOM tree; Induction algorithm; Web information extraction; Web page denoising; Web page segmentation;
fLanguage
English
Publisher
IEEE
Conference_Titel
World Automation Congress (WAC), 2012
Conference_Location
Puerto Vallarta, Mexico
ISSN
2154-4824
Print_ISBN
978-1-4673-4497-5
Type
conf
Filename
6321297