DocumentCode :
1580663
Title :
Web information extraction based on IEBIDTech
Author :
Ren, Xiaoyan ; Fu, Yunxia
Author_Institution :
China Three Gorges University, Yichang, 443002, China
fYear :
2012
Firstpage :
239
Lastpage :
241
Abstract :
Based on a survey of contemporary web information extraction theory, this paper studies a frequently discussed but insufficiently solved problem: extracting data from web pages that contain several structured records. It proposes a new approach called IEBIDTech (Information Extraction Based on Improved DOM Tree), which consists mainly of three steps. In step 1, the given page is transformed into a DOM tree and segmented into several blocks according to HTML delimiters, and the redundant blocks are then removed. Step 2 induces extraction rules, and step 3 extracts the structured data. A large number of experiments on web pages from diverse domains show that both recall and precision exceed 90%, indicating that the approach can extract data accurately.
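The three steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no algorithmic detail, so the set of delimiter tags (`BLOCK_TAGS`), the link-text ratio used to discard redundant blocks, and the "most frequent child signature" rule are all assumptions chosen for illustration, and every function name is hypothetical. Only the Python standard library is used.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: the abstract does not say which HTML tags act as delimiters.
BLOCK_TAGS = {"div", "table", "ul", "ol"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""
        if parent is not None:
            parent.children.append(self)

class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree (tags and text only, attributes ignored)."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root
    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        if tag not in VOID_TAGS:
            self.cur = node
    def handle_endtag(self, tag):
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:          # ignore stray closing tags
            self.cur = n.parent
    def handle_data(self, data):
        data = data.strip()
        if data:
            self.cur.text = (self.cur.text + " " + data).strip()

def text_of(node):
    parts = [node.text] + [text_of(c) for c in node.children]
    return " ".join(p for p in parts if p)

def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)

def segment(root):
    """Step 1a: every subtree rooted at a delimiter tag is a candidate block."""
    return [n for n in descendants(root) if n.tag in BLOCK_TAGS]

def is_redundant(block, max_link_ratio=0.5):
    """Step 1b (assumed heuristic): drop empty or link-dominated blocks."""
    total = len(text_of(block))
    if total == 0:
        return True
    link_chars = sum(len(text_of(n)) for n in descendants(block) if n.tag == "a")
    return link_chars / total > max_link_ratio

def signature(node):
    return (node.tag, tuple(c.tag for c in node.children))

def induce_rule(block):
    """Step 2 (assumed rule form): the child signature that repeats most often."""
    counts = Counter(signature(c) for c in block.children)
    if not counts:
        return None
    rule, freq = counts.most_common(1)[0]
    return rule if freq >= 2 else None

def extract_records(html):
    """Step 3: apply each block's induced rule and collect field texts."""
    builder = TreeBuilder()
    builder.feed(html)
    records = []
    for block in segment(builder.root):
        if is_redundant(block):
            continue
        rule = induce_rule(block)
        if rule is None:
            continue
        for child in block.children:
            if signature(child) == rule:
                records.append([text_of(f) for f in child.children])
    return records

SAMPLE_HTML = """<html><body>
<div><a href="/">Home</a><a href="/about">About</a></div>
<table>
<tr><td>Widget</td><td>9.99</td></tr>
<tr><td>Gadget</td><td>19.50</td></tr>
</table>
</body></html>"""

if __name__ == "__main__":
    print(extract_records(SAMPLE_HTML))
```

On the sample page, the navigation `div` is discarded as link-dominated, and the two table rows share one signature, so each row is returned as one record. A real system would need a sturdier notion of record similarity than exact tag-sequence equality.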
Keywords :
DOM tree; Induction algorithm; Web information extraction; Web page denoising; Web page segmentation;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
World Automation Congress (WAC), 2012
Conference_Location :
Puerto Vallarta, Mexico
ISSN :
2154-4824
Print_ISBN :
978-1-4673-4497-5
Type :
conf
Filename :
6321297