Web information extraction based on IEBIDTech

Author

Ren, Xiaoyan ; Fu, Yunxia

Author_Institution

China Three Gorges University, Yichang, 443002, China

fYear

2012

Firstpage

239

Lastpage

241

Abstract

Based on a survey of contemporary web information extraction theory, this paper studies a frequently discussed but insufficiently solved problem: extracting data from web pages that contain several structured records. It proposes a new approach called IEBIDTech (Information Extraction Based on Improved DOM Tree), which consists of three main steps. In step 1, the given page is transformed into a DOM tree and segmented into several blocks according to HTML delimiters, and the redundant blocks are then removed. Step 2 induces the extraction rules, and step 3 extracts the structured data. Extensive experiments on web pages from diverse domains show that both recall and precision exceed 90%, indicating that the approach extracts data accurately.
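
The abstract does not give the algorithm's details, so the following is only a minimal sketch of what step 1 (segmentation by HTML delimiters, then removal of redundant blocks) might look like. It is not the authors' implementation. The delimiter set `BLOCK_TAGS`, the link-density heuristic used for denoising, the `max_link_ratio` threshold, and all function names are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed delimiter tags; the paper's actual delimiter set is not stated in the abstract.
BLOCK_TAGS = {"div", "table", "ul", "ol", "tr", "li", "p"}


class BlockSegmenter(HTMLParser):
    """Groups page text under the innermost open block-level tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished, non-empty blocks
        self._stack = []   # blocks whose tags are still open
        self._in_link = 0  # nesting depth of <a> elements

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append({"tag": tag, "text": [], "link_chars": 0})
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS and self._stack and self._stack[-1]["tag"] == tag:
            block = self._stack.pop()
            block["text"] = " ".join(block["text"])
            if block["text"]:  # keep only blocks that directly hold text
                self.blocks.append(block)

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self._stack[-1]["text"].append(text)
            if self._in_link:
                self._stack[-1]["link_chars"] += len(text)


def segment(html):
    """Step 1a (sketch): split a page into text blocks by delimiter tags."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def denoise(blocks, max_link_ratio=0.5):
    """Step 1b (sketch): drop blocks that are mostly link text, such as menus.

    Link density is an assumed stand-in for the paper's redundancy test.
    """
    return [b for b in blocks
            if b["link_chars"] / len(b["text"]) <= max_link_ratio]
```

For example, given a page with a navigation bar followed by a list of records, `denoise(segment(html))` would be expected to keep the record blocks and discard the navigation block. Rule induction (step 2) and extraction (step 3) would then operate on the surviving blocks.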

Keywords

DOM tree; Induction algorithm; Web information extraction; Web page denoising; Web page segmentation;

fLanguage

English

Publisher

IEEE

Conference_Titel

World Automation Congress (WAC), 2012

Conference_Location

Puerto Vallarta, Mexico

ISSN

2154-4824

Print_ISBN

978-1-4673-4497-5

Type

conf

Filename

6321297

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=1580663