Layout Based Information Extraction from HTML Documents

Author

Burget, Radek

Author_Institution

Brno Univ. of Technol., Brno

Volume

fYear

2007

fDate

23-26 Sept. 2007

Firstpage

624

Lastpage

628

Abstract

We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm is used to detect the document layout; the extraction process is then based on an analysis of the mutual positions of the detected blocks and their visual features. This approach is more robust than traditional DOM-based methods and opens new possibilities for specifying the extraction task.
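
The idea of extracting by block position and visual features, rather than by DOM path, can be illustrated with a minimal sketch. This is not the paper's algorithm: the `Block` structure, the "bold label with the value to its right" rule, and the sample coordinates are all assumptions made for illustration, and the blocks are taken as given rather than produced by a real page segmentation step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """A rectangular visual block (hypothetical output of page segmentation)."""
    text: str
    x: float        # left edge, in pixels
    y: float        # top edge, in pixels
    width: float
    height: float
    bold: bool = False   # one example of a visual feature


def vertical_overlap(a: Block, b: Block) -> float:
    """Length of the vertical extent shared by two blocks (0 if none)."""
    top = max(a.y, b.y)
    bottom = min(a.y + a.height, b.y + b.height)
    return max(0.0, bottom - top)


def value_right_of(label: Block, blocks: List[Block]) -> Optional[Block]:
    """Nearest block on the same visual row, to the right of the label."""
    candidates = [
        b for b in blocks
        if b is not label
        and b.x >= label.x + label.width
        and vertical_overlap(label, b) > 0
    ]
    return min(candidates, key=lambda b: b.x, default=None)


def extract_field(label_text: str, blocks: List[Block]) -> Optional[str]:
    """Find a bold label block by its text and return the text to its right."""
    for block in blocks:
        if block.bold and block.text.strip().rstrip(":") == label_text:
            value = value_right_of(block, blocks)
            return value.text if value else None
    return None


# Made-up sample layout: two label/value rows.
blocks = [
    Block("Price:", x=10, y=100, width=60, height=20, bold=True),
    Block("42 USD", x=80, y=100, width=80, height=20),
    Block("Author:", x=10, y=130, width=60, height=20, bold=True),
    Block("J. Doe", x=80, y=130, width=80, height=20),
]
print(extract_field("Author", blocks))
```

Because the rule depends only on rendered positions and styling, it would keep working if the same visual layout were produced by a table, nested `div` elements, or CSS positioning, which is the kind of robustness the abstract contrasts with DOM-based methods.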

Keywords

document handling; hypermedia markup languages; information retrieval; HTML document; document layout detection; document visual information modelling; extraction task specification; layout based information extraction; page segmentation algorithm; visual feature; Algorithm design and analysis; Cascading style sheets; Data mining; HTML; Information analysis; Information technology; Page description languages; Robustness; Text analysis; Web sites;

fLanguage

English

Publisher

IEEE

Conference_Title

Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on

Conference_Location

Parana

ISSN

1520-5363

Print_ISBN

978-0-7695-2822-9

Type

conf

DOI

10.1109/ICDAR.2007.4376990

Filename

4376990

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=2013326