DocumentCode
2013326
Title
Layout Based Information Extraction from HTML Documents
Author
Burget, Radek
Author_Institution
Brno Univ. of Technol., Brno
Volume
2
fYear
2007
fDate
23-26 Sept. 2007
Firstpage
624
Lastpage
628
Abstract
We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm is used to detect the document layout; the extraction process is then based on an analysis of the mutual positions of the detected blocks and their visual features. This approach is more robust than traditional DOM-based methods, and it opens new possibilities for specifying the extraction task.
Keywords
document handling; hypermedia markup languages; information retrieval; HTML document; document layout detection; document visual information modelling; extraction task specification; layout based information extraction; page segmentation algorithm; visual feature; Algorithm design and analysis; Cascading style sheets; Data mining; HTML; Information analysis; Information technology; Page description languages; Robustness; Text analysis; Web sites;
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
Conference_Location
Parana
ISSN
1520-5363
Print_ISBN
978-0-7695-2822-9
Type
conf
DOI
10.1109/ICDAR.2007.4376990
Filename
4376990