DocumentCode :
2021593
Title :
Web Document Parsing: A New Approach to Modeling Layout-Language Relations
Author :
Yoshida, Minoru ; Nakagawa, Hiroshi
Author_Institution :
Univ. of Tokyo, Tokyo
Volume :
1
fYear :
2007
fDate :
23-26 Sept. 2007
Firstpage :
203
Lastpage :
207
Abstract :
We propose a novel approach for extracting semantic structures from Web documents. Our task is to extract trees that describe the hierarchical relations in documents. We developed an algorithm for this task using the stochastic context-free grammar (SCFG) framework. Experiments showed that our approach worked effectively, with performance improving through parameter estimation.
Keywords :
Internet; context-free grammars; document handling; stochastic processes; Web document parsing; layout-language relations modeling; parameter estimation; semantic structures; stochastic context free grammar; HTML; Particle separators; Web pages
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
Conference_Location :
Parana
ISSN :
1520-5363
Print_ISBN :
978-0-7695-2822-9
Type :
conf
DOI :
10.1109/ICDAR.2007.4378704
Filename :
4378704