DocumentCode :
2589471
Title :
Learning nongenerative grammatical models for document analysis
Author :
Shilman, Michael ; Liang, Percy ; Viola, Paul
Author_Institution :
Microsoft Res., Redmond, WA
Volume :
2
fYear :
2005
fDate :
17-21 Oct. 2005
Firstpage :
962
Abstract :
We present a general approach for the hierarchical segmentation and labeling of document layout structures. This approach models document layout as a grammar and performs a global search for the optimal parse based on a grammatical cost function. Our contribution is to utilize machine learning to discriminatively select features and set all parameters in the parsing process. Therefore, and unlike many other approaches for layout analysis, ours can easily adapt itself to a variety of document analysis problems. One need only specify the page grammar and provide a set of correctly labeled pages. We apply this technique to two document image analysis tasks: page layout structure extraction and mathematical expression interpretation. Experiments demonstrate that the learned grammars can be used to extract the document structure in 57 files from the UWIII document image database. We also show that the same framework can be used to automatically interpret printed mathematical expressions so as to recreate the original LaTeX
Keywords :
document image processing; grammars; learning (artificial intelligence); text analysis; LaTeX; document analysis; document image analysis; document layout structures; feature selection; grammatical cost function; machine learning; mathematical expression interpretation; nongenerative grammatical models; page grammar; page layout structure extraction; parsing; Computer languages; Cost function; Dynamic programming; Image analysis; Image databases; Labeling; Libraries; Machine learning; Parameter estimation; Text analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on
Conference_Location :
Beijing
ISSN :
1550-5499
Print_ISBN :
0-7695-2334-X
Type :
conf
DOI :
10.1109/ICCV.2005.140
Filename :
1544825