DocumentCode
2144505
Title
Comparing Approaches to Mathematical Document Analysis from PDF
Author
Baker, Josef B. ; Sexton, Alan P. ; Sorge, Volker ; Suzuki, Masakazu
Author_Institution
Sch. of Comput. Sci., Univ. of Birmingham, Birmingham, UK
fYear
2011
fDate
18-21 Sept. 2011
Firstpage
463
Lastpage
467
Abstract
Document analysis of mathematical texts is a challenging problem even for born-digital documents in standard formats. We present two alternative approaches to this problem in the context of PDF documents. One uses OCR for character recognition together with a virtual link network for structural analysis. The other extracts symbol information directly from the PDF file and uses a two-stage parser to extract layout and expression structures. With reference to ground truth data, we compare the effectiveness and accuracy of the two techniques quantitatively with respect to character identification and structural analysis of mathematical expressions, and qualitatively with respect to layout analysis.
Keywords
document image processing; mathematics computing; optical character recognition; OCR approach; PDF documents; born digital documents; character recognition; expression structures; mathematical document analysis; mathematical texts; structural analysis; symbol information extraction; two stage parser; virtual link network; Character recognition; Data mining; Layout; Mathematics; Optical character recognition software; Portable document format; White spaces; Math formula recognition; PDF; layout analysis
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location
Beijing
ISSN
1520-5363
Print_ISBN
978-1-4577-1350-7
Electronic_ISBN
1520-5363
Type
conf
DOI
10.1109/ICDAR.2011.99
Filename
6065354