Title :
Comparing Approaches to Mathematical Document Analysis from PDF
Author :
Baker, Josef B. ; Sexton, Alan P. ; Sorge, Volker ; Suzuki, Masakazu
Author_Institution :
Sch. of Comput. Sci., Univ. of Birmingham, Birmingham, UK
Abstract :
Document analysis of mathematical texts is a challenging problem even for born-digital documents in standard formats. We present two alternative approaches to this problem in the context of PDF documents. One uses OCR for character recognition together with a virtual link network for structural analysis. The other extracts symbol information directly from the PDF file and uses a two-stage parser to extract layout and expression structures. With reference to ground truth data, we compare the effectiveness and accuracy of the two techniques quantitatively with respect to character identification and structural analysis of mathematical expressions, and qualitatively with respect to layout analysis.
Keywords :
document image processing; mathematics computing; optical character recognition; OCR approach; PDF documents; born digital documents; character recognition; expression structures; mathematical document analysis; mathematical texts; structural analysis; symbol information extraction; two stage parser; virtual link network; Character recognition; Data mining; Layout; Mathematics; Optical character recognition software; Portable document format; White spaces; Math formula recognition; PDF; layout analysis
Conference_Title :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4577-1350-7
Electronic_ISBN :
1520-5363
DOI :
10.1109/ICDAR.2011.99