• DocumentCode
    2144505
  • Title
    Comparing Approaches to Mathematical Document Analysis from PDF
  • Author
    Baker, Josef B. ; Sexton, Alan P. ; Sorge, Volker ; Suzuki, Masakazu
  • Author_Institution
    Sch. of Comput. Sci., Univ. of Birmingham, Birmingham, UK
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    463
  • Lastpage
    467
  • Abstract
    Document analysis of mathematical texts is a challenging problem even for born-digital documents in standard formats. We present two alternative approaches to this problem in the context of PDF documents. One uses an OCR approach for character recognition together with a virtual link network for structural analysis. The other uses direct extraction of symbol information from the PDF file with a two-stage parser to extract layout and expression structures. With reference to ground truth data, we compare the effectiveness and accuracy of the two techniques quantitatively with respect to character identification and structural analysis of mathematical expressions, and qualitatively with respect to layout analysis.
  • Keywords
    document image processing; mathematics computing; optical character recognition; OCR approach; PDF documents; born digital documents; character recognition; expression structures; mathematical document analysis; mathematical texts; structural analysis; symbol information extraction; two stage parser; virtual link network; Character recognition; Data mining; Layout; Mathematics; Optical character recognition software; Portable document format; White spaces; Math formula recognition; PDF; layout analysis;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2011.99
  • Filename
    6065354