• DocumentCode
    3489426
  • Title

    An OCR System with OCRopus for Scientific Documents Containing Mathematical Formulas

  • Author

    Furukori, F. ; Yamazaki, Shumpei ; Miyagishi, T. ; Shirai, Keigo ; Okamoto, Mitsuo

  • fYear
    2013
  • fDate
    25-28 Aug. 2013
  • Firstpage
    1175
  • Lastpage
    1179
  • Abstract
    This paper describes the installation of a mathematical formula recognition module into an open source OCR system: OCRopus. In particular, we consider the identification of inline formulas utilizing existing modules. Text lines including math formulas are first processed using an N-gram language model to reduce the number of formula candidates by thresholding the conditional probability of words. Then the formula candidates are classified into formulas and text by an SVM using geometric features associated with the bounding boxes of symbols.
  • Keywords
    document image processing; geometry; optical character recognition; probability; support vector machines; OCRopus; SVM; conditional probability; geometric features; mathematical formula recognition module; n-gram language model; open source OCR system; scientific documents; text lines; Accuracy; Image recognition; Layout; Mathematical model; Optical character recognition software; Support vector machines; Text recognition
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
  • Conference_Location
    Washington, DC
  • ISSN
    1520-5363
  • Type

    conf

  • DOI
    10.1109/ICDAR.2013.238
  • Filename
    6628799
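
The first stage named in the abstract, thresholding the conditional probability of words under an N-gram language model, can be illustrated with a minimal sketch. The record does not give the N-gram order, the smoothing method, the threshold, or the training data, so everything here is an assumption for illustration: a bigram model, add-one smoothing, a threshold of 0.2, a three-sentence toy corpus, and the hypothetical function names `train_bigram`, `cond_prob`, and `formula_candidates`. The second stage (SVM classification on bounding-box geometric features) is not shown.

```python
from collections import Counter


def train_bigram(sentences):
    """Collect context counts, bigram counts, and the vocabulary.

    Assumption: a bigram model; the record does not state the N-gram order.
    """
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>"] + tokens           # "<s>" marks the line start
        contexts.update(padded[:-1])        # words that precede another word
        bigrams.update(zip(padded, padded[1:]))
        vocab.update(tokens)
    return contexts, bigrams, vocab


def cond_prob(prev, word, model):
    """Add-one smoothed estimate of P(word | prev) (assumed smoothing)."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


def formula_candidates(tokens, model, threshold):
    """Keep words whose conditional probability falls below the threshold."""
    padded = ["<s>"] + tokens
    return [
        word
        for prev, word in zip(padded, padded[1:])
        if cond_prob(prev, word, model) < threshold
    ]


# Toy training text, purely illustrative.
corpus = [
    "we consider the function".split(),
    "we consider the value".split(),
    "the function is continuous".split(),
]
model = train_bigram(corpus)

line = "we consider the function f(x)".split()
candidates = formula_candidates(line, model, threshold=0.2)
```

In this toy setting the ordinary words score at or above 0.3 given their predecessors, while the unseen token `f(x)` scores 0.125 and is the only word kept as a formula candidate. In the pipeline the abstract describes, such candidates would then be passed to the SVM for the final formula-versus-text decision.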