A Simple Equation Region Detector for Printed Document Images in Tesseract

Author

Liu, Zongyi; Smith, Ross

Author_Institution

Google Inc., Kirkland, WA, USA

fYear

2013

fDate

25-28 Aug. 2013

Firstpage

245

Lastpage

249

Abstract

Detecting equation regions in scanned books has received attention in the document image research community over the past few years. Compared with regular text blocks, equation regions have more complicated layouts, so they cannot be modeled simply as text lines. On the other hand, these regions consist of text symbols that can be reflowed, so OCR engines should parse them rather than rasterize them as they do image regions. In this paper, we present an equation detector with two major contributions: (i) it is built on a simple algorithm that uses the density of special symbols, so no additional classifier is required; and (ii) it has been built into the open-source Tesseract engine, where the OCR community can access and use it. The algorithm is tested on the Google Books database, with 1534 entries sampled from books, magazines, and newspapers in over thirty languages. We show that Tesseract's performance improves when the detector is enabled.
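
The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a "density of special symbols" test might look like on recognized text. The symbol set, the threshold of 0.3, and the function names are assumptions made for illustration; they are not taken from the paper or from Tesseract's source.

```python
# Hypothetical illustration of a symbol-density heuristic.
# The symbol set and threshold are assumptions, not values from the paper.
MATH_SYMBOLS = set("+-=<>±×÷∑∏∫√∞≤≥≠≈∈∂^_{}()[]|/*")
DIGITS = set("0123456789")


def special_symbol_density(text: str) -> float:
    """Return the fraction of non-space characters that are digits or math symbols."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    special = sum(1 for c in chars if c in MATH_SYMBOLS or c in DIGITS)
    return special / len(chars)


def is_equation_candidate(text: str, threshold: float = 0.3) -> bool:
    """Flag a text line as a possible equation when its density reaches the threshold."""
    return special_symbol_density(text) >= threshold


if __name__ == "__main__":
    for line in ["x = (a + b) / 2", "Detecting equation regions from scanned books"]:
        print(repr(line), is_equation_candidate(line))
```

A threshold rule of this kind needs no trained classifier, which is the property the abstract highlights. A real detector would presumably work on layout blobs and their positions rather than on plain strings.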

Keywords

document image processing; optical character recognition; Google Books database; OCR community; OCR engines; Tesseract; detecting equation regions; document image research community; equation detector; equation regions; image regions; printed document images; scanned books; simple equation region detector; text blocks; text symbols; Detectors; Equations; Google; Layout; Mathematical model; Optical character recognition software; Text analysis; equation region detection; layout analysis

fLanguage

English

Publisher

IEEE

Conference_Titel

Document Analysis and Recognition (ICDAR), 2013 12th International Conference on

Conference_Location

Washington, DC

ISSN

1520-5363

Type

conf

DOI

10.1109/ICDAR.2013.56

Filename

6628621