Title :
A Simple Equation Region Detector for Printed Document Images in Tesseract
Author :
Zongyi Liu ; Smith, Ross
Author_Institution :
Google Inc., Kirkland, WA, USA
Abstract :
Detecting equation regions in scanned books has received attention in the document image research community over the past few years. Compared with regular text blocks, equation regions have more complicated layouts, so they cannot simply be modeled as text lines. On the other hand, these regions consist of text symbols that can be reflowed, so OCR engines should parse them rather than rasterize them as image regions. In this paper, we present an equation detector with two major contributions: (i) it is built on a simple algorithm that uses the density of special symbols, so no additional classifier is required; and (ii) it has been built into the open-source Tesseract engine, where it can be accessed and used by the OCR community. The algorithm is tested on the Google Books database with 1534 entries sampled from books, magazines, and newspapers in over thirty languages. We show that Tesseract's performance improves when the detector is enabled.
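The density idea in the abstract can be illustrated with a minimal sketch. This is not the paper's or Tesseract's implementation: the abstract does not define the special-symbol set, the unit over which density is measured, or any threshold, so `SPECIAL_SYMBOLS`, `special_symbol_density`, `is_equation_candidate`, and the 0.2 threshold below are all hypothetical. The sketch also works on already-recognized text strings, whereas a real detector would work on layout and symbol information from the page image.

```python
# Hypothetical sketch: flag a text line as an equation candidate when the
# fraction of "special" symbols reaches a threshold. The symbol set and the
# threshold are illustrative assumptions, not values taken from the paper.
SPECIAL_SYMBOLS = set("+-=<>/^_()[]{}|∑∫√±×÷≤≥≠∞∂")


def special_symbol_density(line: str) -> float:
    """Return the fraction of non-space characters that are special symbols."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in SPECIAL_SYMBOLS for c in chars) / len(chars)


def is_equation_candidate(line: str, threshold: float = 0.2) -> bool:
    """Threshold the density; no trained classifier is involved."""
    return special_symbol_density(line) >= threshold


lines = [
    "The quick brown fox jumps over the lazy dog.",
    "x^2 + y^2 = (a + b)/c",
]
flags = [is_equation_candidate(text) for text in lines]
```

With these assumed settings, the prose line contains no special symbols and is not flagged, while the formula line is mostly operators and brackets and is flagged. A fixed threshold keeps the rule simple, in the spirit of the "no additional classifier" claim, though any real threshold would need tuning on data.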
Keywords :
document image processing; optical character recognition; Google books database; OCR community; OCR engines; Tesseract; detecting equation regions; document image research community; equation detector; equation regions; image regions; printed document images; scanned books; simple equation region detector; text blocks; text symbols; Detectors; Equations; Google; Layout; Mathematical model; Optical character recognition software; Text analysis; equation region detection; layout analysis;
Conference_Title :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/ICDAR.2013.56