DocumentCode :
3486189
Title :
A Simple Equation Region Detector for Printed Document Images in Tesseract
Author :
Zongyi Liu ; Smith, Ray
Author_Institution :
Google Inc., Kirkland, WA, USA
fYear :
2013
fDate :
25-28 Aug. 2013
Firstpage :
245
Lastpage :
249
Abstract :
Detecting equation regions in scanned books has received attention in the document image research community in the past few years. Compared with regular text blocks, equation regions have more complicated layouts, so we cannot simply use text lines to model them. On the other hand, these regions consist of text symbols that can be reflowed, so OCR engines should parse them instead of rasterizing them like image regions. In this paper, we present an equation detector with two major contributions: (i) it is built on a simple algorithm that uses the density of special symbols, so no additional classifier is required; (ii) it has been built into the open-source Tesseract engine, where it can be accessed and used by the OCR community. The algorithm is tested on the Google Books database with 1534 entries sampled from books, magazines, and newspapers in over thirty languages. We show that Tesseract's performance improves after enabling the detector.
Keywords :
document image processing; optical character recognition; Google books database; OCR community; OCR engines; Tesseract; detecting equation regions; document image research community; equation detector; equation regions; image regions; printed document images; scanned books; simple equation region detector; text blocks; text symbols; Detectors; Equations; Google; Layout; Mathematical model; Optical character recognition software; Text analysis; document image processing; equation region detection; layout analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
ISSN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2013.56
Filename :
6628621