DocumentCode
3486189
Title
A Simple Equation Region Detector for Printed Document Images in Tesseract
Author
Zongyi Liu ; Smith, Ross
Author_Institution
Google Inc., Kirkland, WA, USA
fYear
2013
fDate
25-28 Aug. 2013
Firstpage
245
Lastpage
249
Abstract
Detecting equation regions in scanned books has received attention in the document image research community over the past few years. Compared with regular text blocks, equation regions have more complicated layouts, so we cannot simply use text lines to model them. On the other hand, these regions consist of text symbols that can be reflowed, so OCR engines should parse them instead of rasterizing them like image regions. In this paper, we present an equation detector with two major contributions: (i) it is built on a simple algorithm that uses the density of special symbols, so that no additional classifier is required; (ii) it has been built into the open source Tesseract OCR engine, where it can be accessed and used by the OCR community. The algorithm is tested on the Google Books database with 1534 entries sampled from books, magazines, and newspapers in over thirty languages. We show that Tesseract performance improves after enabling the detector.
Keywords
document image processing; optical character recognition; Google books database; OCR community; OCR engines; Tesseract; detecting equation regions; document image research community; equation detector; equation regions; image regions; printed document images; scanned books; simple equation region detector; text blocks; text symbols; Detectors; Equations; Google; Layout; Mathematical model; Optical character recognition software; Text analysis; document image processing; equation region detection; layout analysis;
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location
Washington, DC
ISSN
1520-5363
Type
conf
DOI
10.1109/ICDAR.2013.56
Filename
6628621