Improving Book OCR by Adaptive Language and Image Models

Author

Lee, Dar-Shyang; Smith, Ray

Author_Institution

Google Inc., Mountain View, CA, USA

fYear

2012

fDate

27-29 March 2012

Firstpage

115

Lastpage

119

Abstract

In order to cope with the vast diversity of book content and typefaces, it is important for OCR systems to leverage the strong consistency within a book but adapt to variations across books. We describe a system that combines two parallel correction paths using document-specific image and language models. Each model adapts to shapes and vocabularies within a book to identify inconsistencies as correction hypotheses, but relies on the other for effective cross-validation. Using the open source Tesseract engine as baseline, results on a large data set of scanned books demonstrate that word error rates can be reduced by 25 percent using this approach.

Keywords

document image processing; optical character recognition; adaptive language model; book OCR improvement; book content; correction hypothesis; document-specific image model; open source Tesseract engine; parallel correction paths; typefaces; Conferences; Text analysis; adaptive OCR; document-specific OCR; error correction

fLanguage

English

Publisher

IEEE

Conference_Title

Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on

Conference_Location

Gold Coast, QLD

Print_ISBN

978-1-4673-0868-7

Type

conf

DOI

10.1109/DAS.2012.45

Filename

6195346

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=2011079