Regional Information Center for Science and Technology (RICeST) - An Open Source Tesseract Based Optical Character Recognizer for Bangla Script

DocumentCode :

1632203

Title :

An Open Source Tesseract Based Optical Character Recognizer for Bangla Script

Author :

Hasnat, Md Abul ; Chowdhury, Muttakinur Rahman ; Khan, Mumit

Author_Institution :

Dept. of Comput. Sci. & Eng., BRAC Univ., Dhaka, Bangladesh

Year :

2009

Firstpage :

671

Lastpage :

675

Abstract :

BanglaOCR is currently the only open source optical character recognition (OCR) software for the Bangla (Bengali) script developed by the Center for Research on Bangla Language Processing (CRBLP). Tesseract, maintained by Google, is considered to be one of the most accurate free open source OCR engines currently available. In this paper, we present a new OCR for the Bangla/Bengali script that combines the recognition power of Tesseract and the Bangla script processing power of BanglaOCR by integrating the Tesseract recognition engine into BanglaOCR. We first present the complete methodology to build the combined OCR, followed by the implementation strategy. In this paper, we focus on the training data preparation process, Tesseract integration procedure and the post-processing techniques. The techniques described in this paper can be readily applied to build OCRs for other scripts as well.

Keywords :

natural language processing; optical character recognition; public domain software; search engines; Bangla script; Center for Research on Bangla Language Processing; Google; Tesseract recognition engine; open source optical character recognition; training data preparation process; Character recognition; Discrete cosine transforms; Graphical user interfaces; Open source software; Optical character recognition software; Optical sensors; Packaging; Search engines; Testing; Training data; BanglaOCR; Tesseract;

Language :

English

Publisher :

IEEE

Conference_Title :

Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on

Conference_Location :

Barcelona

ISSN :

1520-5363

Print_ISBN :

978-1-4244-4500-4

Electronic_ISBN :

1520-5363

Type :

conf

DOI :

10.1109/ICDAR.2009.62

Filename :

5277476

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1632203