Regional Information Center for Science and Technology - Devanagari Text Recognition: A Transcription Based Formulation

DocumentCode :

3487709

Title :

Devanagari Text Recognition: A Transcription Based Formulation

Author :

Sankaran, Naveen ; Neelappa, Aman ; Jawahar, C.V.

Author_Institution :

Int. Inst. of Inf. Technol., Hyderabad, India

fYear :

2013

fDate :

25-28 Aug. 2013

Firstpage :

678

Lastpage :

682

Abstract :

For most Indian scripts, Optical Character Recognition (OCR) problems are often formulated as an isolated character (symbol) classification task followed by a post-classification stage (containing modules such as Unicode generation and error correction) that generates the textual representation. Such approaches are prone to failure because of (i) the difficulty of designing a reliable word-to-symbol segmentation module that works robustly in the presence of degraded (cut/fused) images and (ii) the difficulty of converting the classifier outputs into a valid sequence of Unicodes. In this paper, we propose a formulation in which the expectations on these two modules are minimized, and the harder recognition task is modelled as learning an appropriate sequence-to-sequence translation scheme. We thus formulate recognition as a direct transcription problem. Given many examples of feature sequences and their corresponding Unicode representations, our objective is to learn a mapping that converts a word directly into a Unicode sequence. This formulation has multiple practical advantages: (i) it significantly reduces the number of classes for the Indian scripts, (ii) it removes the need for a reliable word-to-symbol segmentation, (iii) it does not require strong annotation of symbols to design the classifiers, and (iv) it directly generates a valid sequence of Unicodes. We test our method on more than 6000 pages of printed Devanagari documents from multiple sources. Our method consistently outperforms other state-of-the-art implementations.
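
To make the "Unicode sequence" target concrete, here is a minimal sketch of what the training labels could look like under a transcription formulation. It is not the authors' code: the function names `unicode_targets` and `label_alphabet` and the two sample conjuncts are hypothetical, chosen only for illustration. It shows that two visually distinct conjunct symbols decompose into a small shared set of Unicode code points, which is the intuition behind the reduced class count mentioned in the abstract.

```python
from typing import List

# Illustrative sketch only; names and sample words are assumptions,
# not taken from the paper.

def unicode_targets(word: str) -> List[str]:
    """Return the code-point sequence a transcription model would emit for a word."""
    return [f"U+{ord(ch):04X}" for ch in word]

def label_alphabet(words: List[str]) -> List[str]:
    """Return the sorted set of distinct code points (output classes) across words."""
    return sorted({ch for word in words for ch in word})

# Two Devanagari conjuncts, written with escapes to avoid font/encoding issues.
KSHA = "\u0915\u094d\u0937"  # KA + VIRAMA + SSA, rendered as one conjunct glyph
KTA = "\u0915\u094d\u0924"   # KA + VIRAMA + TA, rendered as one conjunct glyph

if __name__ == "__main__":
    # Each conjunct is a single symbol on the page but three labels in the target.
    print(unicode_targets(KSHA))
    # A symbol-level classifier would need one class per conjunct shape;
    # at the Unicode level both conjuncts draw on the same small alphabet.
    print(len(label_alphabet([KSHA, KTA])))
```

In a symbol-classification pipeline each conjunct shape would typically be its own class, whereas here the output alphabet is bounded by the code points of the script.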

Keywords :

image classification; natural language processing; optical character recognition; text analysis; Devanagari text recognition; Indian scripts; OCR problems; Unicode sequence; character symbol classification; error correction; printed Devanagari documents; sequence translation; textual representation; transcription based formulation; Unicode generation; word-to-symbol segmentation; word-to-symbol segmentation module; Accuracy; Character recognition; Degradation; Hidden Markov models; Image segmentation; Optical character recognition software; Training; BLSTM; Devanagari; OCR

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Document Analysis and Recognition (ICDAR), 2013 12th International Conference on

Conference_Location :

Washington, DC

ISSN :

1520-5363

Type :

conf

DOI :

10.1109/ICDAR.2013.139

Filename :

6628704

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3487709