DocumentCode :
3487709
Title :
Devanagari Text Recognition: A Transcription Based Formulation
Author :
Sankaran, Naveen ; Neelappa, Aman ; Jawahar, C.V.
Author_Institution :
Int. Inst. of Inf. Technol., Hyderabad, India
fYear :
2013
fDate :
25-28 Aug. 2013
Firstpage :
678
Lastpage :
682
Abstract :
For most Indian scripts, Optical Character Recognition (OCR) is often formulated as an isolated character (symbol) classification task followed by a post-classification stage (with modules such as Unicode generation and error correction) that generates the textual representation. Such approaches are prone to failure because of (i) the difficulty of designing a reliable word-to-symbol segmentation module that works robustly on degraded (cut/fused) images and (ii) the difficulty of converting the classifier outputs into a valid sequence of Unicodes. In this paper, we propose a formulation in which the expectations on these two modules are minimized, and the harder recognition task is modelled as learning an appropriate sequence-to-sequence translation scheme. We thus formulate recognition as a direct transcription problem. Given many examples of feature sequences and their corresponding Unicode representations, our objective is to learn a mapping that converts a word directly into a Unicode sequence. This formulation has multiple practical advantages: (i) it significantly reduces the number of classes for Indian scripts; (ii) it removes the need for reliable word-to-symbol segmentation; (iii) it does not require strong annotation of symbols to design the classifiers; and (iv) it directly generates a valid sequence of Unicodes. We test our method on more than 6000 pages of printed Devanagari documents from multiple sources. Our method consistently outperforms other state-of-the-art implementations.
Keywords :
image classification; natural language processing; optical character recognition; text analysis; Devanagari text recognition; Indian scripts; OCR problems; Unicode sequence; character symbol classification; error correction; printed Devanagari documents; sequence translation; textual representation; transcription based formulation; Unicode generation; word-to-symbol segmentation; word-to-symbol segmentation module; Accuracy; Character recognition; Degradation; Hidden Markov models; Image segmentation; Optical character recognition software; Training; BLSTM; Devanagari; OCR
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
ISSN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2013.139
Filename :
6628704