Transform based approach for Indic script identification from handwritten document images

Author

Obaidullah, Sk Md ; Karim, Rownaqul ; Shaikh, Sujal ; Halder, Chayan ; Das, Nibaran ; Roy, Kaushik

Author_Institution

Aliah Univ., Kolkata, India

fYear

2015

fDate

26-28 March 2015

Firstpage

1

Lastpage

7

Abstract

In a multi-script country like India, script identification from document images is an essential step before choosing an appropriate script-specific OCR (Optical Character Recognizer). Handwritten script identification is more challenging than its printed counterpart because handwriting varies with writer, time, content, etc. Document image processing researchers are making increasing efforts to develop standard techniques for Indic script identification, but most existing work considers printed-script document images. This paper proposes a simple, robust, and segmentation-free technique, based on different image transform methods and statistical features, to identify any one of four popular Indic scripts, namely Bangla, Roman, Devanagari, and Oriya. A dataset of 101 handwritten document images, comprising more than 11000 words and 1300 lines with an almost equal distribution of each script, was built from samples collected from different writers of varying age, sex, and educational qualification. In experiments, an MLP (Multilayer Perceptron) classifier achieved an average accuracy of 88.1% for the four-script combination under five-fold cross-validation. The average tri-script and bi-script accuracies were 89.7% and 94.3%, respectively. These results are encouraging considering the inherent complexities of handwritten Indic scripts.

Keywords

document image processing; handwritten character recognition; image classification; image segmentation; multilayer perceptrons; natural language processing; optical character recognition; transforms; Bangla; Devanagari; India script identification; Indic script identification; MLP classifier; OCR; Oriya; Roman; bi-scripts accuracy; document image processing researcher; handwritten document image; handwritten script identification; image transform method; multilayer perceptron classifier; multiscript country; optical character recognizer; printed script document image; segmentation free technique; statistical feature; transform based approach; tri-scripts accuracy; Discrete cosine transforms; Encoding; Euclidean distance; Handwriting recognition; Image recognition; Optical imaging; Image Transform

fLanguage

English

Publisher

IEEE

Conference_Titel

Signal Processing, Communication and Networking (ICSCN), 2015 3rd International Conference on

Conference_Location

Chennai

Print_ISBN

978-1-4673-6822-3

Type

conf

DOI

10.1109/ICSCN.2015.7219852

Filename

7219852