Search Space Reduction for Holistic Ligature Recognition in Urdu Nastalique Script

Author

El-Korashy, Amer ; Shafait, Faisal

fYear

2013

fDate

25-28 Aug. 2013

Firstpage

1125

Lastpage

1129

Abstract

This paper addresses the problem of holistic recognition of printed ligatures in the Nastalique writing style of the Urdu language. The main difficulty of the recognition process lies in the large number of classes/ligatures (17,000 different possible ligatures in our Urdu text data). This large number of classes not only limits the run-time efficiency of the recognition algorithms, but also makes it difficult to use state-of-the-art classifiers, such as Random Forests, that can only handle up to a few hundred classes. Nearest neighbor classifiers scale up well to such large-scale classification problems; however, their poor run-time efficiency poses a major obstacle. In this paper, we investigate two strategies for improving the efficiency (reducing the search space) of nearest-neighbor-based classification of Urdu ligatures. The first approach uses spectral hashing to perform approximate nearest neighbor classification. The second approach uses hierarchical classification to partition the search space based on the number of characters in a ligature. Experiments using spectral hashing show that the search space of nearest neighbor comparison can be reduced by about 50% without a loss in recognition accuracy. Further experiments demonstrate that the Random Forest classifier can be reliably used as the first-stage classifier to distinguish one-character ligatures from multiple-character ligatures in a hierarchical classification scheme. We hope that the ideas presented in this paper will lay the foundations for practical large-scale ligature classification systems, not only for Nastalique, but also for other Urdu and Arabic scripts.
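
The hierarchical idea in the abstract can be illustrated with a minimal sketch: a first-stage decision assigns a query to a partition, and the nearest neighbor comparison is then run only against training samples in that partition. Everything concrete here is an assumption for illustration, not the authors' implementation: the two-dimensional feature vectors, the labels `lig_A` to `lig_E`, the class name `PartitionedNN`, and the threshold rule `toy_first_stage`, which merely stands in for the Random Forest first stage described in the abstract.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class PartitionedNN:
    """1-nearest-neighbor search restricted to one partition of the training set.

    Hypothetical helper: `first_stage` maps a feature vector to a partition
    key such as "single" or "multi" (one-character vs. multi-character).
    """

    def __init__(self, first_stage):
        self.first_stage = first_stage
        self.buckets = defaultdict(list)

    def fit(self, vectors, labels, partitions):
        # Store every training sample in the bucket of its known partition.
        for vec, label, part in zip(vectors, labels, partitions):
            self.buckets[part].append((vec, label))
        return self

    def predict(self, query):
        # Stage 1: choose a partition. Stage 2: 1-NN inside that partition only.
        key = self.first_stage(query)
        candidates = self.buckets.get(key)
        if not candidates:
            # Unknown or empty partition: fall back to the full training set.
            candidates = [item for b in self.buckets.values() for item in b]
        best = min(candidates, key=lambda item: euclidean(item[0], query))
        # Return the label and how many comparisons were needed.
        return best[1], len(candidates)


def toy_first_stage(vec):
    """Assumed stand-in for the first-stage classifier: threshold on feature 0."""
    return "single" if vec[0] < 2.0 else "multi"


# Toy data (invented): feature 0 loosely plays the role of ligature width.
train_vectors = [(1.0, 0.2), (1.2, 0.8), (3.0, 0.1), (3.5, 0.9), (4.0, 0.5)]
train_labels = ["lig_A", "lig_B", "lig_C", "lig_D", "lig_E"]
train_parts = ["single", "single", "multi", "multi", "multi"]

model = PartitionedNN(toy_first_stage).fit(train_vectors, train_labels, train_parts)

result_single = model.predict((1.1, 0.7))  # ("lig_B", 2): 2 of 5 samples compared
result_multi = model.predict((3.4, 0.8))   # ("lig_D", 3): 3 of 5 samples compared
print(result_single, result_multi)
```

In this toy setting each query is compared against only part of the training set. The actual saving in a real system depends on how the ligature classes are distributed across partitions and on how reliable the first stage is, since a wrong first-stage decision cannot be corrected by the second stage.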

Keywords

character recognition; decision trees; information retrieval; natural language processing; pattern classification; text analysis; Arabic scripts; Nastalique writing style; Urdu Nastalique script; Urdu language; Urdu text data; hierarchical classification; holistic ligature recognition; large-scale classification problems; large-scale ligature classification systems; nearest neighbor classifiers; nearest neighbor-based classification efficiency improvement; printed ligatures; random forest classifier; recognition algorithms; search space partitioning; search space reduction; state-of-the-art classifiers; spectral hashing; Accuracy; Context; Histograms; Shape; Text recognition; Training; Vectors; Character Recognition; Nearest Neighbor classification; Urdu Nastalique

fLanguage

English

Publisher

IEEE

Conference_Title

Document Analysis and Recognition (ICDAR), 2013 12th International Conference on

Conference_Location

Washington, DC

ISSN

1520-5363

Type

conf

DOI

10.1109/ICDAR.2013.228

Filename

6628789

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=3489237