Title :
Search Space Reduction for Holistic Ligature Recognition in Urdu Nastalique Script
Author :
El-Korashy, Amer ; Shafait, Faisal
Abstract :
This paper addresses the problem of holistic recognition of printed ligatures in the Nastalique writing style of the Urdu language. The main difficulty of the recognition process lies in the large number of classes, i.e. ligatures (17,000 different possible ligatures in our Urdu text data). This large number of classes not only limits the run-time efficiency of the recognition algorithms, but also makes it difficult to use state-of-the-art classifiers, such as Random Forests, that can only handle up to a few hundred classes. Nearest neighbor classifiers scale up well to such large-scale classification problems; however, their poor run-time efficiency poses a major obstacle. In this paper, we investigate two strategies for improving the efficiency of nearest neighbor based classification of Urdu ligatures by reducing its search space. The first approach uses spectral hashing to perform approximate nearest neighbor classification. The second approach uses hierarchical classification to partition the search space according to the number of characters in a ligature. Experiments using spectral hashing show that the search space of nearest neighbor comparison can be reduced by about 50% without a loss in recognition accuracy. Further experiments demonstrate that the Random Forest classifier can be reliably used as the first-stage classifier to distinguish one-character ligatures from multiple-character ligatures in a hierarchical classification scheme. We hope that the ideas presented in this paper will lay the foundations for practical large-scale ligature classification systems, not only for Nastalique but also for other Urdu and Arabic scripts.
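The hierarchical idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `PartitionedNearestNeighbor`, the two-dimensional toy feature vectors, and the ligature labels are all hypothetical, and the first-stage decision (one-character versus multiple-character ligature, made by a Random Forest in the paper) is assumed to be supplied by the caller as a boolean.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class PartitionedNearestNeighbor:
    """1-NN classifier whose training set is split into two buckets:
    one-character ligatures and multiple-character ligatures.
    (Hypothetical illustration, not the method from the paper.)"""

    def __init__(self):
        self.buckets = defaultdict(list)  # is_single -> [(features, label)]

    def add(self, features, label, is_single):
        """Store a training sample in the bucket for its ligature type."""
        self.buckets[is_single].append((features, label))

    def classify(self, features, first_stage_is_single):
        """Return (label, number_of_comparisons).

        first_stage_is_single is the output of an assumed first-stage
        classifier. Only the matching bucket is searched; if that bucket
        is empty, the search falls back to all stored samples.
        """
        candidates = self.buckets.get(first_stage_is_single)
        if not candidates:
            candidates = [s for b in self.buckets.values() for s in b]
        if not candidates:
            raise ValueError("no training samples stored")
        best = min(candidates, key=lambda s: euclidean(features, s[0]))
        return best[1], len(candidates)


# Toy usage with made-up features and labels.
index = PartitionedNearestNeighbor()
index.add([0.0, 0.1], "alif", is_single=True)
index.add([0.2, 0.0], "bay", is_single=True)
index.add([5.0, 5.1], "lam-alif", is_single=False)
index.add([6.0, 4.9], "bay-alif", is_single=False)
label, compared = index.classify([5.1, 5.0], first_stage_is_single=False)
print(label, compared)
```

In this toy setup the query is compared only against the samples in the bucket chosen by the first stage, rather than against every stored sample. The price is that a first-stage error sends the query to the wrong bucket, which is why the reliability of the first-stage classifier matters.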
Keywords :
character recognition; decision trees; information retrieval; natural language processing; pattern classification; text analysis; Arabic scripts; Nastalique writing style; Urdu Nastalique script; Urdu language; Urdu text data; hierarchical classification; holistic ligature recognition; large-scale classification problems; large-scale ligature classification systems; nearest neighbor classifiers; nearest neighbor-based classification efficiency improvement; printed ligatures; random forest classifier; recognition algorithms; search space partitioning; search space reduction; state-of-the-art classifiers; spectral hashing; Accuracy; Context; Histograms; Shape; Text recognition; Training; Vectors; Character Recognition; Nearest Neighbor classification; Urdu Nastalique
Conference_Title :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/ICDAR.2013.228