• DocumentCode
    3489237
  • Title

    Search Space Reduction for Holistic Ligature Recognition in Urdu Nastalique Script

  • Author

    El-Korashy, Amer ; Shafait, Faisal

  • fYear
    2013
  • fDate
    25-28 Aug. 2013
  • Firstpage
    1125
  • Lastpage
    1129
  • Abstract
    This paper addresses the problem of holistic recognition of printed ligatures in the Nastalique writing style of the Urdu language. The main difficulty of the recognition process lies in the large number of classes/ligatures (17,000 different possible ligatures in our Urdu text data). This large number of classes not only limits the run-time efficiency of the recognition algorithms, but also makes it difficult to use state-of-the-art classifiers, such as Random Forests, that can only handle up to a few hundred classes. Nearest neighbor classifiers scale up well to such large-scale classification problems; however, their poor run-time efficiency poses a major obstacle. In this paper, we investigate two strategies for improving the efficiency of nearest-neighbor-based classification of Urdu ligatures by reducing the search space. The first approach uses spectral hashing to perform approximate nearest neighbor classification. The second approach uses hierarchical classification to partition the search space based on the number of characters in a ligature. Experiments using spectral hashing show that the search space of nearest neighbor comparison can be reduced by about 50% without a loss in recognition accuracy. Further experiments demonstrate that the Random Forest classifier can be reliably used as the first-stage classifier to distinguish one-character ligatures from multiple-character ligatures in a hierarchical classification scheme. We hope that the ideas presented in this paper will lay the foundations for practical large-scale ligature classification systems, not only for Nastalique but also for other Urdu and Arabic scripts.
  • Keywords
    character recognition; decision trees; information retrieval; natural language processing; pattern classification; text analysis; Arabic scripts; Nastalique writing style; Urdu Nastalique script; Urdu language; Urdu text data; hierarchical classification; holistic ligature recognition; large-scale classification problems; large-scale ligature classification systems; nearest neighbor classifiers; nearest neighbor-based classification efficiency improvement; printed ligatures; random forest classifier; recognition algorithms; search space partitioning; search space reduction; state-of-the-art classifiers; spectral hashing; Accuracy; Context; Histograms; Shape; Text recognition; Training; Vectors; Character Recognition; Nearest Neighbor classification; Urdu Nastalique;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
  • Conference_Location
    Washington, DC
  • ISSN
    1520-5363
  • Type

    conf

  • DOI
    10.1109/ICDAR.2013.228
  • Filename
    6628789
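
The hierarchical scheme described in the abstract (a first-stage classifier predicts the number of characters, then nearest neighbor search runs only inside that partition) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the 2-D feature vectors, and the ligature labels are hypothetical, and the first-stage prediction is passed in as a plain argument rather than produced by a Random Forest as in the paper.

```python
import math
from collections import defaultdict

# Hypothetical toy data: (feature_vector, ligature_label, number_of_characters).
TOY_SAMPLES = [
    ((0.0, 0.0), "alef", 1),
    ((1.0, 0.0), "bay", 1),
    ((5.0, 5.0), "lam-alef", 2),
    ((6.0, 5.0), "bay-alef", 2),
]


def build_partitions(samples):
    """Group training samples by the number of characters in the ligature."""
    partitions = defaultdict(list)
    for features, label, n_chars in samples:
        partitions[n_chars].append((features, label))
    return partitions


def nearest_label(query, candidates):
    """Exact nearest neighbor (Euclidean distance) over the given candidates."""
    best = min(candidates, key=lambda item: math.dist(query, item[0]))
    return best[1]


def classify(query, predicted_n_chars, partitions):
    """Search only the partition chosen by the (assumed) first-stage classifier.

    Returns the predicted label and the number of candidates compared,
    so the reduction in search space is visible.
    """
    candidates = partitions.get(predicted_n_chars)
    if not candidates:
        # Unknown partition: fall back to searching every training sample.
        candidates = [item for group in partitions.values() for item in group]
    return nearest_label(query, candidates), len(candidates)


partitions = build_partitions(TOY_SAMPLES)
label, searched = classify((0.9, 0.1), 1, partitions)
print(label, searched)
```

In this toy setup the query is compared against only the one-character partition instead of the full training set. An error in the first stage sends the query to the wrong partition, which is presumably why the abstract stresses that the first-stage classifier must be reliable.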