• DocumentCode
    153367
  • Title
    Adapting Tesseract for Complex Scripts: An Example for Urdu Nastalique
  • Author
    Akram, Qurat Ul Ain ; Hussain, Shiraz ; Niazi, Aneta ; Anjum, Umair ; Irfan, Faheem
  • Author_Institution
    Center for Language Eng., Univ. of Eng. & Technol., Lahore, Pakistan
  • fYear
    2014
  • fDate
    7-10 April 2014
  • Firstpage
    191
  • Lastpage
    195
  • Abstract
    The Tesseract engine supports multilingual text recognition. However, recognizing cursive scripts with Tesseract is challenging. In this paper, the Tesseract engine is analyzed and modified to recognize the Nastalique writing style of Urdu, a highly complex and cursive style of the Arabic script. The original Tesseract system achieves accuracies of 65.59% and 65.84% for font sizes 14 and 16, respectively, whereas the modified system, with a reduced search space, achieves 97.87% and 97.71%, respectively. Efficiency also improves: the average time to recognize Nastalique document images drops from 170 milliseconds (ms) to 84 ms.
  • Keywords
    document image processing; handwritten character recognition; natural language processing; Arabic script; Nastalique document images; Nastalique writing style; Tesseract engine; Urdu language; complex scripts; cursive scripts recognition; multilingual text recognition; Accuracy; Character recognition; Engines; Optical character recognition software; Shape; Text recognition; Writing; Nastalique; OCR; Tesseract; Urdu
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on
  • Conference_Location
    Tours
  • Print_ISBN
    978-1-4799-3243-6
  • Type
    conf
  • DOI
    10.1109/DAS.2014.45
  • Filename
    6830996