DocumentCode
153367
Title
Adapting Tesseract for Complex Scripts: An Example for Urdu Nastalique
Author
Akram, Qurat Ul Ain ; Hussain, Shiraz ; Niazi, Aneta ; Anjum, Umair ; Irfan, Faheem
Author_Institution
Center for Language Eng., Univ. of Eng. & Technol., Lahore, Pakistan
fYear
2014
fDate
7-10 April 2014
Firstpage
191
Lastpage
195
Abstract
The Tesseract engine supports multilingual text recognition. However, recognizing cursive scripts with Tesseract is a challenging task. In this paper, the Tesseract engine is analyzed and modified to recognize the Nastalique writing style for the Urdu language, a very complex and cursive writing style of the Arabic script. The original Tesseract system achieves accuracies of 65.59% and 65.84% for font sizes 14 and 16 respectively, whereas the modified system, with a reduced search space, achieves 97.87% and 97.71% respectively. Efficiency also improves: the average recognition time for Nastalique document images drops from 170 milliseconds (ms) to 84 ms.
Keywords
document image processing; handwritten character recognition; natural language processing; Arabic script; Nastalique document images; Nastalique writing style; Tesseract engine; Urdu language; complex scripts; cursive scripts recognition; multilingual text recognition; Accuracy; Character recognition; Engines; Optical character recognition software; Shape; Text recognition; Writing; Nastalique; OCR; Tesseract; Urdu
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on
Conference_Location
Tours
Print_ISBN
978-1-4799-3243-6
Type
conf
DOI
10.1109/DAS.2014.45
Filename
6830996