DocumentCode
2148764
Title
High Performance Layout Analysis of Arabic and Urdu Document Images
Author
Bukhari, Syed Saqib ; Shafait, Faisal ; Breuel, Thomas M.
Author_Institution
Tech. Univ. of Kaiserslautern, Kaiserslautern, Germany
fYear
2011
fDate
18-21 Sept. 2011
Firstpage
1275
Lastpage
1279
Abstract
Text-line extraction and reading order determination are important steps in optical character recognition (OCR) systems. Research in OCR of Arabic script documents has primarily focused on character recognition, and therefore most researchers use primitive methods such as projection profile analysis for text-line extraction. Although projection methods achieve good accuracy on clean, skew-corrected documents, their performance drops under challenging conditions (border noise, skew, complex layouts). This paper presents a robust layout analysis system for extracting text-lines in reading order from scanned Arabic script document images written in different languages (Arabic, Urdu, Persian) and styles (Naskh, Nastaliq). The presented system is based on a suitable combination of well-established techniques for analyzing Latin script documents that have proven to be robust against different types of document image degradation. Evaluation of the presented system on Arabic and Urdu document image datasets consisting of a variety of complex single- and multi-column layouts shows high accuracy for text and non-text segmentation, text-line extraction, and reading order determination.
Keywords
document image processing; image segmentation; natural language processing; optical character recognition; text analysis; Latin script document analysis; Urdu document images; document image degradations; high performance layout analysis; multicolumn layouts; optical character recognition systems; projection profile analysis; reading order determination; robust layout analysis system; scanned Arabic script document images; skew-corrected documents; text line extraction; text segmentation; Accuracy; Image resolution; Image segmentation; Layout; Morphology; Performance evaluation; Text analysis; Document Layout Analysis; Reading Order Determination; Text Image Segmentation; Text-Line Segmentation
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location
Beijing
ISSN
1520-5363
Print_ISBN
978-1-4577-1350-7
Electronic_ISBN
1520-5363
Type
conf
DOI
10.1109/ICDAR.2011.257
Filename
6065515