Title :
OCR-Free Table of Contents Detection in Urdu Books
Author :
Ul-Hasan, Adnan ; Bukhari, Syed Saqib ; Shafait, Faisal ; Breuel, Thomas M.
Author_Institution :
Dept. of Comput. Sci., Tech. Univ. of Kaiserslautern, Kaiserslautern, Germany
Abstract :
A Table of Contents (ToC) is an integral part of multiple-page documents such as books and magazines. Most existing techniques rely on textual similarity to detect ToC pages automatically. However, such techniques cannot be applied where OCR technology is not available, which is the case for historical documents and for many modern Nabataean (Arabic) and Indic scripts. It is therefore necessary to develop tools for navigating such documents without the use of OCR. This paper reports a preliminary effort to address this challenge. The proposed algorithm has been applied to find ToC pages in Urdu books, and an overall initial accuracy of 88% has been achieved.
Keywords :
document image processing; history; Indic scripts; OCR technology; OCR-free table of contents detection; Urdu books; historical documents; magazines; modern Nabataean scripts; multiple-page documents; textual similarity; Feature extraction; Image segmentation; Navigation; Optical character recognition software; Text analysis; Training; Vectors; Auto MLP; Book structure extraction; OCR-free ToC detection; Urdu document image analysis
Conference_Title :
Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on
Conference_Location :
Gold Coast, QLD
Print_ISBN :
978-1-4673-0868-7
DOI :
10.1109/DAS.2012.59