DocumentCode
2014094
Title
Content-level Annotation of Large Collection of Printed Document Images
Author
Kumar, Anand ; Jawahar, C.V.
Author_Institution
Int. Inst. of Inf. Technol., Hyderabad
Volume
2
fYear
2007
fDate
23-26 Sept. 2007
Firstpage
799
Lastpage
803
Abstract
A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creating annotated corpora is tedious and laborious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotating large collections of printed document images. We align document images with independently keyed-in text. The method is model-driven and is intended to annotate a large collection of documents, scanned at three different resolutions, at the character level. We employ an XML representation to store the annotation information. APIs provide content-level access for easy use in training and evaluating OCRs and in other document understanding tasks.
Keywords
XML; optical character recognition; XML representation; annotated data; document images; optical character recognizers; Buildings; Character recognition; Data mining; Image recognition; Information technology; Machine learning algorithms; Natural languages; Optical character recognition software; Robustness; Writing
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
Conference_Location
Parana
ISSN
1520-5363
Print_ISBN
978-0-7695-2822-9
Type
conf
DOI
10.1109/ICDAR.2007.4377025
Filename
4377025