DocumentCode
2438860
Title
Automatic generation of character groundtruth for scanned documents: a closed-loop approach
Author
Kanungo, Tapas ; Haralick, Robert M.
Author_Institution
Caere Corp., Los Gatos, CA, USA
Volume
3
fYear
1996
fDate
25-29 Aug 1996
Firstpage
669
Abstract
Character groundtruth for scanned document images is crucial for evaluating OCR system performance, training OCR algorithms, and validating document degradation models. Manual collection of accurate character groundtruth in a real (scanned) document image is not possible because (i) the accuracy of manually delineated character bounding boxes is too low, (ii) the work is very laborious and time consuming, and (iii) the manual labor required is prohibitively expensive. We present a closed-loop methodology. We first create ideal documents using a typesetting language. Next, we create the groundtruth for the ideal document. The document is then printed, photocopied, and scanned. A registration algorithm estimates the geometric transformation that registers the ideal document image to the scanned document image. Finally, the groundtruth associated with the ideal document image is transformed using the estimated geometric transformation to create the groundtruth for the scanned document image. This methodology is very general and can be used to create groundtruth for documents typeset in any language, layout, font, and style. The cost of creating groundtruth using our methodology is minimal. We use this methodology to groundtruth 33 English documents consisting of over 62,000 symbols. The procedure takes approximately 5 minutes per page on a SUN Sparc 10. We also use the method for Hindi and FAX documents.
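The registration and transfer steps described in the abstract can be illustrated with a minimal sketch. It assumes the geometric transformation is affine and that matched point pairs between the ideal and scanned images are already available; the abstract specifies neither, so the transform model, the function names `estimate_affine` and `transform_box`, and the sample coordinates are all illustrative assumptions rather than the authors' method.

```python
import numpy as np

def estimate_affine(ideal_pts, scanned_pts):
    """Least-squares 2x3 affine map taking ideal_pts to scanned_pts.

    Assumption: an affine model is used here only for illustration.
    """
    ideal = np.asarray(ideal_pts, dtype=float)
    scanned = np.asarray(scanned_pts, dtype=float)
    # Homogeneous coordinates: each row is [x, y, 1].
    A = np.hstack([ideal, np.ones((len(ideal), 1))])
    # Solve A @ M ~= scanned for the 3x2 matrix M.
    M, *_ = np.linalg.lstsq(A, scanned, rcond=None)
    return M.T  # shape (2, 3)

def transform_box(box, T):
    """Map a box (x0, y0, x1, y1) through T; return the enclosing axis-aligned box."""
    x0, y0, x1, y1 = box
    corners = np.array(
        [[x0, y0, 1], [x1, y0, 1], [x1, y1, 1], [x0, y1, 1]], dtype=float
    )
    mapped = corners @ T.T  # shape (4, 2)
    xmin, ymin = mapped.min(axis=0)
    xmax, ymax = mapped.max(axis=0)
    return (float(xmin), float(ymin), float(xmax), float(ymax))

# Hypothetical correspondences: the scan is the ideal page scaled by 2
# and shifted by (10, 5).
ideal_pts = [(0, 0), (1, 0), (0, 1), (1, 1)]
scanned_pts = [(10, 5), (12, 5), (10, 7), (12, 7)]
T = estimate_affine(ideal_pts, scanned_pts)

# Transfer one ideal-document character box into scanned-image coordinates.
scanned_box = transform_box((1, 1, 3, 2), T)
```

Returning the enclosing axis-aligned box keeps the transferred groundtruth in the same rectangle format as the ideal groundtruth, at the cost of slightly loose boxes when the page is rotated or skewed.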
Keywords
closed loop systems; document image processing; image registration; optical character recognition; performance evaluation; 5 min; OCR algorithm training; OCR system performance evaluation; character groundtruth generation; closed-loop approach; document degradation model validation; photocopying; registration algorithm; scanned documents; Character generation; Costs; Data mining; Degradation; Image analysis; Natural languages; Optical character recognition software; Sun; Text analysis; Typesetting;
fLanguage
English
Publisher
IEEE
Conference_Titel
Pattern Recognition, 1996., Proceedings of the 13th International Conference on
Conference_Location
Vienna
ISSN
1051-4651
Print_ISBN
0-8186-7282-X
Type
conf
DOI
10.1109/ICPR.1996.547030
Filename
547030