DocumentCode :
2438860
Title :
Automatic generation of character groundtruth for scanned documents: a closed-loop approach
Author :
Kanungo, Tapas ; Haralick, Robert M.
Author_Institution :
Caere Corp., Los Gatos, CA, USA
Volume :
3
fYear :
1996
fDate :
25-29 Aug 1996
Firstpage :
669
Abstract :
Character groundtruth for scanned document images is crucial for evaluating OCR system performance, training OCR algorithms, and validating document degradation models. Manual collection of accurate character groundtruth in a real (scanned) document image is not possible because (i) accuracy in delineating groundtruth character bounding boxes is too low, (ii) it is very laborious and time-consuming, and (iii) the manual labor required is prohibitively expensive. We present a closed-loop methodology. We first create ideal documents using a typesetting language. Next we create the groundtruth for the ideal document. The document is then printed, photocopied and scanned. A registration algorithm estimates the geometric transformation that registers the ideal document image to the scanned document image. Finally, groundtruth associated with the ideal document image is transformed using the estimated geometric transform to create the groundtruth for the scanned document image. This methodology is very general and can be used for creating groundtruth for documents typeset in any language, layout, font, and style. The cost of creating groundtruth using our methodology is minimal. We use this methodology to groundtruth 33 English documents consisting of over 62,000 symbols. The procedure takes approximately 5 minutes per page on a SUN Sparc 10. We also use the method for Hindi and FAX documents.
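The registration and groundtruth-transfer steps described in the abstract can be illustrated with a minimal sketch. The abstract does not state the form of the geometric transformation, so this example assumes a projective (homography) model and assumes that matched feature points between the ideal and scanned images are already available. The function names are hypothetical and are not taken from the paper.

```python
import numpy as np


def estimate_homography(src, dst):
    """Least-squares (DLT) estimate of the 3x3 projective map sending src to dst.

    src, dst: sequences of at least four matching (x, y) points.
    Assumption: a projective model; the abstract only says "geometric transformation".
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector for the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]


def apply_homography(h, pts):
    """Map an array of (x, y) points through the homography h."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ h.T
    return mapped[:, :2] / mapped[:, 2:3]


def transform_box(h, box):
    """Map an ideal-image bounding box (x0, y0, x1, y1) into the scanned image.

    All four corners are transformed and their axis-aligned envelope is returned.
    """
    x0, y0, x1, y1 = box
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    m = apply_homography(h, corners)
    return (m[:, 0].min(), m[:, 1].min(), m[:, 0].max(), m[:, 1].max())
```

In use, `estimate_homography` would be called once per page on the matched points, and `transform_box` would then be applied to each character bounding box in the ideal groundtruth.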
Keywords :
closed loop systems; document image processing; image registration; optical character recognition; performance evaluation; 5 min; OCR algorithm training; OCR system performance evaluation; character groundtruth generation; closed-loop approach; document degradation model validation; photocopying; registration algorithm; scanned documents; Character generation; Costs; Data mining; Degradation; Image analysis; Natural languages; Optical character recognition software; Sun; Text analysis; Typesetting;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Proceedings of the 13th International Conference on Pattern Recognition, 1996
Conference_Location :
Vienna
ISSN :
1051-4651
Print_ISBN :
0-8186-7282-X
Type :
conf
DOI :
10.1109/ICPR.1996.547030
Filename :
547030