• DocumentCode
    2438860
  • Title

    Automatic generation of character groundtruth for scanned documents: a closed-loop approach

  • Author

    Kanungo, Tapas ; Haralick, Robert M.

  • Author_Institution
    Caere Corp., Los Gatos, CA, USA
  • Volume
    3
  • fYear
    1996
  • fDate
    25-29 Aug 1996
  • Firstpage
    669
  • Abstract
    Character groundtruth for scanned document images is crucial for evaluating OCR system performance, training OCR algorithms, and validating document degradation models. Manual collection of accurate character groundtruth in a real (scanned) document image is not possible because (i) accuracy in delineating groundtruth character bounding boxes is too low, (ii) it is very laborious and time-consuming, and (iii) the manual labor required is prohibitively expensive. We present a closed-loop methodology. We first create ideal documents using a typesetting language. Next we create the groundtruth for the ideal document. The document is then printed, photocopied, and scanned. A registration algorithm estimates the geometric transformation that registers the ideal document image to the scanned document image. Finally, groundtruth associated with the ideal document image is transformed using the estimated geometric transform to create the groundtruth for the scanned document image. This methodology is very general and can be used for creating groundtruth for documents typeset in any language, layout, font, and style. The cost of creating groundtruth using our methodology is minimal. We use this methodology to groundtruth 33 English documents consisting of over 62,000 symbols. The procedure takes approximately 5 minutes per page on a SUN Sparc 10. We also use the method for Hindi and FAX documents.
  • Keywords
    closed loop systems; document image processing; image registration; optical character recognition; performance evaluation; 5 min; OCR algorithm training; OCR system performance evaluation; character groundtruth generation; closed-loop approach; document degradation model validation; photocopying; registration algorithm; scanned documents; Character generation; Costs; Data mining; Degradation; Image analysis; Natural languages; Optical character recognition software; Sun; Text analysis; Typesetting
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Pattern Recognition, 1996., Proceedings of the 13th International Conference on
  • Conference_Location
    Vienna
  • ISSN
    1051-4651
  • Print_ISBN
    0-8186-7282-X
  • Type

    conf

  • DOI
    10.1109/ICPR.1996.547030
  • Filename
    547030
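
The last two steps described in the abstract (estimating a geometric transformation between the ideal and scanned images, then pushing the ideal groundtruth through it) can be illustrated with a minimal sketch. The abstract does not say which transformation model or registration algorithm the authors use, so everything below is an assumption made for illustration: the affine model, the least-squares fit, the availability of matched control points, and the hypothetical names solve_3x3, estimate_affine, transform_point, and transform_box.

```python
def solve_3x3(a, b):
    """Solve a @ x = b for a 3x3 matrix a using Gauss-Jordan elimination."""
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        # Partial pivoting: bring the largest remaining entry into place.
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("control points are degenerate (too few or collinear)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def estimate_affine(ideal_pts, scanned_pts):
    """Least-squares affine fit (an assumed model) from ideal to scanned points.

    Returns [[a, b, tx], [c, d, ty]] such that
    x' = a*x + b*y + tx and y' = c*x + d*y + ty.
    """
    rows = [(x, y, 1.0) for x, y in ideal_pts]
    # Normal equations: (A^T A) p = A^T b, solved once per output coordinate.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for k in range(2):
        atb = [sum(r[i] * p[k] for r, p in zip(rows, scanned_pts)) for i in range(3)]
        params.append(solve_3x3(ata, atb))
    return params


def transform_point(pt, t):
    """Apply the affine parameters t to a single (x, y) point."""
    x, y = pt
    return (t[0][0] * x + t[0][1] * y + t[0][2],
            t[1][0] * x + t[1][1] * y + t[1][2])


def transform_box(box, t):
    """Map a groundtruth box (x0, y0, x1, y1) into scanned-image coordinates.

    All four corners are transformed and the axis-aligned box enclosing
    them is returned, so the result stays a valid box under rotation or skew.
    """
    x0, y0, x1, y1 = box
    corners = [transform_point(c, t) for c in ((x0, y0), (x1, y0), (x1, y1), (x0, y1))]
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    return (min(xs), min(ys), max(xs), max(ys))
```

Under these assumptions, a caller would pass at least three non-collinear matched points (for example, registration marks located in both images) to estimate_affine, then call transform_box on every character bounding box from the ideal document. A real print, photocopy, and scan cycle may introduce distortions that an affine model does not capture, so a richer model or local refinement could be needed in practice.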