• DocumentCode
    3340248
  • Title

    Automated OCR Ground Truth Generation

  • Author

    Beusekom, Joost van ; Shafait, Faisal ; Breuel, Thomas M.

  • fYear
    2008
  • fDate
    16-19 Sept. 2008
  • Firstpage
    111
  • Lastpage
    117
  • Abstract
    Most optical character recognition (OCR) systems need to be trained and tested on the symbols that are to be recognized. Therefore, ground truth data is needed. This data consists of character images together with their ASCII codes. Among the approaches for generating ground truth from real-world data, one promising technique is to use the electronic versions of the scanned documents. Using an alignment method, the character bounding boxes extracted from the electronic document are matched to the scanned image. Current alignment methods are not robust to different similarity transforms. They also need calibration to deal with non-linear local distortions introduced by the printing/scanning process. In this paper we present a significant improvement over existing methods, one that skips the calibration step and yields a more accurate alignment under all similarity transforms. Our method finds a robust, pixel-accurate, scanner-independent alignment of the scanned image with the electronic document, allowing the extraction of accurate ground truth character information. The accuracy of the alignment is demonstrated using documents from the UW3 dataset. The results show that the mean distance between the estimated and the ground truth character bounding box positions is less than one pixel.
  • Keywords
    Calibration; Character recognition; Data mining; Nonlinear distortion; Nonlinear optics; Optical character recognition software; Optical distortion; Printing; Robustness; System testing; ground truth generation; image registration; optical character recognition
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis Systems, 2008. DAS '08. The Eighth IAPR International Workshop on
  • Conference_Location
    Nara, Japan
  • Print_ISBN
    978-0-7695-3337-7
  • Type

    conf

  • DOI
    10.1109/DAS.2008.59
  • Filename
    4669952