• DocumentCode
    3022256
  • Title
    Text degradations and OCR training
  • Author
    Barney Smith, Elisa H.; Andersen, Tim
  • Author_Institution
    Boise State Univ., ID, USA
  • fYear
    2005
  • fDate
    29 Aug.-1 Sept. 2005
  • Firstpage
    834
  • Abstract
    Printing and scanning of text documents introduce degradations to the characters, and these degradations can be modeled. Certain combinations of the parameters that govern the printing and scanning degradations affect characters in such a way that the degraded characters look similar, while other combinations leave the characters looking very different. It is well known that, generally speaking, a test set that closely matches the training set is recognized with higher accuracy than one that matches it less well. Likewise, classifiers tend to perform better on data sets that have lower variance. This paper explores an analytical method that uses a formal printer/scanner degradation model to identify the similarity between groups of degraded characters. This similarity is shown to improve the recognition accuracy of a classifier through model-directed choice of training set data.
  • Keywords
    document image processing; image classification; image matching; image scanners; optical character recognition; printers; text analysis; OCR training; character degradation; printer degradation; scanner degradation; text degradation; text document printing; text document scanning; Character recognition; Degradation; Engines; Nearest neighbor searches; Optical character recognition software; Optical noise; Printing; Testing; Training data
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on
  • ISSN
    1520-5263
  • Print_ISBN
    0-7695-2420-6
  • Type
    conf
  • DOI
    10.1109/ICDAR.2005.226
  • Filename
    1575662