  • DocumentCode
    1635231
  • Title
    Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment
  • Author
    Abdulkader, Ahmad ; Casey, Matthew R.
  • Author_Institution
    Google Inc., Mountain View, CA, USA
  • fYear
    2009
  • Firstpage
    576
  • Lastpage
    580
  • Abstract
    We propose a low-cost method for correcting the output of OCR engines through the use of human labor. The method employs an error-estimator neural network that learns from ground-truth data to assess the error probability of every word. The error estimator uses features computed from the outputs of multiple OCR engines. The estimated error probability is used to decide which words are inspected by humans. The error estimator is trained to optimize the area under the word-error ROC curve, which improves the efficiency of the human correction process. A significant reduction in cost is achieved by clustering similar words together during the correction process. We also show how active learning techniques are used to further improve the efficiency of the error estimator.
  • Keywords
    error correction; estimation theory; learning (artificial intelligence); neural nets; optical character recognition; pattern clustering; OCR errors low cost correction; active learning technique; error estimator neural network; ground truth data; human correction process efficiency; human labor; multi-engine environment; multiple OCR engines output; word error probability output; words clustering; Books; Costs; Error analysis; Error correction; Humans; Machine learning; Neural networks; Optical character recognition software; Search engines; Text analysis; Active Learning; Clustering; Machine Learning; Multiple Engines; OCR Correction;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on
  • Conference_Location
    Barcelona
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4244-4500-4
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2009.242
  • Filename
    5277588