  • DocumentCode
    1522808
  • Title
    An omnifont open-vocabulary OCR system for English and Arabic
  • Author
    Bazzi, Issam; Schwartz, Richard; Makhoul, John
  • Author_Institution
    BBN Syst. & Technol. Corp., Cambridge, MA, USA
  • Volume
    21
  • Issue
    6
  • fYear
    1999
  • fDate
    6/1/1999 12:00:00 AM
  • Firstpage
    495
  • Lastpage
    504
  • Abstract
    We present an omnifont, unlimited-vocabulary OCR system for English and Arabic. The system is based on hidden Markov models (HMMs), an approach that has proven to be very successful in the area of automatic speech recognition. We focus on two aspects of the OCR system. First, we address the issue of how to perform OCR on omnifont and multi-style data, such as plain and italic, without the need to have a separate model for each style. The amount of training data from each style, which is used to train a single model, becomes an important issue in the face of the conditional independence assumption inherent in the use of HMMs. We demonstrate mathematically and empirically how to allocate training data among the different styles to alleviate this problem. Second, we show how to use a word-based HMM system to perform character recognition with unlimited vocabulary. The method includes the use of a trigram language model on character sequences. Using all these techniques, we have achieved character error rates of 1.1 percent on data from the University of Washington English Document Image Database and 3.3 percent on data from the DARPA Arabic OCR Corpus.
  • Keywords
    feature extraction; hidden Markov models; optical character recognition; probability; Arabic; DARPA Arabic OCR Corpus; English; University of Washington English Document Image Database; character sequences; omnifont open-vocabulary OCR system; training data; trigram language model; word-based HMM system; Automatic speech recognition; Character recognition; Error analysis; Handwriting recognition; Hidden Markov models; Natural languages; Optical character recognition software; Speech recognition; Training data; Vocabulary
  • fLanguage
    English
  • Journal_Title
    Pattern Analysis and Machine Intelligence, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0162-8828
  • Type
    jour
  • DOI
    10.1109/34.771314
  • Filename
    771314