• DocumentCode
    2186987
  • Title
    Omnifont and unlimited-vocabulary OCR for English and Arabic
  • Author
    Bazzi, Issam; LaPre, Chris; Makhoul, John; Raphael, Chris; Schwartz, Richard
  • Author_Institution
    BBN Corp., Cambridge, MA, USA
  • Volume
    2
  • fYear
    1997
  • fDate
    18-20 Aug 1997
  • Firstpage
    842
  • Abstract
    The authors present a set of techniques for omnifont, unlimited-vocabulary OCR, within the context of a system based on hidden Markov models (HMMs). First, they address the issue of how to perform OCR on omnifont and multi-style data, such as plain and italic, without the need to have a separate model for each style. The amount of training data from each style, which is used to train a single model, becomes an important issue in the face of the conditional independence assumption inherent in the use of HMMs. They demonstrate mathematically and empirically how to allocate training data among the different styles to alleviate this problem. Second, they show how to use a word-based HMM system to perform character recognition with unlimited vocabulary. The method includes the use of a trigram language model on character sequences. Using all these techniques, they have achieved character error rates of 1.1% on data from the University of Washington English Document Image Database and 3.3% on data from the DARPA Arabic OCR Corpus.
  • Keywords
    character sets; document image processing; hidden Markov models; optical character recognition; sequences; speech recognition; Arabic; DARPA Arabic OCR Corpus; English; University of Washington English Document Image Database; character error rates; character recognition; character sequences; conditional independence assumption; hidden Markov models; italic data; omnifont OCR; plain data; training data; training data allocation; trigram language model; unlimited-vocabulary OCR; word-based HMM system; Automatic speech recognition; Character recognition; Data mining; Feature extraction; Hidden Markov models; Natural languages; Optical character recognition software; Speech recognition; Training data; Vocabulary;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 1997., Proceedings of the Fourth International Conference on
  • Conference_Location
    Ulm
  • Print_ISBN
    0-8186-7898-4
  • Type
    conf
  • DOI
    10.1109/ICDAR.1997.620630
  • Filename
    620630