• DocumentCode
    1063822
  • Title
    N-tuple features for OCR revisited
  • Author
    Jung, Dz Mou ; Krishnamoorthy, M.S. ; Nagy, George ; Shapira, Andrew
  • Author_Institution
    Caere Corp., Los Gatos, CA, USA
  • Volume
    18
  • Issue
    7
  • fYear
    1996
  • fDate
    7/1/1996 12:00:00 AM
  • Firstpage
    734
  • Lastpage
    745
  • Abstract
    N-tuple features for optical character recognition have received only scattered attention since the 1960s. Our main purpose is to show that advances in computer technology and computer science compel renewed interest. N-tuple features are useful for printed character classification because they indicate the presence or absence of a given rigid configuration of n black and white pixels in a pattern. Desirable n-tuples fit each pattern of a specified (positive) training set of characters in at least p different shift positions, and fail to fit each pattern of a specified (negative) training set by at least n-q pixels in each shift position. We prove that the problem of finding a distinguishing n-tuple is NP-complete, by examining a natural subproblem with binary strings called the missing configuration problem. The NP-completeness result notwithstanding, distinguishing n-tuples are found automatically in a few seconds on contemporary workstations. We exhibit a practical search algorithm for generating, from a small training set, a collection of n-tuples with low class-conditional correlation and with specified design parameters n, p, and q. The generator, which is available on the Internet, is empirically shown to be effective through a comparison with a benchmark generator. We show experimentally that the design parameters provide a useful tradeoff between distinguishing power and generation time, and also between the conditional probabilities for the positive and negative classes. We explore the feature probabilities obtainable for various dichotomies, and show that the design parameters control the feature probabilities.
  • Keywords
    backtracking; computational complexity; decision theory; image classification; optical character recognition; probability; search problems; NP-complete; OCR; binary strings; feature probabilities; missing configuration problem; n-tuple features; printed character classification; search algorithm; Character recognition; Circuits; Decision making; Electronic mail; Feature extraction; Handwriting recognition; Optical character recognition software; Optical sensors; Sea measurements; Software libraries
  • fLanguage
    English
  • Journal_Title
    Pattern Analysis and Machine Intelligence, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0162-8828
  • Type
    jour
  • DOI
    10.1109/34.506795
  • Filename
    506795
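
The fit criterion stated in the abstract can be illustrated with a minimal sketch. This is not the authors' generator or search algorithm; it only checks whether a given n-tuple satisfies the (p, q) conditions as we read them from the abstract. The representation of an n-tuple as (row offset, column offset, required pixel value) triples, the function names, and the reading of "fail to fit by at least n-q pixels" as "at most q pixels agree in every shift position" are all assumptions made for illustration.

```python
from typing import Iterator, List, Sequence, Tuple

# Hypothetical representations, chosen for this sketch only.
Image = Sequence[Sequence[int]]      # binary image: 1 = black, 0 = white
NTuple = List[Tuple[int, int, int]]  # (row offset, col offset, required pixel value)


def shifts(image: Image, ntuple: NTuple) -> Iterator[Tuple[int, int]]:
    """Yield every (row, col) shift that keeps the whole n-tuple inside the image.

    Assumes all offsets are non-negative.
    """
    max_dr = max(dr for dr, _, _ in ntuple)
    max_dc = max(dc for _, dc, _ in ntuple)
    for r in range(len(image) - max_dr):
        for c in range(len(image[0]) - max_dc):
            yield r, c


def matching_pixels(image: Image, ntuple: NTuple, r: int, c: int) -> int:
    """Count the n-tuple pixels that agree with the image at shift (r, c)."""
    return sum(1 for dr, dc, v in ntuple if image[r + dr][c + dc] == v)


def fits_positive(image: Image, ntuple: NTuple, p: int) -> bool:
    """True if all n pixels agree in at least p distinct shift positions."""
    n = len(ntuple)
    exact = sum(1 for r, c in shifts(image, ntuple)
                if matching_pixels(image, ntuple, r, c) == n)
    return exact >= p


def rejects_negative(image: Image, ntuple: NTuple, q: int) -> bool:
    """True if at least n - q pixels disagree (at most q agree) at every shift."""
    return all(matching_pixels(image, ntuple, r, c) <= q
               for r, c in shifts(image, ntuple))


def is_distinguishing(ntuple: NTuple, positives: Sequence[Image],
                      negatives: Sequence[Image], p: int, q: int) -> bool:
    """Check the (p, q) conditions against both training sets."""
    return (all(fits_positive(img, ntuple, p) for img in positives)
            and all(rejects_negative(img, ntuple, q) for img in negatives))
```

Under these assumptions, raising p or lowering q makes the acceptance test stricter, which is one way to picture the tradeoff between distinguishing power and generation time that the abstract reports. The exhaustive scan over shift positions is kept for clarity rather than speed.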