• DocumentCode
    478629
  • Title

    Re-targetable OCR with Intelligent Character Segmentation

  • Author

    Agrawal, Mudit ; Doermann, David

  • fYear
    2008
  • fDate
    16-19 Sept. 2008
  • Firstpage
    183
  • Lastpage
    190
  • Abstract
    We have developed a font-model-based intelligent character segmentation and recognition system. Using characteristics of structurally similar TrueType fonts, our system automatically builds a model to be used for the segmentation and recognition of the new script, independent of glyph composition. The key is a reliance on known font attributes. In our system, three feature extraction methods are used to demonstrate the importance of appropriate features for classification. The methods are tested on both Latin (English) and non-Latin (Khmer) scripts. Results show that the character-level recognition accuracy exceeds 92% for Khmer and 96% for English on degraded documents. This work is a step toward the recognition of scripts of low-density languages which typically do not warrant the development of commercial OCR, yet often have complete TrueType font descriptions.
  • Keywords
    Data mining; Databases; Finance; Humans; Information analysis; Neural networks; Optical character recognition software; Tagging; Text analysis; Retargetable; intelligent character segmentation; syllabic scripts; Khmer
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis Systems, 2008. DAS '08. The Eighth IAPR International Workshop on
  • Conference_Location
    Nara, Japan
  • Print_ISBN
    978-0-7695-3337-7
  • Type

    conf

  • DOI
    10.1109/DAS.2008.67
  • Filename
    4669960