  • DocumentCode
    1634740
  • Title
    A Self-Adaptive Method for Extraction of Document-Specific Alphabets
  • Author
    Pletschacher, Stefan
  • Author_Institution
    Pattern Recognition & Image Analysis (PRImA) Research Lab, University of Salford, Salford, UK
  • fYear
    2009
  • Firstpage
    656
  • Lastpage
    660
  • Abstract
    Recognizing and encoding digitized historical documents remains a challenging task. A major problem is the occurrence of unknown glyphs and symbols that may not even exist in modern alphabets. Current pre-trained OCR methods rarely deliver usable results for such documents. This paper describes an alternative approach and framework for handling printed historical documents without restrictions on the alphabets or fonts they contain. The basic idea is to derive all information required for encoding directly from the document itself. This is achieved by extracting a document-specific prototype alphabet of locatable glyphs. The core of the system is a customized clustering method that adapts automatically to new documents by determining appropriate threshold parameters from the characteristics of the glyphs. As a result, the system can run without manual intervention and can be integrated into automated mass digitization workflows.
  • Keywords
    document image processing; encoding; feature extraction; history; optical character recognition; pattern clustering; ancient glyph; automated mass digitization workflow; customized clustering method; digitized historical document encoding; digitized historical document recognition; document-specific prototype alphabet extraction; feature extraction; pre-trained OCR-method; printed historical document handling; self-adaptive method; threshold parameter; Character recognition; Data mining; Dictionaries; Document handling; Encoding; Image analysis; Image recognition; Optical character recognition software; Pattern recognition; Prototypes; Clustering; Encoding; Mass Digitization; OCR
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on
  • Conference_Location
    Barcelona
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4244-4500-4
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2009.253
  • Filename
    5277564