• DocumentCode
    2836718
  • Title

    Automatic script identification from images using cluster-based templates

  • Author

    Hochberg, Judith ; Kerns, Lila ; Kelly, Patrick ; Thomas, Timothy

  • Author_Institution
    Dept. of Comput. Res., Los Alamos Nat. Lab., NM, USA
  • Volume
    1
  • fYear
    1995
  • fDate
    14-16 Aug 1995
  • Firstpage
    378
  • Abstract
    We describe a system that automatically identifies the script used in documents stored electronically in image form. The system can learn to distinguish any number of scripts. It develops a set of representative symbols (templates) for each script by clustering textual symbols from a set of training documents and representing each cluster by its centroid. “Textual symbols” include discrete characters in scripts such as Cyrillic, as well as adjoined characters, character fragments, and whole words in connected scripts such as Arabic. To identify a new document's script, the system compares a subset of symbols from the document to each script's templates, screening out rare or unreliable templates, and choosing the script whose templates provide the best match. Our current system, trained on thirteen scripts, correctly identifies all test documents except those printed in fonts that differ markedly from fonts in the training set.
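    The pipeline summarized in the abstract (cluster training symbols, keep each cluster's centroid as a template, screen out rare templates, pick the script whose templates match best) can be illustrated with a minimal sketch. Everything concrete here is an assumption, not the authors' implementation: symbols are taken to be fixed-length feature vectors, clustering is a simple greedy "leader" scheme with a hypothetical `radius` threshold, rarity is a hypothetical `min_count` cutoff, and the match score is the mean Euclidean distance to the nearest template. The abstract's "unreliable template" screening is not modeled.

```python
from math import dist  # Euclidean distance, Python 3.8+


def build_templates(symbols, radius, min_count=2):
    """Cluster symbol vectors greedily and return centroids as templates.

    Assumption: a symbol joins the nearest existing cluster whose centroid
    lies within `radius`; otherwise it starts a new cluster. Clusters with
    fewer than `min_count` members are treated as rare and dropped.
    """
    clusters = []  # each entry: [component-wise sum of members, member count]
    for s in symbols:
        best, best_d = None, radius
        for c in clusters:
            centroid = [v / c[1] for v in c[0]]
            d = dist(s, centroid)
            if d <= best_d:
                best, best_d = c, d
        if best is None:
            clusters.append([list(s), 1])
        else:
            best[0] = [a + b for a, b in zip(best[0], s)]
            best[1] += 1
    return [[v / n for v in total] for total, n in clusters if n >= min_count]


def identify_script(symbols, templates_by_script):
    """Return the script whose templates lie closest to the symbols on average."""
    def score(templates):
        return sum(min(dist(s, t) for t in templates) for s in symbols) / len(symbols)

    # Scripts left with no templates after screening cannot be scored.
    candidates = {name: t for name, t in templates_by_script.items() if t}
    if not candidates or not symbols:
        raise ValueError("need at least one symbol and one non-empty template set")
    return min(candidates, key=lambda name: score(candidates[name]))


# Toy two-dimensional "symbols"; real symbols would be image-derived vectors.
training = {
    "script_a": [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)],
    "script_b": [(10.0, 10.0), (10.1, 10.0), (10.0, 9.9)],
}
templates = {name: build_templates(syms, radius=1.0) for name, syms in training.items()}
guess = identify_script([(0.2, 0.1), (0.0, 0.0)], templates)
```

    In this toy data the isolated point (5.0, 5.0) forms a single-member cluster and is screened out, which mirrors the idea of discarding rare templates before matching.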
  • Keywords
    image recognition; optical character recognition; Arabic; Cyrillic; automatic script identification; centroid; character fragments; cluster-based templates; connected scripts; discrete characters; fonts; image form; representative symbols; scripts; templates; textual symbols; training documents; training set; whole words; Assembly; Degradation; Indexing; Laboratories; Natural languages; Optical character recognition software; Shape; Testing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 1995., Proceedings of the Third International Conference on
  • Conference_Location
    Montreal, Que.
  • Print_ISBN
    0-8186-7128-9
  • Type

    conf

  • DOI
    10.1109/ICDAR.1995.599017
  • Filename
    599017