• DocumentCode
    2060912
  • Title

    Language identification of on-line documents using word shapes

  • Author

    Nobile, Nicola ; Bergler, Sabine ; Suen, Ching Y. ; Khoury, Sami

  • Author_Institution
    Centre for Pattern Recognition & Machine Intelligence, Concordia Univ., Montreal, Que., Canada
  • Volume
    1
  • fYear
    1997
  • fDate
    18-20 Aug 1997
  • Firstpage
    258
  • Abstract
    The authors have extended existing methods to identify the language of an on-line document after the characters have been coded using 10 character classes based on visual characteristics. In particular, they exploit word bigrams and trigrams in both a linear combination of score values and an expert systems approach. Knowledge about each language is acquired from a large number of on-line texts. Using a small set of rules, the expert system outperforms the linear combination in accuracy and shows more stability when parameter settings are varied.
  • Keywords
    document image processing; expert systems; identification; knowledge acquisition; optical character recognition; stability; accuracy; character classes; coded characters; expert system; language identification; linear score value combination; on-line documents; on-line texts; rules; varied parameter settings; visual characteristics; word bigrams; word shapes; word trigrams; Degradation; Entropy; Frequency measurement; Internet; Machine intelligence; Optical character recognition software; Pattern recognition; Shape
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 1997., Proceedings of the Fourth International Conference on
  • Conference_Location
    Ulm
  • Print_ISBN
    0-8186-7898-4
  • Type

    conf

  • DOI
    10.1109/ICDAR.1997.619852
  • Filename
    619852
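
The linear-combination approach named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the five shape classes below are hypothetical stand-ins for the paper's 10 character classes (which this record does not list), and the weights, function names, and training strings are assumptions chosen only for illustration. Each word is coded as a string of shape classes, per-language frequencies of word-shape unigrams, bigrams, and trigrams are collected, and a document is scored by a weighted sum of the average frequencies of its n-grams.

```python
from collections import Counter

# Hypothetical shape classes, NOT the paper's 10 classes:
# "A" tall (ascender, capital, digit), "g" descender, "i" dotted,
# "x" other letters, "-" anything else.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def shape_code(word):
    """Code a word as one shape-class symbol per character."""
    out = []
    for ch in word:
        if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            out.append("A")
        elif ch in DESCENDERS:
            out.append("g")
        elif ch in "ij":
            out.append("i")
        elif ch.isalpha():
            out.append("x")
        else:
            out.append("-")
    return "".join(out)


def ngrams(tokens, n):
    """Return all consecutive n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train(text):
    """Relative frequencies of word-shape 1-, 2-, and 3-grams in a sample."""
    shapes = [shape_code(w) for w in text.split()]
    model = {}
    for n in (1, 2, 3):
        counts = Counter(ngrams(shapes, n))
        total = sum(counts.values())
        model[n] = {g: c / total for g, c in counts.items()} if total else {}
    return model


def score(text, model, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the mean model frequency of the text's n-grams."""
    shapes = [shape_code(w) for w in text.split()]
    total = 0.0
    for n, weight in zip((1, 2, 3), weights):
        grams = ngrams(shapes, n)
        if grams:
            mean = sum(model[n].get(g, 0.0) for g in grams) / len(grams)
            total += weight * mean
    return total


def identify(text, models, weights=(0.5, 0.3, 0.2)):
    """Return the key of the model with the highest combined score."""
    return max(models, key=lambda lang: score(text, models[lang], weights))
```

With these assumed classes, `shape_code("dog")` yields `"Axg"`. In practice each model would be trained on a large sample of text per language, and the weights would need tuning; the abstract reports that the linear combination is sensitive to such parameter settings.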