DocumentCode :
2060912
Title :
Language identification of on-line documents using word shapes
Author :
Nobile, Nicola ; Bergler, Sabine ; Suen, Ching Y. ; Khoury, Sami
Author_Institution :
Centre for Pattern Recognition & Machine Intelligence, Concordia Univ., Montreal, Que., Canada
Volume :
1
fYear :
1997
fDate :
18-20 Aug 1997
Firstpage :
258
Abstract :
The authors have extended existing methods to identify the language of an on-line document after the characters have been coded using 10 character classes based on visual characteristics. In particular, they exploit word bigrams and trigrams in both a linear combination of score values and an expert systems approach. Knowledge about each language was acquired from a large number of on-line texts. Using a small set of rules, the expert system outperforms the linear combination in accuracy and shows more stability when parameter settings are varied.
Keywords :
document image processing; expert systems; identification; knowledge acquisition; optical character recognition; stability; accuracy; character classes; coded characters; expert system; knowledge acquisition; language identification; linear score value combination; on-line documents; on-line texts; rules; stability; varied parameter settings; visual characteristics; word bigrams; word shapes; word trigrams; Degradation; Entropy; Expert systems; Frequency measurement; Internet; Machine intelligence; Optical character recognition software; Pattern recognition; Shape; Stability;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1997
Conference_Location :
Ulm
Print_ISBN :
0-8186-7898-4
Type :
conf
DOI :
10.1109/ICDAR.1997.619852
Filename :
619852