DocumentCode
2060912
Title
Language identification of on-line documents using word shapes
Author
Nobile, Nicola ; Bergler, Sabine ; Suen, Ching Y. ; Khoury, Sami
Author_Institution
Centre for Pattern Recognition & Machine Intelligence, Concordia Univ., Montreal, Que., Canada
Volume
1
fYear
1997
fDate
18-20 Aug 1997
Firstpage
258
Abstract
The authors have extended existing methods to identify the language of an on-line document after the characters have been coded using 10 character classes based on visual characteristics. In particular, they exploit word bigrams and trigrams in both a linear combination of score values and an expert systems approach. Knowledge about each language is acquired from a large number of on-line texts. Using a small set of rules, the expert system outperforms the linear combination in accuracy and shows more stability when parameter settings are varied.
Keywords
document image processing; expert systems; identification; knowledge acquisition; optical character recognition; stability; accuracy; character classes; coded characters; expert system; knowledge acquisition; language identification; linear score value combination; on-line documents; on-line texts; rules; stability; varied parameter settings; visual characteristics; word bigrams; word shapes; word trigrams; Degradation; Entropy; Expert systems; Frequency measurement; Internet; Machine intelligence; Optical character recognition software; Pattern recognition; Shape; Stability;
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition, 1997., Proceedings of the Fourth International Conference on
Conference_Location
Ulm
Print_ISBN
0-8186-7898-4
Type
conf
DOI
10.1109/ICDAR.1997.619852
Filename
619852