DocumentCode
1636932
Title
High Performance Chinese/English Mixed OCR with Character Level Language Identification
Author
Wang, Kai ; Jin, Jianming ; Wang, Qingren
Author_Institution
Inst. of Machine Intell., Nankai Univ., Tianjin, China
fYear
2009
Firstpage
406
Lastpage
410
Abstract
Several high-performance OCR products currently exist for Chinese or for English. However, no single OCR technique suits both languages, because Chinese and English differ greatly. Meanwhile, the number of Chinese/English mixed documents is growing rapidly with globalization, so Chinese/English mixed document processing is an important problem to study. The key problem is how to split a mixed document into a Chinese part and an English part, so that a different OCR technique can be applied to each. To further improve the performance of the previous system, a novel Chinese/English split algorithm based on global information is proposed, and a rule for language identification is derived from the Bayesian formula. In experiments, the system error rate drops from 1.52% to 0.87% on magazine samples and from 1.32% to 0.75% on book samples; more than 2/5 of the errors are eliminated, which provides experimental support for this work.
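The "more than 2/5 of errors" figure can be checked directly from the four error rates quoted in the abstract. The short Python sketch below does only that arithmetic; the function name `relative_error_reduction` and the dictionary layout are illustrative choices, not taken from the paper. Both sample sets come out at roughly 43%, which is consistent with the stated fraction.

```python
def relative_error_reduction(before: float, after: float) -> float:
    """Return the fraction of errors removed when the rate drops from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Error rates in percent, exactly as quoted in the abstract.
reported = {"magazine": (1.52, 0.87), "book": (1.32, 0.75)}

for name, (before, after) in reported.items():
    reduction = relative_error_reduction(before, after)
    # The abstract claims more than 2/5 (0.4) of the errors are removed.
    print(f"{name}: {reduction:.1%} of errors removed, above 2/5: {reduction > 0.4}")
```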
Keywords
Bayes methods; document image processing; error statistics; image segmentation; natural languages; optical character recognition; Bayesian formula; Chinese/English split algorithm; book sample; character segmentation; character-level language identification rule; error rate; global information; high-performance Chinese/English mixed OCR technique; magazine sample; Bayesian methods; Character recognition; Globalization; Machine intelligence; Optical character recognition software; Performance analysis; System performance; Text analysis; Language Identification; Multi-lingual OCR
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on
Conference_Location
Barcelona
ISSN
1520-5363
Print_ISBN
978-1-4244-4500-4
Electronic_ISBN
1520-5363
Type
conf
DOI
10.1109/ICDAR.2009.14
Filename
5277652