Title :
Script line separation from Indian multi-script documents
Author :
Pal, U. ; Chaudhuri, B.B.
Author_Institution :
Comput. Vision & Pattern Recognition Unit, Indian Stat. Inst., Calcutta, India
Abstract :
In a multi-lingual country like India, a document page may contain more than one script form. Under the three-language formula, the document may be printed in English, Devnagari and one of the other official Indian languages. For OCR of such a document page, it is necessary to separate these three script forms before feeding them to the OCRs of individual scripts. In this paper, an automatic technique of separating the text lines using script characteristics and shape based features is presented. At present, the system has an overall accuracy of about 98.5%.
Keywords :
document image processing; image segmentation; optical character recognition; Devnagari; English; Indian languages; Indian multi-script documents; OCR; document page; script form; script line separation; shape based features; text lines; three-language formula; Character generation; Computer vision; Natural languages; Optical character recognition software; Optical filters; Pattern recognition; Read only memory; Shape; Writing
Conference_Title :
Document Analysis and Recognition, 1999. ICDAR '99. Proceedings of the Fifth International Conference on
Conference_Location :
Bangalore
Print_ISBN :
0-7695-0318-7
DOI :
10.1109/ICDAR.1999.791810