DocumentCode :
2566853
Title :
Monothetic separation of Telugu, Hindi and English text lines from a multi script document
Author :
Padma, M.C. ; Vijaya, P.A.
Author_Institution :
Dept. of E. & C. Eng., Malnad Coll. of Eng., Hassan, India
fYear :
2009
fDate :
11-14 Oct. 2009
Firstpage :
4870
Lastpage :
4875
Abstract :
In a multi-script, multilingual environment, a document may contain text lines in more than one script or language. The different script regions of such a document must be identified so that each region can be fed to the OCR system for its language. In this context, this paper proposes a monothetic algorithmic model to identify and separate text lines of Telugu, Hindi and English scripts in a printed multilingual document. The proposed method uses the distinctive features of the target script and searches for text lines that possess the anticipated features. The experiments used 1500 text lines for learning and 900 text lines for testing. The reported performance is 98.5%.
Keywords :
document image processing; optical character recognition; text analysis; English text line; monothetic algorithm; monothetic separation; multi script document; multilingual document; optical character recognition; script/language form; Context modeling; Cybernetics; Educational institutions; Feeds; Image analysis; Natural languages; Optical character recognition software; Text analysis; Text recognition; USA Councils; Feature extraction; Monothetic Classifier; Multi-script multi-lingual document; Script Identification;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Systems, Man and Cybernetics, 2009. SMC 2009. IEEE International Conference on
Conference_Location :
San Antonio, TX
ISSN :
1062-922X
Print_ISBN :
978-1-4244-2793-2
Electronic_ISBN :
1062-922X
Type :
conf
DOI :
10.1109/ICSMC.2009.5346045
Filename :
5346045