DocumentCode :
2530130
Title :
Tools for enabling digital access to multi-lingual Indic documents
Author :
Govindaraju, Venu ; Khedekar, Swapnil ; Kompalli, Suryaprakash ; Farooq, Faisal ; Setlur, Srirangaraj ; Vemulapati, Ramanaprasad
Author_Institution :
Univ. at Buffalo, NY, USA
fYear :
2004
fDate :
2004
Firstpage :
122
Lastpage :
133
Abstract :
We present methodologies for three important tasks that will eventually enable digital access to multilingual Indian document images. First, we describe several document image analysis techniques necessary to prepare Devanagari document images for OCR. The second task is OCR for machine-printed Devanagari words without the help of a lexicon. We describe the OCR methodology and show how it is being extended to other Indian languages. Finally, we describe a versatile platform that facilitates automatic segmentation of document images in multiple Indian languages and provides an interface to capture the ground truth corresponding to the text. We use transliterated English text and virtual keyboards in a range of Indian languages for this purpose. The multilingual data entry capabilities of the tool and its underlying Unicode data representation within a structured XML document also allow users to annotate passages of text in one language with text in other languages, using a markup scheme to switch between scripts. Text and annotations are rendered in the appropriate scripts as the text is being annotated, giving users prompt and natural feedback. The XML back-end allows metadata describing the annotated document to be recorded.
Keywords :
XML; data structures; document image processing; image segmentation; meta data; natural languages; optical character recognition; text analysis; user interfaces; Devanagari document image; Indian language; OCR; UNICODE data representation; XML document; document image analysis; meta-data; multilingual Indian document image; multilingual Indic document; transliterated English text; virtual keyboard; Conferences; Image analysis; Software libraries; Text analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Image Analysis for Libraries, 2004. Proceedings. First International Workshop on
Print_ISBN :
0-7695-2088-X
Type :
conf
DOI :
10.1109/DIAL.2004.1263244
Filename :
1263244