DocumentCode :
3023204
Title :
An approach for stemming in symbolically compressed Indian language imaged documents
Author :
Garain, Utpal ; Datta, Alok Kumar
Author_Institution :
Comput. Vision & Pattern Recognition Unit, Indian Stat. Inst., Kolkata, India
fYear :
2005
fDate :
29 Aug.-1 Sept. 2005
Firstpage :
1080
Abstract :
Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots, thereby improving overall retrieval efficiency. This paper presents a stemming algorithm for a document image retrieval system. The algorithm assumes that the documents are symbolically compressed and performs stemming in the compressed domain itself. Experiments were conducted on Indian language imaged documents, for which efficient OCR remains a challenging task. Results obtained from a set of 150 document images (in Bangla script, the second most popular script in the Indian subcontinent), comprising about 12K words, show promising performance for the proposed approach.
Keywords :
document handling; image retrieval; natural languages; optical character recognition; Bangla script; Indian language; compressed documents; document image retrieval system; information retrieval system; stemming algorithm; Character recognition; Computer vision; Image coding; Image retrieval; Image storage; Information retrieval; Internet; Optical character recognition software; Pattern recognition; Search engines
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on
ISSN :
1520-5263
Print_ISBN :
0-7695-2420-6
Type :
conf
DOI :
10.1109/ICDAR.2005.45
Filename :
1575710