DocumentCode :
2483805
Title :
Stop word detection in compressed textual images: An experiment on indic script documents
Author :
Garain, Utpal ; Das, Amit Kumar
Author_Institution :
CVPR Unit, ISI, Kolkata
fYear :
2008
fDate :
8-11 Dec. 2008
Firstpage :
1
Lastpage :
4
Abstract :
This work addresses stop word detection for the retrieval of document images in the compressed domain. Algorithms are presented to identify text lines and words, and to cluster similar words so that word occurrence frequencies can be counted. A list of words with their occurrence frequencies is generated from a corpus of textual images. Because stop words in any language occur with high frequency, they occupy the upper positions in the sorted word list. Experiments have been carried out on two major Indic scripts, Devanagari (Hindi) and Bangla. Test results on 150 document images, containing about 12,000 words in each script, show the promising potential of the proposed approach.
Keywords :
data compression; document image processing; image retrieval; image texture; natural languages; pattern clustering; text analysis; compressed textual image; document image retrieval; indic script document; stop word detection; text word identification; word cluster; word occurrence frequency; Clustering algorithms; Dictionaries; Frequency; Image coding; Image retrieval; Intersymbol interference; Labeling; Prototypes; Scattering; Streaming media;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Pattern Recognition, 2008. ICPR 2008. 19th International Conference on
Conference_Location :
Tampa, FL
ISSN :
1051-4651
Print_ISBN :
978-1-4244-2174-9
Electronic_ISBN :
1051-4651
Type :
conf
DOI :
10.1109/ICPR.2008.4761529
Filename :
4761529