Title :
Character N-Gram Spotting on Handwritten Documents Using Weakly-Supervised Segmentation
Author :
Roy, Utpal ; Sankaran, Naveen ; Sankar, K. Pramod ; Jawahar, C.V.
Author_Institution :
Center for Visual Inf. Technol., IIIT Hyderabad, Hyderabad, India
Abstract :
In this paper, we present a solution towards building a retrieval system over handwritten document images that i) is recognition-free, ii) allows text-querying, iii) can retrieve at the sub-word level, and iv) can search for out-of-vocabulary words. Unlike previous approaches that operate at either the character or the word level, we use character n-gram images (CNG-img) as the retrieval primitive. CNG-img are sequences of character segments that are represented and matched in the image-space. The word-images are then treated as a bag-of-CNG-img that can be indexed and matched in the feature space. This allows for recognition-free search (query-by-example), which can retrieve morphologically similar words that have matching sub-words. Further, to enable query-by-keyword, we build an automated scheme to generate labeled exemplars for characters and character n-grams from unconstrained handwritten documents. We pose this problem as one of weakly-supervised learning, where character/n-gram labeling is obtained automatically from the word labels. The resulting retrieval system can answer queries from an unlimited vocabulary. The approach is demonstrated on the George Washington collection; results show a major improvement in retrieval performance compared to word-recognition and word-spotting methods.
Keywords :
handwritten character recognition; image matching; image representation; image retrieval; image segmentation; CNG-img retrieval primitive; George Washington collection; character N-gram spotting; handwritten document image; handwritten documents; image retrieval system; image-space matching; image-space representation; out-of-vocabulary words; query-by-example; query-by-keyword; recognition-free system; sub-word level retrieval; text-querying; weakly-supervised segmentation; word-recognition method; word-spotting method; Character recognition; Handwriting recognition; Hidden Markov models; Image segmentation; Indexes; Labeling; Training;
Conference_Title :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/ICDAR.2013.120