Title :
Imaged document text retrieval without OCR
Author :
Tan, Chew Lim ; Huang, Weihua ; Yu, Zhaohui ; Xu, Yi
Author_Institution :
Sch. of Comput., Nat. Univ. of Singapore, Kent Ridge, Singapore
fDate :
June 2002
Abstract :
We propose a method for text retrieval from document images without the use of optical character recognition (OCR). Documents are segmented into character objects. Two image features, the vertical traverse density (VTD) and the horizontal traverse density (HTD), are extracted. An n-gram-based document vector is constructed for each document from these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese, as well as images from the UW1 (University of Washington 1) database, confirms the validity of the proposed method.
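The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a traverse density is the number of background-to-ink transitions along each scan line of a binarized character object, that the raw (VTD, HTD) pair serves directly as the character code, and that document vectors are normalized to unit length before the dot product; the abstract specifies none of these details. All function names, the toy glyphs `BAR` and `RING`, and the choice of `n=2` are hypothetical.

```python
from collections import Counter
from math import sqrt


def crossings(line):
    """Count 0 -> 1 transitions along one scan line (strokes crossed)."""
    count, prev = 0, 0
    for pixel in line:
        if pixel and not prev:
            count += 1
        prev = pixel
    return count


def traverse_densities(glyph):
    """Return (VTD, HTD) for a binary glyph given as a list of rows of 0/1.

    Assumption: VTD holds one crossing count per column (vertical scan
    lines) and HTD one per row (horizontal scan lines).
    """
    vtd = tuple(crossings(col) for col in zip(*glyph))
    htd = tuple(crossings(row) for row in glyph)
    return vtd, htd


def document_vector(glyphs, n=2):
    """Build a unit-length n-gram count vector over consecutive glyph codes."""
    codes = [traverse_densities(g) for g in glyphs]
    grams = Counter(tuple(codes[i:i + n]) for i in range(len(codes) - n + 1))
    norm = sqrt(sum(c * c for c in grams.values()))
    return {g: c / norm for g, c in grams.items()} if norm else {}


def similarity(u, v):
    """Dot product of two sparse document vectors stored as dicts."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


# Hypothetical toy glyphs standing in for segmented character objects.
BAR = [[1], [1], [1]]
RING = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]

doc_a = document_vector([BAR, RING, BAR, RING])
doc_b = document_vector([RING, RING, RING, RING])
print(similarity(doc_a, doc_a), similarity(doc_a, doc_b))
```

Using the exact feature tuple as the character code is the simplest possible choice. Real scanned glyphs vary in size and noise, so a practical system would need to normalize the features and group similar character objects into classes before forming n-grams.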
Keywords :
document image processing; feature extraction; image segmentation; information retrieval; vectors; visual databases; Chinese-language text; English-language text; UW1 database; character objects; document image analysis; document segmentation; document vector dot product; horizontal traverse density; image feature extraction; imaged document text retrieval; imaged textual document corpora; n-gram-based document vector; text similarity; vertical traverse density; Computer Society; Humans; Image analysis; Image databases; Image retrieval; Image segmentation; Natural languages; Optical character recognition software; Spatial databases; Testing
Journal_Title :
Pattern Analysis and Machine Intelligence, IEEE Transactions on
DOI :
10.1109/TPAMI.2002.1008389