DocumentCode
760119
Title
Imaged document text retrieval without OCR
Author
Tan, Chew Lim ; Huang, Weihua ; Yu, Zhaohui ; Xu, Yi
Author_Institution
Sch. of Comput., Univ. of Singapore, Kent Ridge, Singapore
Volume
24
Issue
6
fYear
2002
fDate
6/1/2002 12:00:00 AM
Firstpage
838
Lastpage
844
Abstract
We propose a method for text retrieval from document images without the use of OCR. Documents are segmented into character objects. Image features, namely the vertical traverse density (VTD) and horizontal traverse density (HTD), are extracted. An n-gram-based document vector is constructed for each document based on these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese as well as images from the UW1 (University of Washington 1) database confirms the validity of the proposed method.
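The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions the record does not confirm: that a traverse density is the number of background-to-foreground transitions along each scan line of a binarized character image, that each character object has already been mapped to a discrete class label by some quantization step (not shown here), and that the dot product is taken over length-normalized n-gram count vectors. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def traverse_density(lines):
    """Count 0 -> 1 (background -> stroke) transitions along each scan line."""
    counts = []
    for line in lines:
        previous = 0
        crossings = 0
        for pixel in line:
            if pixel == 1 and previous == 0:
                crossings += 1
            previous = pixel
        counts.append(crossings)
    return counts


def htd(image):
    """Horizontal traverse density: one value per row (assumed definition)."""
    return traverse_density(image)


def vtd(image):
    """Vertical traverse density: one value per column (assumed definition)."""
    return traverse_density(list(zip(*image)))


def ngram_vector(labels, n=3):
    """Sparse n-gram count vector over a sequence of character class labels.

    The labels stand in for whatever classes the character objects are
    quantized into from their VTD/HTD features; that step is not shown.
    """
    return Counter(tuple(labels[i:i + n]) for i in range(len(labels) - n + 1))


def similarity(a, b):
    """Dot product of two n-gram vectors, normalized by their lengths.

    Normalization is an assumption; the abstract only mentions a dot product.
    """
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# A 3x3 binary glyph shaped roughly like the letter "H".
glyph = [
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
]
glyph_htd = htd(glyph)
glyph_vtd = vtd(glyph)

doc_a = ngram_vector(["a", "b", "a", "b"], n=2)
doc_b = ngram_vector(["c", "d", "c"], n=2)
```

Because the vectors are built from class labels derived from image features rather than from recognized characters, two documents can be compared without any OCR step; the sketch only shows the shape of that computation.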
Keywords
document image processing; feature extraction; image segmentation; information retrieval; vectors; visual databases; Chinese-language text; English-language text; UW1 database; character objects; document image analysis; document segmentation; document vector dot product; horizontal traverse density; image feature extraction; imaged document text retrieval; imaged textual document corpora; n-gram-based document vector; text similarity; vertical traverse density; Computer Society; Humans; Image analysis; Image databases; Image retrieval; Image segmentation; Natural languages; Optical character recognition software; Spatial databases; Testing;
fLanguage
English
Journal_Title
Pattern Analysis and Machine Intelligence, IEEE Transactions on
Publisher
IEEE
ISSN
0162-8828
Type
jour
DOI
10.1109/TPAMI.2002.1008389
Filename
1008389