Imaged document text retrieval without OCR

Author

Tan, Chew Lim ; Huang, Weihua ; Yu, Zhaohui ; Xu, Yi

Author_Institution

Sch. of Comput., Nat. Univ. of Singapore, Kent Ridge, Singapore

Volume

24

Issue

6

fYear

2002

fDate

6/1/2002

Firstpage

838

Lastpage

844

Abstract

We propose a method for text retrieval from document images without the use of OCR. Documents are segmented into character objects. Image features, namely the vertical traverse density (VTD) and horizontal traverse density (HTD), are extracted. An n-gram-based document vector is constructed for each document based on these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese as well as images from the UW1 (University of Washington 1) database confirms the validity of the proposed method.
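
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: that a traverse density is the number of background-to-ink transitions along each scan line of a binarized character object, that characters with identical (VTD, HTD) pairs are treated as the same token (a real system would presumably match features with some tolerance), and that document vectors are normalized to unit length before the dot product. All function names here are hypothetical.

```python
from collections import Counter
from math import sqrt


def count_crossings(line):
    """Count background-to-ink (0 -> 1) transitions along one scan line."""
    crossings, prev = 0, 0
    for pixel in line:
        if pixel and not prev:
            crossings += 1
        prev = pixel
    return crossings


def traverse_densities(bitmap):
    """Return (VTD, HTD) for a binary character bitmap given as a list of rows.

    Assumption: VTD scans each column top to bottom and HTD scans each row
    left to right.
    """
    vtd = tuple(count_crossings(col) for col in zip(*bitmap))
    htd = tuple(count_crossings(row) for row in bitmap)
    return vtd, htd


def ngram_vector(tokens, n=2):
    """Build a unit-length sparse vector of n-gram counts over character tokens."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    norm = sqrt(sum(c * c for c in grams.values()))
    return {g: c / norm for g, c in grams.items()} if norm else {}


def similarity(u, v):
    """Dot product of two sparse document vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


# Two toy 3x3 character objects (1 = ink, 0 = background).
CHAR_H = [[1, 0, 1],
          [1, 1, 1],
          [1, 0, 1]]
CHAR_O = [[1, 1, 1],
          [1, 0, 1],
          [1, 1, 1]]


def document_vector(characters, n=2):
    """Turn a sequence of character bitmaps into an n-gram document vector."""
    tokens = [traverse_densities(ch) for ch in characters]
    return ngram_vector(tokens, n)


doc_a = document_vector([CHAR_H, CHAR_O, CHAR_H, CHAR_O])
doc_b = document_vector([CHAR_O, CHAR_O, CHAR_O])
print(similarity(doc_a, doc_a), similarity(doc_a, doc_b))
```

Because no character is ever recognized, the tokens carry only shape information; two documents score high when the same sequences of character shapes recur in both.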

Keywords

document image processing; feature extraction; image segmentation; information retrieval; vectors; visual databases; Chinese-language text; English-language text; UW1 database; character objects; document image analysis; document segmentation; document vector dot product; horizontal traverse density; image feature extraction; imaged document text retrieval; imaged textual document corpora; n-gram-based document vector; text similarity; vertical traverse density; Computer Society; Humans; Image analysis; Image databases; Image retrieval; Image segmentation; Natural languages; Optical character recognition software; Spatial databases; Testing;

fLanguage

English

Journal_Title

Pattern Analysis and Machine Intelligence, IEEE Transactions on

Publisher

IEEE

ISSN

0162-8828

Type

jour

DOI

10.1109/TPAMI.2002.1008389

Filename

1008389