• DocumentCode
    760119
  • Title

    Imaged document text retrieval without OCR

  • Author

    Tan, Chew Lim ; Huang, Weihua ; Yu, Zhaohui ; Xu, Yi

  • Author_Institution
    Sch. of Comput., Univ. of Singapore, Kent Ridge, Singapore
  • Volume
    24
  • Issue
    6
  • fYear
    2002
  • fDate
    6/1/2002
  • Firstpage
    838
  • Lastpage
    844
  • Abstract
    We propose a method for text retrieval from document images without the use of OCR. Documents are segmented into character objects. Image features, namely the vertical traverse density (VTD) and horizontal traverse density (HTD), are extracted. An n-gram-based document vector is constructed for each document based on these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese, as well as images from the UW1 (University of Washington 1) database, confirms the validity of the proposed method.
  • Keywords
    document image processing; feature extraction; image segmentation; information retrieval; vectors; visual databases; Chinese-language text; English-language text; UW1 database; character objects; document image analysis; document segmentation; document vector dot product; horizontal traverse density; image feature extraction; imaged document text retrieval; imaged textual document corpora; n-gram-based document vector; text similarity; vertical traverse density; Computer Society; Humans; Image analysis; Image databases; Image retrieval; Image segmentation; Natural languages; Optical character recognition software; Spatial databases; Testing
  • fLanguage
    English
  • Journal_Title
    Pattern Analysis and Machine Intelligence, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0162-8828
  • Type

    jour

  • DOI
    10.1109/TPAMI.2002.1008389
  • Filename
    1008389
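
The pipeline summarized in the abstract (character objects, traverse-density features, n-gram document vectors, dot-product similarity) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that a traverse density counts background-to-ink transitions along each scan line of a binary glyph, that a character is identified by its exact (HTD, VTD) pair, and that vectors are scaled to unit length before the dot product. The function names `traverse_density`, `char_code`, `document_vector`, and `similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def traverse_density(lines):
    """Count background-to-ink transitions along each scan line (1 = ink)."""
    return tuple(
        sum(1 for prev, cur in zip([0] + list(line), line) if prev == 0 and cur == 1)
        for line in lines
    )


def char_code(glyph):
    """Assumed character signature: the (HTD, VTD) pair of a binary glyph."""
    htd = traverse_density(glyph)               # scan each row left to right
    vtd = traverse_density(list(zip(*glyph)))   # scan each column top to bottom
    return (htd, vtd)


def document_vector(glyphs, n=3):
    """Unit-length n-gram count vector over character signatures (no OCR)."""
    codes = [char_code(g) for g in glyphs]
    grams = Counter(tuple(codes[i:i + n]) for i in range(len(codes) - n + 1))
    norm = sqrt(sum(c * c for c in grams.values()))
    # Normalization is an assumption; the abstract only specifies a dot product.
    return {g: c / norm for g, c in grams.items()} if norm else {}


def similarity(a, b):
    """Dot product of two sparse document vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())


# Two toy 3x3 glyphs standing in for segmented character objects.
GLYPH_H = [[1, 0, 1], [1, 1, 1], [1, 0, 1]]
GLYPH_T = [[1, 1, 1], [0, 1, 0], [0, 1, 0]]

doc_a = document_vector([GLYPH_H, GLYPH_T, GLYPH_H, GLYPH_T], n=2)
doc_b = document_vector([GLYPH_T, GLYPH_T, GLYPH_T, GLYPH_T], n=2)
print(similarity(doc_a, doc_a), similarity(doc_a, doc_b))
```

Exact matching of signatures is the simplest possible choice and would be brittle on real scans, where noise changes the transition counts; a practical system would need some form of tolerant matching or clustering of character objects.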