• DocumentCode
    2142288
  • Title
    Browsing Heterogeneous Document Collections by a Segmentation-Free Word Spotting Method

  • Author
    Rusiñol, Marçal ; Aldavert, David ; Toledo, Ricardo ; Lladós, Josep

  • Author_Institution
    Dept. Cienc. de la Computacio, Univ. Autonoma de Barcelona, Bellaterra, Spain
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    63
  • Lastpage
    67
  • Abstract
    In this paper, we present a segmentation-free word spotting method that is able to deal with heterogeneous document image collections. We propose a patch-based framework where patches are represented by a bag-of-visual-words model powered by SIFT descriptors. A later refinement of the feature vectors is performed by applying the latent semantic indexing technique. The proposed method performs well on both handwritten and typewritten historical document images. We have also tested our method on documents written in non-Latin scripts.
  • Keywords
    document image processing; feature extraction; handwriting recognition; indexing; word processing; SIFT descriptors; bag of visual word model; feature vectors; handwritten historical document images; heterogeneous document image collections; latent semantic indexing technique; nonLatin scripts; patch based framework; segmentation free word spotting method; typewritten historical document images; Feature extraction; Hidden Markov models; Image segmentation; Indexing; Large scale integration; Semantics; Visualization; Dense SIFT Features; Heterogeneous Document Collections; Latent Semantic Indexing; Word Spotting;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2011.22
  • Filename
    6065277
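
The patch representation named in the abstract (patches described by a bag-of-visual-words model and compared against a query) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`nearest_word`, `bovw_histogram`, `rank_patches`), the toy 2-D descriptors, the fixed codebook, and the cosine-similarity ranking are all assumptions made for illustration; real SIFT descriptors are 128-dimensional and would come from an image library. The latent semantic indexing refinement mentioned in the abstract is not shown here.

```python
import math


def nearest_word(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor
    (squared Euclidean distance). Hypothetical helper."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(codebook):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bovw_histogram(descriptors, codebook):
    """Build an L2-normalised bag-of-visual-words histogram for one patch."""
    hist = [0.0] * len(codebook)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, codebook)] += 1.0
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist


def rank_patches(query_descriptors, patches, codebook):
    """Rank patch indices by cosine similarity to the query histogram.
    Histograms are L2-normalised, so the dot product is the cosine."""
    query = bovw_histogram(query_descriptors, codebook)
    scored = []
    for index, patch_descriptors in enumerate(patches):
        hist = bovw_histogram(patch_descriptors, codebook)
        score = sum(q * h for q, h in zip(query, hist))
        scored.append((score, index))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [index for _, index in scored]


# Toy data (assumed): two visual words and two candidate patches.
codebook = [[0.0, 0.0], [10.0, 10.0]]
query = [[0.0, 1.0], [1.0, 0.0], [9.0, 9.0]]
patches = [
    [[10.0, 10.0], [9.0, 10.0]],            # only the second visual word
    [[0.0, 0.0], [1.0, 1.0], [10.0, 9.0]],  # same word mix as the query
]
ranking = rank_patches(query, patches, codebook)
```

In this toy setup the second patch shares the query's visual-word distribution and is therefore ranked first. A fuller pipeline would learn the codebook by clustering descriptors and could project the histograms into a lower-dimensional space before comparison.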