• DocumentCode
    3340098
  • Title
    Feature Extraction for Document Image Segmentation by pLSA Model
  • Author
    Yamaguchi, Takuma ; Maruyama, Minoru
  • Author_Institution
    Dept. of Inf. Eng., Shinshu Univ., Nagano
  • fYear
    2008
  • fDate
    16-19 Sept. 2008
  • Firstpage
    53
  • Lastpage
    60
  • Abstract
    In this paper, we propose a method for document image segmentation based on the pLSA (probabilistic latent semantic analysis) model. The pLSA model was originally developed for topic discovery in text analysis using a "bag-of-words" document representation. The model is also useful for image analysis when images are given a "bag-of-visual-words" representation. The performance of the method depends on the visual vocabulary generated by feature extraction from the document image. We compare several feature extraction and description methods and examine how they relate to segmentation performance. Through experiments, we show that the pLSA-based method makes accurate content-based document segmentation possible.
  • Keywords
    document image processing; feature extraction; image representation; image segmentation; text analysis; document image segmentation; document representation; probabilistic latent semantic analysis; topic discovery; visual vocabulary; Engines; Graphical models; Image analysis; Information analysis; Optical character recognition software; Vocabulary; topic model
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis Systems, 2008. DAS '08. The Eighth IAPR International Workshop on
  • Conference_Location
    Nara
  • Print_ISBN
    978-0-7695-3337-7
  • Type
    conf
  • DOI
    10.1109/DAS.2008.48
  • Filename
    4669945