Feature Extraction for Document Image Segmentation by pLSA Model

Author

Yamaguchi, Takuma ; Maruyama, Minoru

Author_Institution

Dept. of Inf. Eng., Shinshu Univ., Nagano

fYear

2008

fDate

16-19 Sept. 2008

Firstpage

53

Lastpage

60

Abstract

In this paper, we propose a method for document image segmentation based on the pLSA (probabilistic latent semantic analysis) model. The pLSA model was originally developed for topic discovery in text analysis using a "bag-of-words" document representation. The model is also useful for image analysis when combined with a "bag-of-visual-words" image representation. The performance of the method depends on the visual vocabulary generated by feature extraction from the document image. We compare several feature extraction and description methods, and examine how they relate to segmentation performance. Through experiments, we show that accurate content-based document segmentation is made possible by the pLSA-based method.

Keywords

document image processing; feature extraction; image representation; image segmentation; text analysis; document image segmentation; document representation; probabilistic latent semantic analysis; topic discovery; visual vocabulary; Engines; Graphical models; Image analysis; Information analysis; Optical character recognition software; Vocabulary; topic model

fLanguage

English

Publisher

IEEE

Conference_Titel

The Eighth IAPR International Workshop on Document Analysis Systems, 2008 (DAS '08)

Conference_Location

Nara

Print_ISBN

978-0-7695-3337-7

Type

conf

DOI

10.1109/DAS.2008.48

Filename

4669945