DocumentCode
3340098
Title
Feature Extraction for Document Image Segmentation by pLSA Model
Author
Yamaguchi, Takuma ; Maruyama, Minoru
Author_Institution
Dept. of Inf. Eng., Shinshu Univ., Nagano
fYear
2008
fDate
16-19 Sept. 2008
Firstpage
53
Lastpage
60
Abstract
In this paper, we propose a method for document image segmentation based on the pLSA (probabilistic latent semantic analysis) model. The pLSA model was originally developed for topic discovery in text analysis using a "bag-of-words" document representation. The model is also useful for image analysis when images are given a "bag-of-visual-words" representation. The performance of the method depends on the visual vocabulary generated by feature extraction from the document image. We compare several feature extraction and description methods and examine their relation to segmentation performance. Through experiments, we show that accurate content-based document segmentation is made possible by the pLSA-based method.
Keywords
document image processing; feature extraction; image representation; image segmentation; text analysis; document image segmentation; document representation; feature extraction; image representation; probabilistic latent semantic analysis; text analysis; topic discovery; visual vocabulary; Engines; Feature extraction; Graphical models; Image analysis; Image representation; Image segmentation; Information analysis; Optical character recognition software; Text analysis; Vocabulary; document image segmentation; feature extraction; topic model;
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis Systems, 2008. DAS '08. The Eighth IAPR International Workshop on
Conference_Location
Nara
Print_ISBN
978-0-7695-3337-7
Type
conf
DOI
10.1109/DAS.2008.48
Filename
4669945