Title :
Parameter-free geometric document layout analysis
Author :
Lee, Seong-Whan ; Ryu, Dae-Seok
Author_Institution :
Center for Artificial Vision Res., Korea Univ., Seoul, South Korea
fDate :
11/1/2001
Abstract :
Automatic transformation of paper documents into electronic documents requires geometric document layout analysis as its first stage. However, variations in character font sizes, text line spacing, and document layout structures have long made it difficult to design a general-purpose document layout analysis algorithm, so previous methods have had to rely on tunable parameters. The authors propose a parameter-free method for segmenting document images into maximal homogeneous regions and identifying them as texts, images, tables, and ruling lines. A pyramidal quadtree structure is constructed for multiscale analysis, and a periodicity measure is proposed to detect the periodic attribute of text regions for page segmentation. To obtain robust page segmentation results, a confirmation procedure using texture analysis is applied only to ambiguous regions. Together, the periodicity measure, multiscale analysis, and confirmation procedure yield a robust method for geometric document layout analysis that is independent of character font sizes, text line spacing, and document layout structures. The proposed method was evaluated on the document database from the University of Washington and on the MediaTeam Document Database. The results of these tests show that the proposed method is more accurate than previous methods.
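The abstract names two building blocks, a pyramidal quadtree and a periodicity measure, without giving their definitions. The following is only a minimal sketch of the general idea, not the authors' formulation. It assumes a square binary image whose side is a power of two, builds the pyramid by OR-reducing 2x2 blocks, and uses the peak normalized autocorrelation of the horizontal projection profile as a stand-in periodicity score. The function names `build_pyramid`, `projection_profile`, and `periodicity_score` are hypothetical.

```python
def build_pyramid(image):
    """Return pyramid levels, finest first.

    Assumption: `image` is a square list of rows of 0/1 pixels whose side
    is a power of two. Each coarser cell is 1 if any of its four children
    is 1 (an OR-reduction chosen for illustration only).
    """
    levels = [image]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        half = len(prev) // 2
        coarser = [
            [
                1 if (prev[2 * r][2 * c] or prev[2 * r][2 * c + 1]
                      or prev[2 * r + 1][2 * c] or prev[2 * r + 1][2 * c + 1])
                else 0
                for c in range(half)
            ]
            for r in range(half)
        ]
        levels.append(coarser)
    return levels


def projection_profile(region):
    """Count the foreground pixels in each row of a region."""
    return [sum(row) for row in region]


def periodicity_score(profile):
    """Peak normalized autocorrelation over lags 1..n/2.

    Illustrative stand-in for a periodicity measure: regularly spaced text
    lines give a high score, a flat profile gives 0.0.
    """
    n = len(profile)
    mean = sum(profile) / n
    centered = [p - mean for p in profile]
    energy = sum(c * c for c in centered)
    if energy == 0:
        return 0.0  # flat profile: no periodic structure to detect
    best = 0.0
    for lag in range(1, n // 2 + 1):
        corr = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        best = max(best, corr / energy)
    return best
```

In this sketch a region with evenly spaced "text lines" scores well above a uniformly filled region, and the score can be computed at every pyramid level, so no line-spacing threshold is fixed in advance. How the paper combines scores across scales is not stated in the abstract and is not modeled here.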
Keywords :
document image processing; image segmentation; quadtrees; text analysis; MediaTeam Document Database; ambiguous regions; automatic transformation; character font sizes; confirmation procedure; document image segmentation; document layout structures; electronic documents; general-purpose document layout analysis algorithm; geometric document layout analysis; maximal homogeneous regions; multiscale analysis; page segmentation; paper documents; parameter-free geometric document layout analysis; parameter-free method; periodical attribute; periodicity measure; pyramidal quadtree structure; robust method; ruling lines; text line spacing; text regions; texture analysis; Algorithm design and analysis; Image analysis; Image segmentation; Information analysis; Information technology; Performance analysis; Robustness; Size measurement; Spatial databases; Text analysis;
Journal_Title :
Pattern Analysis and Machine Intelligence, IEEE Transactions on