DocumentCode
2146220
Title
Text Segmentation of Consumer Magazines in PDF Format
Author
Fan, Jian
Author_Institution
Hewlett-Packard Labs., Palo Alto, CA, USA
fYear
2011
fDate
18-21 Sept. 2011
Firstpage
794
Lastpage
798
Abstract
Text segmentation is usually the first step taken towards the reuse and repurposing of PDF documents. Through experimental evaluation, we found that the leading text segmentation algorithms have limitations for contemporary consumer magazines. We propose a new local homogeneity measure based on line space, and incorporate this new feature into a region growing algorithm. Using a fixed set of parameters, our algorithm achieved robust performance on PDF magazines with wide-ranging layouts and styles.
Keywords
document handling; text analysis; PDF document; PDF format; PDF magazine; contemporary consumer magazine; line space; local homogeneity measure; parameter set; region growing algorithm; text segmentation algorithm; Bismuth; Extraterrestrial measurements; Layout; Merging; Portable document format; Rendering (computer graphics); PDF analysis; page segmentation; text segmentation
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location
Beijing
ISSN
1520-5363
Print_ISBN
978-1-4577-1350-7
Electronic_ISBN
1520-5363
Type
conf
DOI
10.1109/ICDAR.2011.163
Filename
6065420