Title :
Text Segmentation of Consumer Magazines in PDF Format
Author_Institution :
Hewlett-Packard Labs., Palo Alto, CA, USA
Abstract :
Text segmentation is usually the first step taken towards the reuse and repurposing of PDF documents. Through experimental evaluation, we found that the leading text segmentation algorithms have limitations for contemporary consumer magazines. We propose a new local homogeneity measure based on line space, and incorporate this new feature into a region growing algorithm. Using a fixed set of parameters, our algorithm achieved robust performance on PDF magazines with wide-ranging layouts and styles.
Keywords :
document handling; text analysis; PDF document; PDF format; PDF magazine; contemporary consumer magazine; line space; local homogeneity measure; parameter set; region growing algorithm; text segmentation algorithm; Bismuth; Extraterrestrial measurements; Layout; Merging; Portable document format; Rendering (computer graphics); PDF analysis; page segmentation; text segmentation
Conference_Title :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4577-1350-7
ISSN :
1520-5363
DOI :
10.1109/ICDAR.2011.163