  • DocumentCode
    2146220
  • Title
    Text Segmentation of Consumer Magazines in PDF Format
  • Author
    Fan, Jian
  • Author_Institution
    Hewlett-Packard Labs., Palo Alto, CA, USA
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    794
  • Lastpage
    798
  • Abstract
    Text segmentation is usually the first step taken towards the reuse and repurposing of PDF documents. Through experimental evaluation, we found that the leading text segmentation algorithms have limitations for contemporary consumer magazines. We propose a new local homogeneity measure based on line space, and incorporate this new feature into a region growing algorithm. Using a fixed set of parameters, our algorithm achieved robust performance on PDF magazines with wide-ranging layouts and styles.
  • Keywords
    document handling; text analysis; PDF document; PDF format; PDF magazine; contemporary consumer magazine; line space; local homogeneity measure; parameter set; region growing algorithm; text segmentation algorithm; Bismuth; Extraterrestrial measurements; Layout; Merging; Portable document format; Rendering (computer graphics); PDF analysis; page segmentation; text segmentation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2011.163
  • Filename
    6065420