• DocumentCode
    3019487
  • Title
    Page segmentation for Manhattan and non-Manhattan layout documents via selective CRLA
  • Author
    Sun, Hung-Ming
  • Author_Institution
    Dept. of Inf. Manage., Kainan Univ., Taoyuan, Taiwan
  • fYear
    2005
  • fDate
    29 Aug.-1 Sept. 2005
  • Firstpage
    116
  • Abstract
    The constrained run-length algorithm (CRLA) is a well-known technique for page segmentation. The algorithm is fast and can be used to partition documents with Manhattan layouts. It is not, however, suited to pages whose layouts go beyond the Manhattan format, e.g., irregular halftone images embedded in text paragraphs. This paper presents a modified version of the CRLA, named selective CRLA, which can process documents with both Manhattan and non-Manhattan layouts. The selective CRLA is performed twice, with different sets of parameters, on a label image derived from the input document image. After both passes, the resulting text regions are extracted. The proposed method has been successfully applied to the extraction of text from commercial magazine pages with complicated layouts.
  • Keywords
    document image processing; feature extraction; image segmentation; runlength codes; text analysis; Manhattan layout document; commercial magazine page text extraction; document partitioning; nonManhattan layout document; page segmentation; selective constrained run-length algorithm; Graphics; Image coding; Image segmentation; Information management; Labeling; Layout; Optical character recognition software; Partitioning algorithms; Sun; Text analysis
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on
  • ISSN
    1520-5263
  • Print_ISBN
    0-7695-2420-6
  • Type
    conf
  • DOI
    10.1109/ICDAR.2005.185
  • Filename
    1575521
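
The abstract builds on the classic constrained run-length algorithm. As a rough illustration only, the sketch below shows a textbook form of that baseline: short background runs are filled along rows and along columns, and the two results are combined with a logical AND. It is not the selective CRLA proposed in the paper, whose label image and parameter sets are not given in this record. The function names `smooth_runs` and `crla`, the thresholds `c_horizontal` and `c_vertical`, and the choice to leave runs touching the page border unfilled are all assumptions made for this example. Any final clean-up smoothing pass that some formulations add is omitted.

```python
def smooth_runs(row, max_gap):
    """Fill runs of 0s (background) no longer than max_gap that lie
    between two 1s (foreground). Runs touching either end of the row
    are left unchanged (an assumption of this sketch)."""
    out = list(row)
    last_one = None  # index of the most recent foreground pixel seen
    for i, pixel in enumerate(row):
        if pixel == 1:
            gap = i - last_one - 1 if last_one is not None else 0
            if 0 < gap <= max_gap:
                for j in range(last_one + 1, i):
                    out[j] = 1
            last_one = i
    return out


def crla(image, c_horizontal, c_vertical):
    """Baseline smoothing on a binary image given as a list of rows:
    smooth each row, smooth each column, then AND the two results."""
    horizontal = [smooth_runs(row, c_horizontal) for row in image]
    # Transpose, smooth the columns as rows, then transpose back.
    columns = [smooth_runs(list(col), c_vertical) for col in zip(*image)]
    vertical = [list(row) for row in zip(*columns)]
    return [
        [h & v for h, v in zip(h_row, v_row)]
        for h_row, v_row in zip(horizontal, vertical)
    ]
```

With a threshold of 2, `smooth_runs([1, 0, 0, 1, 0, 0, 0, 1], 2)` fills the two-pixel gap but leaves the three-pixel gap open, which is how nearby characters merge into blocks while wider gutters keep regions apart.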