• DocumentCode
    2016549
  • Title
    Unsupervised Newspaper Segmentation Using Language Context
  • Author
    Furmaniak, Ralph
  • Author_Institution
    Univ. of Waterloo, Waterloo
  • Volume
    2
  • fYear
    2007
  • fDate
    23-26 Sept. 2007
  • Firstpage
    1263
  • Lastpage
    1267
  • Abstract
    There has been increased interest in the digitization of newspaper archives. A major problem that must be solved is the high-accuracy decomposition of a page into its logical structure. In this paper I present an approach that uses a language similarity measure based on OCR results to train geometric layout rules tailored to an arbitrary title. Experiments have shown this approach to be very effective.
  • Keywords
    image segmentation; information retrieval systems; natural languages; optical character recognition; publishing; unsupervised learning; OCR result; arbitrary title; geometric layout rule training; language similarity measure; newspaper archive digitization; unsupervised newspaper segmentation; Books; Engines; Image processing; Image segmentation; Merging; Optical character recognition software; Testing; Text recognition
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
  • Conference_Location
    Parana
  • ISSN
    1520-5363
  • Print_ISBN
    978-0-7695-2822-9
  • Type
    conf
  • DOI
    10.1109/ICDAR.2007.4377118
  • Filename
    4377118