• DocumentCode
    3023943
  • Title
    Identification of document structure and table of content in magazine archives
  • Author
    Yacoub, Sherif; Peiro, Jose Abad
  • Author_Institution
    Hewlett-Packard Espanola, Barcelona, Spain
  • fYear
    2005
  • fDate
    29 Aug.-1 Sept. 2005
  • Firstpage
    1253
  • Abstract
    In this paper, we present a generic approach for reliable identification of the table of content (TOC) pages in scanned documents. We use multiple sources of information to obtain a reliable assessment of the TOC pages and the position of articles. These sources are produced by three methods: title matching, section keyword matching, and numeric content. Finally, a combination component scores potential TOC pages and selects the best candidates. The system is used to identify the table of content, locate the beginning of articles, aid advertisement identification (where advertisements are present), and, in general, identify the structure of scanned documents for article extraction and online deployment of digital content. Results of applying the algorithms to an 80-year archive of Time weekly magazine are presented.
  • Keywords
    document image processing; visual databases; Time weekly magazine; advertisement identification; article extraction; digital contents; document structure identification; magazine archives; numeric content; section keyword matching; table of content identification; title matching
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on
  • ISSN
    1520-5263
  • Print_ISBN
    0-7695-2420-6
  • Type
    conf
  • DOI
    10.1109/ICDAR.2005.133
  • Filename
    1575743