• DocumentCode
    2144192
  • Title

    Performance Evaluation of Algorithms for Newspaper Article Identification

  • Author

    Beretta, Roberto ; Laura, Luigi

  • Author_Institution
    Telpress, S.p.A., Rieti, Italy
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    394
  • Lastpage
    398
  • Abstract
    A typical modern newspaper recognition system operates in four distinct phases: i) page segmentation (also called page decomposition or zoning), the process of decomposing a page into its structural and logical units (called regions or zones); ii) region (or zone) labeling, where the previously identified units are labeled according to their types (title, text, images, and lines); iii) article identification (or tracking or clustering), in which all the units that belong to a single article are clustered together; and iv) reading order identification, in which each item in an article is assigned its reading order within the article. Several works in the literature describe algorithms and metrics for the first two phases, page segmentation and region labeling, which indeed play a crucial role in the whole process. However, few results focus on article identification, a difficult task mainly because of the rich and complex variety of newspaper layouts. In this paper we propose a methodology to evaluate newspaper article identification algorithms. Our approach is based on well-established tools from graph theory: in particular, we reduce the newspaper article clustering problem to a specific graph clustering problem, which is then evaluated using the appropriate coverage and performance measures. The advantages of our approach are twofold. On one side, the proposed measures correctly detect that not all errors are equal, i.e. some errors are worse than others, and the scores are assigned accordingly. On the other side, we show how to reverse the reduction in order to exploit the large number of graph clustering algorithms available: given a graph clustering algorithm, to obtain a fully working newspaper article identification algorithm we only need to define a similarity measure between units in the article. We provide some examples, using a specifically designed dataset. Finally, we would like to point out that both our dataset, together with its ground-truth base, and the software tool that implements the proposed approach are freely available.
  • Keywords
    document handling; graph theory; image segmentation; matrix algebra; pattern clustering; performance evaluation; publishing; software tools; graph clustering problem; graph theory; logical units; metrics; newspaper article clustering; newspaper article identification; page decomposition; page segmentation; performance evaluation; region labeling; software tool; structural units; Algorithm design and analysis; Clustering algorithms; Layout; Performance evaluation; Software algorithms; Software tools; graph clustering; newspaper article identification; performance evaluation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type

    conf

  • DOI
    10.1109/ICDAR.2011.87
  • Filename
    6065342
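
The abstract names coverage and performance as the graph clustering measures used for evaluation. The sketch below illustrates the standard textbook definitions of these two measures on an unweighted graph; it is not the authors' tool. It assumes, purely for illustration, that the ground-truth graph links every pair of units belonging to the same article, and the unit names (`title_a`, `text_a1`, ...) and the function names are hypothetical.

```python
from itertools import combinations


def coverage(edges, cluster_of):
    """Fraction of edges whose two endpoints lie in the same cluster."""
    if not edges:
        return 1.0
    intra = sum(1 for u, v in edges if cluster_of[u] == cluster_of[v])
    return intra / len(edges)


def performance(nodes, edges, cluster_of):
    """Fraction of node pairs that are 'correctly interpreted':
    adjacent pairs placed in the same cluster, plus non-adjacent
    pairs placed in different clusters."""
    edge_set = {frozenset(e) for e in edges}
    pairs = list(combinations(nodes, 2))
    if not pairs:
        return 1.0
    correct = 0
    for u, v in pairs:
        same_cluster = cluster_of[u] == cluster_of[v]
        adjacent = frozenset((u, v)) in edge_set
        if same_cluster == adjacent:
            correct += 1
    return correct / len(pairs)


# Hypothetical page with two articles: A has three units, B has two.
nodes = ["title_a", "text_a1", "text_a2", "title_b", "text_b1"]

# Assumed ground-truth graph: one edge per pair of units in the same article.
edges = [
    ("title_a", "text_a1"),
    ("title_a", "text_a2"),
    ("text_a1", "text_a2"),
    ("title_b", "text_b1"),
]

# A correct clustering, and one that wrongly assigns text_a2 to article B.
truth = {"title_a": 0, "text_a1": 0, "text_a2": 0, "title_b": 1, "text_b1": 1}
predicted = {"title_a": 0, "text_a1": 0, "text_a2": 1, "title_b": 1, "text_b1": 1}

print(coverage(edges, truth), performance(nodes, edges, truth))
print(coverage(edges, predicted), performance(nodes, edges, predicted))
```

Under these assumptions the correct clustering scores 1.0 on both measures, while the clustering with one misplaced text unit scores lower on both, since the misplaced unit breaks two intra-article links and creates two spurious ones.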