Performance Evaluation of Algorithms for Newspaper Article Identification

Author

Beretta, Roberto ; Laura, Luigi

Author_Institution

Telpress, S.p.A., Rieti, Italy

fYear

2011

fDate

18-21 Sept. 2011

Firstpage

394

Lastpage

398

Abstract

A typical modern newspaper recognition system operates in distinct phases: i) page segmentation (also called page decomposition or zoning), that is, the process of decomposing a page into its structural and logical units (called regions or zones); ii) region (or zone) labeling, where the previously identified units are labeled according to their types (title, text, images, and lines); iii) article identification (or tracking or clustering), in which all the units that belong to a single article are clustered together; and iv) reading order identification, in which each item in an article is assigned its reading order inside the article. So far, several works in the literature have described algorithms and metrics for the first two phases, i.e. page segmentation and region labeling, which indeed play a crucial role in the whole process. However, few results have focused on article identification, which is a difficult task mainly due to the rich and complex variety of newspaper layouts. In this paper we propose a methodology to evaluate newspaper article identification algorithms. Our approach is based on well-established tools from graph theory: in particular, we reduce the newspaper article clustering problem to a specific graph clustering problem, which is then evaluated using the appropriate coverage and performance measures. The advantages of our approach are twofold. On one side, the proposed measures correctly detect that not all errors are equal, i.e. some errors are worse than others, and the scores are assigned accordingly. On the other side, we show how to reverse the reduction in order to exploit the large number of graph clustering algorithms available: indeed, given a graph clustering algorithm, to obtain a fully working newspaper article identification algorithm we only need to define a similarity measure between units in the article. We provide some examples, using a specifically designed dataset.
Finally, we would like to point out that both our dataset, together with its ground-truth base, and the software tool that implements the proposed approach are freely available.
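
The coverage and performance measures named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions for an unweighted graph: coverage is the fraction of edges that fall inside clusters, and performance is the fraction of node pairs that are handled correctly (intra-cluster edges plus non-adjacent pairs placed in different clusters). The abstract does not say how the paper builds its graph or whether it weights edges, so the graph construction (an edge between two regions of the same ground-truth article), the region names, and the function name `coverage_and_performance` are all assumptions made for illustration.

```python
from itertools import combinations


def coverage_and_performance(nodes, edges, cluster_of):
    """Return (coverage, performance) of a clustering of an unweighted graph.

    nodes:      list of node identifiers (here: page regions, hypothetical names)
    edges:      iterable of 2-tuples of nodes, no self-loops
    cluster_of: dict mapping each node to its cluster label
    """
    edge_set = {frozenset(e) for e in edges}

    # Edges whose two endpoints were placed in the same cluster.
    intra_edges = sum(
        1 for e in edge_set if len({cluster_of[v] for v in e}) == 1
    )

    # Pairs placed in different clusters that are indeed not connected.
    separated_non_edges = sum(
        1
        for u, v in combinations(nodes, 2)
        if cluster_of[u] != cluster_of[v] and frozenset((u, v)) not in edge_set
    )

    total_pairs = len(nodes) * (len(nodes) - 1) // 2
    coverage = intra_edges / len(edge_set) if edge_set else 1.0
    performance = (
        (intra_edges + separated_non_edges) / total_pairs if total_pairs else 1.0
    )
    return coverage, performance


# Assumed toy page: article A = {t1, p1, p2}, article B = {t2, p3}.
# Assumption: regions of the same ground-truth article are pairwise linked.
nodes = ["t1", "p1", "p2", "t2", "p3"]
edges = [("t1", "p1"), ("t1", "p2"), ("p1", "p2"), ("t2", "p3")]

# A candidate clustering that wrongly moves paragraph p2 into article B.
candidate = {"t1": "A", "p1": "A", "p2": "B", "t2": "B", "p3": "B"}
# The ground-truth clustering itself.
truth = {"t1": "A", "p1": "A", "p2": "A", "t2": "B", "p3": "B"}

print(coverage_and_performance(nodes, edges, candidate))  # (0.5, 0.6)
print(coverage_and_performance(nodes, edges, truth))      # (1.0, 1.0)
```

In this toy case, misplacing a single paragraph cuts two of the four ground-truth edges and merges p2 with two unrelated regions, so the scores depend on how many pairwise relations an error breaks, not merely on the number of misplaced regions.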

Keywords

document handling; graph theory; image segmentation; matrix algebra; pattern clustering; performance evaluation; publishing; software tools; graph clustering problem; logical units; metrics; newspaper article clustering; newspaper article identification; page decomposition; page segmentation; region labeling; software tool; structural units; Algorithm design and analysis; Clustering algorithms; Layout; Software algorithms; graph clustering

fLanguage

English

Publisher

IEEE

Conference_Titel

Document Analysis and Recognition (ICDAR), 2011 International Conference on

Conference_Location

Beijing

ISSN

1520-5363

Print_ISBN

978-1-4577-1350-7

Electronic_ISBN

1520-5363

Type

conf

DOI

10.1109/ICDAR.2011.87

Filename

6065342