DocumentCode
2144192
Title
Performance Evaluation of Algorithms for Newspaper Article Identification
Author
Beretta, Roberto ; Laura, Luigi
Author_Institution
Telpress, S.p.A., Rieti, Italy
fYear
2011
fDate
18-21 Sept. 2011
Firstpage
394
Lastpage
398
Abstract
A typical modern newspaper recognition system operates in distinct phases: i) page segmentation (also called page decomposition or zoning), the process of decomposing a page into its structural and logical units (called regions or zones); ii) region (or zone) labeling, where the previously identified units are labeled according to their types (title, text, images, and lines); iii) article identification (or tracking or clustering), in which all the units that belong to a single article are clustered together; and iv) reading order identification, in which each item in an article is assigned its reading order within the article. So far, several works in the literature have described algorithms and metrics for the first two phases, i.e. page segmentation and region labeling, which indeed play a crucial role in the whole process. However, few results have focused on article identification, which is a difficult task mainly because of the rich and complex variety of newspaper layouts. In this paper we propose a methodology to evaluate newspaper article identification algorithms. Our approach is based on well-established tools from graph theory: in particular, we reduce the newspaper article clustering problem to a specific graph clustering problem, which is then evaluated using the appropriate coverage and performance measures. The advantages of our approach are twofold. On the one hand, the proposed measures correctly detect that not all errors are equal, i.e. some errors are worse than others, and the scores are assigned accordingly. On the other hand, we show how to reverse the reduction in order to exploit the large number of graph clustering algorithms available: given a graph clustering algorithm, to obtain a fully working newspaper article identification algorithm we only need to define a similarity measure between units in the article. We provide some examples using a specifically designed dataset. Finally, we would like to point out that both our dataset, together with its ground-truth base, and the software tool that implements the proposed approach are freely available.
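The abstract names coverage and performance as the evaluation measures but does not give their definitions or the exact reduction. The following is a minimal sketch using the standard textbook definitions for unweighted graphs: coverage is the fraction of edges that fall inside clusters, and performance is the fraction of unit pairs that are either connected and clustered together or unconnected and separated. The graph construction used here (one edge between every two units of the same ground-truth article) and the names `same_article_edges` and `coverage_and_performance` are assumptions made for illustration, not the paper's method or tool.

```python
from itertools import combinations


def same_article_edges(articles):
    """Hypothetical reduction: connect every pair of units in the same article."""
    edges = set()
    for units in articles:
        for u, v in combinations(sorted(units), 2):
            edges.add(frozenset((u, v)))
    return edges


def coverage_and_performance(n_units, edges, clusters):
    """Score a clustering of units 0..n_units-1 against an unweighted graph.

    edges    -- set of frozenset({u, v}) pairs
    clusters -- list of sets that partition range(n_units)
    """
    # Map each unit to the index of the cluster that contains it.
    label = {u: cid for cid, cluster in enumerate(clusters) for u in cluster}

    intra_edges = 0      # edges whose endpoints share a cluster
    inter_non_edges = 0  # unconnected pairs placed in different clusters
    for u, v in combinations(range(n_units), 2):
        same_cluster = label[u] == label[v]
        is_edge = frozenset((u, v)) in edges
        if same_cluster and is_edge:
            intra_edges += 1
        elif not same_cluster and not is_edge:
            inter_non_edges += 1

    total_pairs = n_units * (n_units - 1) // 2
    coverage = intra_edges / len(edges) if edges else 1.0
    performance = (intra_edges + inter_non_edges) / total_pairs if total_pairs else 1.0
    return coverage, performance


# Assumed toy page: five units, two ground-truth articles.
truth = [{0, 1, 2}, {3, 4}]
graph = same_article_edges(truth)

split = [{0, 1}, {2}, {3, 4}]   # one unit detached from its article
merged = [{0, 1, 2, 3, 4}]      # both articles fused into one

print(coverage_and_performance(5, graph, split))
print(coverage_and_performance(5, graph, merged))
```

On this toy page the two mistakes receive different scores, which is the kind of distinction between errors that the abstract describes; the specific numbers depend entirely on the assumed graph construction.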
Keywords
document handling; graph theory; image segmentation; matrix algebra; pattern clustering; performance evaluation; publishing; software tools; graph clustering problem; graph theory; logical units; metrics; newspaper article clustering; newspaper article identification; page decomposition; page segmentation; performance evaluation; region labeling; software tool; structural units; Algorithm design and analysis; Clustering algorithms; Layout; Performance evaluation; Software algorithms; Software tools; graph clustering; newspaper article identification; performance evaluation;
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location
Beijing
ISSN
1520-5363
Print_ISBN
978-1-4577-1350-7
Electronic_ISBN
1520-5363
Type
conf
DOI
10.1109/ICDAR.2011.87
Filename
6065342