Title :
A Realistic Dataset for Performance Evaluation of Document Layout Analysis
Author :
Antonacopoulos, A. ; Bridson, D. ; Papadopoulos, C. ; Pletschacher, S.
Author_Institution :
Res. Lab., Sch. of Comput., Sci. & Eng., Univ. of Salford, Manchester, UK
Abstract :
There is a significant need for a realistic dataset on which to evaluate layout analysis methods and examine their performance in detail. This paper presents a new dataset (and the methodology used to create it) based on a wide range of contemporary documents. Strong emphasis is placed on comprehensive and detailed representation of both complex and simple layouts, and on colour originals. In-depth information is recorded both at the page and region level. Ground truth is efficiently created using a new semi-automated tool and stored in a new comprehensive XML representation, the PAGE format. The dataset can be browsed and searched via a Web-based front end to the underlying database and suitable subsets (relevant to specific evaluation goals) can be selected and downloaded.
Keywords :
XML; document handling; online front-ends; software performance evaluation; PAGE format; Web-based front end; comprehensive XML representation; contemporary documents; document layout analysis; performance evaluation; realistic dataset; Data engineering; Databases; Image analysis; Image color analysis; Image recognition; Pattern analysis; Pattern recognition; Performance analysis; Text analysis; datasets; ground truth format; layout analysis; page segmentation; region classification
Conference_Title :
10th International Conference on Document Analysis and Recognition (ICDAR '09), 2009
Conference_Location :
Barcelona
Print_ISBN :
978-1-4244-4500-4
ISSN :
1520-5363
DOI :
10.1109/ICDAR.2009.271