Title :
Dolores: An Interactive and Class-Free Approach for Document Logical Restructuring
Author :
Bloechle, Jean-Luc ; Pugin, Catherine ; Ingold, Rolf
Abstract :
Recovering physical and logical structure from electronic documents is still an open issue. In this paper, we propose a flexible and efficient approach for recovering document structures from PDF files. After a brief introduction to the PDF format and its major features, we report on our evaluation of existing tools and prior work on PDF content extraction and analysis. To overcome the weaknesses of these systems, we propose a new analysis strategy based on an intermediate representation, called XCDF, which represents physical structures in a canonical way. The paper then describes the PDF reverse engineering workflow and focuses on document logical restructuring. Finally, it concludes with potential future improvements.
Keywords :
Costs; Data mining; Feature extraction; Image databases; Postal services; Robustness; Sorting; Spatial databases; Transportation; Visual databases; document restructuring; logical structure; pdf reengineering; physical structure;
Conference_Title :
Document Analysis Systems, 2008. DAS '08. The Eighth IAPR International Workshop on
Conference_Location :
Nara, Japan
Print_ISBN :
978-0-7695-3337-7
DOI :
10.1109/DAS.2008.44