DocumentCode :
2530276
Title :
Xed: a new tool for extracting hidden structures from electronic documents
Author :
Hadjar, Karim ; Rigamonti, Maurizio ; Lalanne, Denis ; Ingold, Rolf
Author_Institution :
DIUF, Fribourg Univ., Switzerland
fYear :
2004
fDate :
2004
Firstpage :
212
Lastpage :
224
Abstract :
PDF has become a very common format for exchanging printable documents. Furthermore, it can be easily generated from the major document formats, which makes a huge number of PDF documents available over the net. However, its use is limited to displaying and printing, which considerably reduces search and retrieval capabilities. For this reason, additional tools have recently appeared that allow the textual content to be extracted. However, their practical use is limited in the sense that the text's reading order is not necessarily preserved, especially when handling multicolumn documents or in the presence of complex layouts. Our thesis is that those tools do not consider the hidden layout and logical structures of documents, which could greatly improve their results. We propose a novel approach to overcome the limitations of document content extraction by merging a) low-level extraction methods applied to PDF files with b) layout analysis performed on a synthetically generated TIFF image. The paper describes the various steps necessary to achieve this task. Finally, we present a first experiment on the restitution of the reading order of newspapers, which shows encouraging results.
Keywords :
document image processing; feature extraction; PDF documents; PDF files; Xed tool; electronic documents; hidden structure extraction; multicolumn document handling; text reading order; textual content extraction; Collaborative work; Data mining; Document handling; Graphics; Image analysis; Indexing; Merging; Printing; Text analysis; XML;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Image Analysis for Libraries, 2004. Proceedings. First International Workshop on
Print_ISBN :
0-7695-2088-X
Type :
conf
DOI :
10.1109/DIAL.2004.1263250
Filename :
1263250