DocumentCode :
1638519
Title :
Metadata Extraction from PDF Papers for Digital Library Ingest
Author :
Marinai, Simone
Author_Institution :
Dipt. di Sist. e Inf., Univ. di Firenze, Firenze, Italy
fYear :
2009
Firstpage :
251
Lastpage :
255
Abstract :
In this paper we analyze our recent research on the use of document analysis techniques for metadata extraction from PDF papers. We describe a package that is designed to extract basic metadata from these documents. The package is used in combination with a digital library software suite to easily build personal digital libraries. The proposed software is based on a suitable combination of several techniques that include PDF parsing, low level document image processing, and layout analysis. In addition, we use the information gathered from a widely known citation database (DBLP) to assist the tool in the difficult task of author identification. The system is tested on some paper collections selected from recent conference proceedings.
Keywords :
citation analysis; digital libraries; document image processing; grammars; meta data; PDF papers; PDF parsing; author identification; citation database; digital library software suite; document analysis; layout analysis; low level document image processing; metadata extraction; Data mining; Electronics packaging; HTML; Internet; Labeling; Open source software; Publishing; Software libraries; Software packages; Text analysis; Digital Library; Layout Analysis; Neural Network; PDF; XML;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on
Conference_Location :
Barcelona
ISSN :
1520-5363
Print_ISBN :
978-1-4244-4500-4
Electronic_ISBN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2009.232
Filename :
5277711