Title :
A Rule-Based Framework of Metadata Extraction from Scientific Papers
Author :
Guo, Zhixin ; Jin, Hai
Author_Institution :
Cluster & Grid Comput. Lab., Huazhong Univ. of Sci. & Technol., Wuhan, China
Abstract :
Most scientific documents on the web are unstructured or semi-structured, so automatic extraction of document metadata has become an important task. This paper describes a framework for automatic metadata extraction from scientific papers. Based on a spatial and visual knowledge principle, our system extracts the title, authors, and abstract from scientific papers. We use format information such as font size and position to guide the metadata extraction process. The experimental results show that our system achieves high accuracy in header metadata extraction, which can effectively assist automatic index creation for digital libraries.
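The abstract names font size and position as the cues that guide extraction but does not state the rules themselves. The following is a minimal illustrative sketch, not the authors' implementation. It assumes the first page has already been parsed into text lines carrying a font size and a vertical offset; the `Line` type, the `extract_header` function, the half-page cutoff, and the three rules are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    font_size: float  # in points
    y: float          # distance from the top of the page, in points


def extract_header(lines, page_height=792.0, tol=0.1):
    """Apply three hypothetical layout rules to first-page lines.

    Title:    lines in the top half of the page set in the largest font.
    Authors:  the first line below the title and above the abstract heading.
    Abstract: lines after a heading starting with "abstract" that share
              the font size of the first such line.
    """
    ordered = sorted(lines, key=lambda l: l.y)
    top = [l for l in ordered if l.y < page_height / 2]
    if not top:
        return {"title": "", "authors": "", "abstract": ""}

    # Rule 1: the title is assumed to use the largest font near the top.
    max_size = max(l.font_size for l in top)
    title_lines = [l for l in top if abs(l.font_size - max_size) < tol]
    title_end = max(l.y for l in title_lines)

    # Locate the abstract heading, if there is one.
    abstract_idx = next(
        (i for i, l in enumerate(ordered)
         if l.text.strip().lower().startswith("abstract")),
        None,
    )
    abstract_y = ordered[abstract_idx].y if abstract_idx is not None else float("inf")

    # Rule 2: authors are assumed to sit directly below the title block.
    below_title = [l for l in ordered if title_end < l.y < abstract_y]
    authors = below_title[0].text if below_title else ""

    # Rule 3: the abstract body runs until the font size changes.
    abstract_lines = []
    if abstract_idx is not None:
        body = ordered[abstract_idx + 1:]
        if body:
            body_size = body[0].font_size
            for l in body:
                if abs(l.font_size - body_size) >= tol:
                    break
                abstract_lines.append(l)

    return {
        "title": " ".join(l.text for l in title_lines),
        "authors": authors,
        "abstract": " ".join(l.text for l in abstract_lines),
    }


# Made-up sample input for illustration only.
sample = [
    Line("A Rule-Based Framework of Metadata Extraction", 24, 80),
    Line("from Scientific Papers", 24, 108),
    Line("Zhixin Guo, Hai Jin", 12, 150),
    Line("Huazhong Univ. of Sci. & Technol., Wuhan, China", 10, 170),
    Line("Abstract", 10, 220),
    Line("Most scientific documents on the web are unstructured.", 9, 235),
    Line("This paper describes a framework.", 9, 247),
    Line("1. Introduction", 12, 290),
]
result = extract_header(sample)
```

Real papers vary widely in layout, so a practical rule set would need more cues (alignment, font weight, column position) than this sketch uses.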
Keywords :
Internet; digital libraries; document handling; indexing; information retrieval; knowledge based systems; meta data; natural sciences computing; Web; automatic document metadata extraction; automatic index creation; digital libraries; header metadata extraction; rule-based framework; scientific documents; scientific papers; spatial knowledge principle; visual knowledge principle; Accuracy; Data mining; Layout; Libraries; Portable document format; Semantics; XML; document metadata; information extraction; rule-based approach;
Conference_Title :
Distributed Computing and Applications to Business, Engineering and Science (DCABES), 2011 Tenth International Symposium on
Conference_Location :
Wuxi
Print_ISBN :
978-1-4577-0327-0
DOI :
10.1109/DCABES.2011.14