Title :
An Automatic System for Extracting Figures and Captions in Biomedical PDF Documents
Author :
Lopez, Luis D. ; Yu, Jingyi ; Arighi, Cecilia N. ; Huang, Hongzhan ; Shatkay, Hagit ; Wu, Cathy
Author_Institution :
Dept. of Comput. & Inf. Sci., Univ. of Delaware, Newark, DE, USA
Abstract :
Figures in biomedical articles often constitute direct evidence of experimental results. Image analysis methods can be coupled with text-based methods to improve knowledge discovery. However, automatically harvesting figures along with their associated captions from full-text articles remains challenging. In this paper, we present an automatic system for robustly harvesting figures from biomedical literature. Our approach relies on the idea that the PDF specification of the document layout can be used to identify encoded figures and figure boundaries within the PDF and to enforce constraints among figure regions. This allows us to harvest fragments of figures (subfigures) from the PDF, correctly identify subfigures that belong to the same figure, and identify the captions associated with each figure. Our method simultaneously recovers figures and captions and applies an additional filtering process to remove irrelevant figures such as logos, to eliminate text passages that were incorrectly identified as captions, and to re-group subfigures to generate a putative figure. Finally, we associate figures with captions. Our preliminary experiments suggest that our method achieves an accuracy of 95% in harvesting figure-caption pairs from a set of 2,035 full-text biomedical documents from BioCreative III, containing 12,574 figures.
Keywords :
data mining; formal specification; image retrieval; medical computing; text analysis; automatic system; biomedical PDF documents; biomedical articles; caption extraction; document layout PDF specification; figure extraction; filtering process; image analysis methods; knowledge discovery; Biomedical imaging; Databases; Image resolution; Layout; Merging; Portable document format; Robustness; biomedical documents; biomedical images; figures and captions; images; information retrieval;
Conference_Title :
Bioinformatics and Biomedicine (BIBM), 2011 IEEE International Conference on
Conference_Location :
Atlanta, GA
Print_ISBN :
978-1-4577-1799-4
DOI :
10.1109/BIBM.2011.26
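
Illustrative sketch (not from the paper):
The abstract describes three steps: re-grouping subfigures into putative figures, filtering out irrelevant images such as logos, and associating figures with captions. The following is a minimal sketch of how such steps might look, not the authors' implementation. It assumes that image and caption bounding boxes have already been extracted from the PDF, that coordinates use a top-left origin with y growing downward, and that captions sit below their figures. The names Box, group_subfigures, drop_small and pair_captions, and the thresholds max_gap and min_area, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box; top-left origin, y grows downward (assumption)."""
    x0: float
    y0: float
    x1: float
    y1: float


def gap(a, b):
    """Largest axis-wise separation between two boxes (0 if they touch or overlap)."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return max(dx, dy)


def union(a, b):
    """Smallest box that covers both a and b."""
    return Box(min(a.x0, b.x0), min(a.y0, b.y0), max(a.x1, b.x1), max(a.y1, b.y1))


def group_subfigures(boxes, max_gap=10.0):
    """Merge boxes that lie within max_gap of each other until nothing changes.

    max_gap is a hypothetical threshold, not a value taken from the paper.
    """
    figures = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(figures)):
            for j in range(i + 1, len(figures)):
                if gap(figures[i], figures[j]) <= max_gap:
                    figures[i] = union(figures[i], figures[j])
                    del figures[j]
                    merged = True
                    break
            if merged:
                break
    return figures


def drop_small(figures, min_area=2500.0):
    """Discard very small regions, a crude stand-in for logo filtering."""
    return [f for f in figures if (f.x1 - f.x0) * (f.y1 - f.y0) >= min_area]


def pair_captions(figures, captions):
    """Pair each figure with the nearest caption that starts below it.

    captions is a list of (Box, text) tuples. A caption qualifies only if it
    overlaps the figure horizontally. Figures with no candidate get None.
    """
    pairs = []
    for fig in figures:
        candidates = [
            (cap_box.y0 - fig.y1, text)
            for cap_box, text in captions
            if cap_box.y0 >= fig.y1 and cap_box.x0 < fig.x1 and cap_box.x1 > fig.x0
        ]
        pairs.append((fig, min(candidates)[1] if candidates else None))
    return pairs
```

A typical call chain under these assumptions would be pair_captions(drop_small(group_subfigures(boxes)), captions). Size-based filtering and nearest-below matching are simplifications; the abstract does not state which criteria the system actually uses.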