Regional Information Center for Science and Technology - PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents

DocumentCode :

1634305

Title :

PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents

Author :

Oro, Ermelinda ; Ruffolo, Massimo

Author_Institution :

Dept. of Comput. & Syst. Sci., Univ. of Calabria, Rende, Italy

fYear :

2009

Firstpage :

906

Lastpage :

910

Abstract :

This paper presents PDF-TREX, a heuristic approach for table recognition and extraction from PDF documents. The heuristic starts from an initial set of basic content elements and aligns and groups them bottom-up, considering only their spatial features, in order to identify tabular arrangements of information. The aim of the approach is to recognize tables contained in PDF documents as a 2-dimensional grid on a Cartesian plane and to extract them as a set of cells equipped with 2-dimensional coordinates. Experiments carried out on a dataset composed of tables from documents in different domains show that the approach performs well in recognizing table cells. The approach aims at improving PDF document annotation and information extraction by providing an output that can be further processed for understanding table and document contents.
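
The grid idea in the abstract can be illustrated with a small sketch. This is not the PDF-TREX algorithm itself, which this record does not describe in detail; it is a simplified stand-in that assumes each content element is already available as a text string with the coordinates of its top-left corner, and it groups elements into rows and columns with a plain gap threshold on each axis. The names `Element`, `cluster_1d`, `to_grid` and the threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A basic content element (hypothetical model: text plus top-left corner)."""
    text: str
    x: float  # left edge
    y: float  # top edge (assumed to grow downward)


def cluster_1d(values, gap):
    """Map each value to a cluster index along one axis.

    Values are visited in ascending order; a new cluster starts whenever the
    distance to the previous value exceeds `gap`.
    """
    index = {}
    cluster = -1
    previous = None
    for value in sorted(set(values)):
        if previous is None or value - previous > gap:
            cluster += 1
        index[value] = cluster
        previous = value
    return index


def to_grid(elements, x_gap=10.0, y_gap=5.0):
    """Return {(row, column): text}, using only the spatial features of the elements."""
    rows = cluster_1d([e.y for e in elements], y_gap)
    cols = cluster_1d([e.x for e in elements], x_gap)
    cells = {}
    # Visit elements in reading order so merged cell text stays readable.
    for e in sorted(elements, key=lambda e: (e.y, e.x)):
        cells.setdefault((rows[e.y], cols[e.x]), []).append(e.text)
    return {key: " ".join(parts) for key, parts in cells.items()}


elements = [
    Element("Name", 10, 100),
    Element("Qty", 80, 101),
    Element("Apple", 11, 120),
    Element("3", 82, 119),
]
grid = to_grid(elements)
```

Here `grid` maps 2-dimensional cell coordinates to cell text, so the slightly misaligned elements still land in a 2 x 2 grid. The fixed gap thresholds are the weakest part of the sketch; real documents would need thresholds derived from the page layout.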

Keywords :

data visualisation; document image processing; image recognition; 2D Cartesian plane grid; PDF documents; PDF-TREX approach; table recognition and extraction; Councils; Data mining; Encoding; HTML; High performance computing; Humans; Layout; Text analysis; Visualization; XML; Document Analysis; Hierarchical Clustering; Information Extraction; Table Recognition and Extraction;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on

Conference_Location :

Barcelona

ISSN :

1520-5363

Print_ISBN :

978-1-4244-4500-4

Electronic_ISBN :

1520-5363

Type :

conf

DOI :

10.1109/ICDAR.2009.12

Filename :

5277546

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1634305