Title :
Segmenting Tables via Indexing of Value Cells by Table Headers
Author :
Seth, Sachin ; Nagy, G.
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of Nebraska-Lincoln, Lincoln, NE, USA
Abstract :
Correct segmentation of a web table into its component regions is the essential first step to understanding tabular data. Our algorithmic solution to the segmentation problem relies on the property that strings defining row and column header paths uniquely index each data cell in the table. We segment the table using only "logical layout analysis" without resorting to any appearance features or natural language understanding. We start with a CSV table that preserves the two-dimensional structure and contents of the original source table (e.g., an HTML table) but not its font size, font weight, or color. The indexing property of table headers implies a four-quadrant partitioning of the table about a minimum index point. The algorithm finds the index point through an efficient guided search. Experimental results on a 200-table benchmark demonstrate the generality of the algorithm in handling a variety of table styles and forms.
Keywords :
Internet; database indexing; string matching; text analysis; CSV table; Web table segmentation; column header paths; comma-separated-values format; four-quadrant partitioning; logical layout analysis; minimum index point; row header paths; source table; table forms; table headers; table styles; tabular data; two-dimensional structure preservation; value cells; Algorithm design and analysis; HTML; Image color analysis; Indexing; Text analysis; indexing by header strings; minimum indexing point; table segmentation;
Conference_Title :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/ICDAR.2013.181