Title :
A Model Based Framework for Table Processing in Degraded Document Images
Author :
Shi, Zhixin ; Setlur, Srirangaraj ; Govindaraju, Vengatesan
Author_Institution :
Dept. of Comput. Sci. & Eng., State Univ. of New York, Buffalo, NY, USA
Abstract :
This paper describes a model-based framework for detecting and extracting the contents of table cells from degraded handwritten document images that contain tables. Given the very poor quality of the target documents, the table cell detection problem is formulated conceptually as a two-step process. The first step is to identify the location of the table and extract the content of the table cells, given a model of the structure of the table present in the image. The second step is to identify the model of the table present in a document image from a list of given table models. A model-based representation for tables is introduced and used to match table candidates against the given model in order to identify and extract the contents of table cells. Potential table candidates are detected from horizontal and vertical table line candidates. The table representation is a matrix of horizontal and vertical table line crossings, and the matching algorithm is formulated as a minimization problem: the optimal table candidate is the one with the minimal distance between the candidate and model table matrices, and it is then used to extract the table cell contents. A similar approach solves the model selection problem: the best-fitting location in the document page is identified for each candidate model using the same distance minimization, a confidence score is computed for each, and the model with the highest confidence score is selected as the correct model. The approach was tested on document page images containing tables from the challenge set of the DARPA MADCAT handwritten document image data. Results indicate that the method is effective for both model selection and table cell content extraction.
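The matching step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the distance measure or the search strategy, so everything below is an assumption made for illustration: the function names (`crossing_matrix`, `matrix_distance`, `best_match`) are hypothetical, the distance is a plain Euclidean distance over normalized crossing points, and the search is brute force over subsets of detected line positions. It also assumes that detected line positions are distinct and that the model has at least two horizontal and two vertical lines.

```python
from itertools import combinations
from math import sqrt


def crossing_matrix(ys, xs):
    """Build a matrix of line crossings, normalized to the unit square.

    ys and xs are sorted positions of horizontal and vertical lines.
    Normalizing removes the table's position and size, so a model and
    a candidate can be compared directly (illustrative assumption).
    """
    y0, x0 = ys[0], xs[0]
    height, width = ys[-1] - y0, xs[-1] - x0
    return [[((y - y0) / height, (x - x0) / width) for x in xs] for y in ys]


def matrix_distance(a, b):
    """Euclidean distance between two crossing matrices of equal shape."""
    total = 0.0
    for row_a, row_b in zip(a, b):
        for (ya, xa), (yb, xb) in zip(row_a, row_b):
            total += (ya - yb) ** 2 + (xa - xb) ** 2
    return sqrt(total)


def best_match(model_ys, model_xs, cand_ys, cand_xs):
    """Pick the subset of detected lines whose crossings best fit the model.

    Returns (distance, chosen_ys, chosen_xs). Brute force is used only
    to keep the sketch short; it is exponential in the number of lines.
    """
    model = crossing_matrix(model_ys, model_xs)
    best = (float("inf"), None, None)
    for ys in combinations(sorted(cand_ys), len(model_ys)):
        for xs in combinations(sorted(cand_xs), len(model_xs)):
            d = matrix_distance(crossing_matrix(ys, xs), model)
            if d < best[0]:
                best = (d, ys, xs)
    return best
```

For example, `best_match([0, 10, 20], [0, 30, 60], [100, 105, 150, 200], [50, 200, 260, 350])` would discard the spurious lines at 105 and 260 and keep the evenly spaced ones. Once the lines are chosen, each cell region is the rectangle between adjacent selected lines.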
Keywords :
document image processing; feature extraction; handwriting recognition; image matching; minimisation; DARPA MADCAT handwritten document image data; degraded document images; handwritten document images; horizontal table line; matching algorithm; matching table; minimization problem; model based framework; table cell contents; table cell detection problem; table processing; table representation; target documents; vertical table line; Dynamic programming; HTML; Image segmentation; Matrix converters; Noise; Text analysis; Handwritten Arabic documents; Handwritten document processing; Table cell extraction; Table detection; Table processing;
Conference_Title :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/ICDAR.2013.195