DocumentCode :
2060505
Title :
A retargetable table reader
Author :
Shamilian, John H. ; Baird, Henry S. ; Wood, Thomas L.
Author_Institution :
Lucent Technol. Inc., AT&T Bell Labs., Holmdel, NJ, USA
Volume :
1
fYear :
1997
fDate :
18-20 Aug 1997
Firstpage :
158
Abstract :
We describe the architecture of a system for reading machine-printed documents in known predefined tabular-data layout styles. In these tables, textual data are presented in record lines made up of fixed-width fields. Tables often do not rely on line-art (ruled lines) to delimit fields, and in this way differ crucially from fixed forms. Our system performs these steps: copes with multiple tables per page; identifies records within tables; segments records into fields; and recognizes characters within fields, constrained by field-specific contextual knowledge. Obstacles to good performance on tables include small print, tight line-spacing, poor-quality text (such as photocopies), and line-art or background patterns that touch the text. Precise skew-correction and pitch-estimation, and high-performance OCR using neural nets, proved crucial in overcoming these obstacles. The most significant technical advances in this work appear to be algorithms for identifying and segmenting records with known layout, and integration of these algorithms with a graphical user interface (GUI) for defining new layouts. This GUI has been ergonomically designed to make efficient and intuitive use of exemplary images, so that the skill and manual effort required to retarget the system to new table layouts are held to a minimum. The system has been applied in this way to more than 400 distinct tabular layouts. During the last three years the system has read over fifty million records with high accuracy.
Keywords :
document image processing; image segmentation; neural nets; optical character recognition; background patterns; field-specific contextual knowledge; fixed-width fields; graphical user interface; high-performance OCR; line-art; machine-printed documents; neural nets; photocopies; pitch-estimation; predefined tabular-data layout; record lines; retargetable table reader; segmentation; skew-correction; small print; textual data; tight line-spacing; Business; Character recognition; Finance; Graphical user interfaces; Image segmentation; Layout; Medical services; Neural networks; Optical character recognition software; Telecommunications;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition, 1997., Proceedings of the Fourth International Conference on
Conference_Location :
Ulm
Print_ISBN :
0-8186-7898-4
Type :
conf
DOI :
10.1109/ICDAR.1997.619833
Filename :
619833