Title :
Model-based information extraction method tolerant of OCR errors for document images
Author :
Ishitani, Yasuto
Author_Institution :
R&D Center, Toshiba Corp., Kawasaki, Japan
fDate :
2001
Abstract :
A new method for information extraction from document images is proposed in this paper as the basis for a document reader which can extract required keywords and their logical relationships from various printed documents. The OCR output for such documents may contain not only unknown words and compound words, but also incorrect words caused by OCR errors. To cope with OCR errors, the proposed method adopts robust keyword matching, which searches for a string pattern in two-dimensional OCR results consisting of a set of possible character candidates. This keyword matching uses a keyword dictionary that includes incorrect words with typical OCR errors, as well as segments of words, to deal with the above difficulties. After keyword matching, global document matching is carried out between the keyword matching results for an input document and document models, which consist of keyword models and their logical relationships. This global matching determines the most suitable model for the input document and solves word segmentation problems accurately even if the document has unknown words, compound words, or incorrect words. Experimental results obtained for 100 documents show that the method is robust and effective for various document structures.
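The idea of matching a keyword against OCR results that keep several character candidates per position can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the `CandidateLattice` type, the `find_keyword` function, and the toy input are hypothetical names chosen for illustration, and the sketch assumes one candidate set per character position with no insertion, deletion, or segmentation errors.

```python
from typing import List, Set

# Hypothetical representation: each text position holds the set of
# character candidates that the OCR engine proposed for that position.
CandidateLattice = List[Set[str]]


def find_keyword(lattice: CandidateLattice, keyword: str) -> List[int]:
    """Return every start offset at which each character of `keyword`
    appears among the candidates at the corresponding position."""
    hits = []
    for start in range(len(lattice) - len(keyword) + 1):
        if all(ch in lattice[start + i] for i, ch in enumerate(keyword)):
            hits.append(start)
    return hits


# Toy line (assumed data): the top-ranked reading might be "T0TAL",
# but the lower-ranked alternatives are kept for matching.
line: CandidateLattice = [{"T", "I"}, {"0", "O"}, {"T"}, {"A", "4"}, {"L", "1"}]

print(find_keyword(line, "TOTAL"))
```

Here the keyword "TOTAL" is found even though the top candidate at the second position could be the digit "0", because the letter "O" survives as an alternative candidate. The dictionary of typical OCR-error variants and the global document matching described in the abstract are outside the scope of this sketch.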
Keywords :
dictionaries; document image processing; image matching; image segmentation; information retrieval; optical character recognition; string matching; OCR errors; compound words; document reader; experimental results; global document matching; keyword dictionary; keyword matching; model-based information extraction; printed documents; unknown words; word segmentation; Data mining; Dictionaries; Error analysis; Image segmentation; Information analysis; Natural language processing; Optical character recognition software; Pattern matching; Research and development; Robustness
Conference_Titel :
Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on
Conference_Location :
Seattle, WA
Print_ISBN :
0-7695-1263-1
DOI :
10.1109/ICDAR.2001.953918