DocumentCode
778869
Title
Anatomy of a versatile page reader
Author
Baird, Henry S.
Author_Institution
AT&T Bell Lab., Murray Hill, NJ, USA
Volume
80
Issue
7
fYear
1992
fDate
7/1/1992
Firstpage
1059
Lastpage
1065
Abstract
An experimental printed-page reader that is easy to adapt to various languages is described. Changing the target language may involve simultaneous changes in symbol sets, typefaces, sizes of text, page layouts, linguistic contexts, and imaging defects. The strategy has been to isolate the effects of these sources of variation within separate, independent engineering subsystems. In this way, it has been possible to construct, with a minimum of manual effort, classifiers for arbitrary combinations of symbols, typefaces, sizes, and imaging defects. An attempt has been made to rid the algorithms of all language-specific rules, relying instead on automatic learning from examples and generalized table-driven methods. For some tasks it has been feasible to avoid language dependency altogether. Linguistic context can be exploited through data-directed filtering algorithms in a uniform and modular manner, so that preexisting tools developed by computational linguists can readily be applied. These principles are illustrated by trials on English, Swedish, Tibetan, and special technical texts.
Keywords
computational linguistics; document image processing; optical character recognition; OCR; automatic learning; data-directed filtering algorithms; imaging defects; language-specific rules; linguistic contexts; page layouts; page reader; symbol sets; table-driven methods; typefaces; Anatomy; Character recognition; Dictionaries; Encoding; Filtering algorithms; Optical character recognition software; Optical filters; Optical imaging; Shape control; Space technology
fLanguage
English
Journal_Title
Proceedings of the IEEE
Publisher
IEEE
ISSN
0018-9219
Type
jour
DOI
10.1109/5.156469
Filename
156469