Anatomy of a versatile page reader

Author

Baird, Henry S.

Author_Institution

AT&T Bell Lab., Murray Hill, NJ, USA

Volume

80

Issue

7

fYear

1992

fDate

7/1/1992 12:00:00 AM

Firstpage

1059

Lastpage

1065

Abstract

An experimental printed-page reader that is easy to adapt to various languages is described. Changing the target language may involve simultaneous changes in symbol sets, typefaces, sizes of text, page layouts, linguistic contexts, and imaging defects. The strategy has been to isolate the effects of these sources of variation within separate, independent engineering subsystems. In this way, it has been possible to construct, with a minimum of manual effort, classifiers for arbitrary combinations of symbols, typefaces, sizes, and imaging defects. An attempt has been made to rid the algorithms of all language-specific rules, relying instead on automatic learning from examples and generalized table-driven methods. For some tasks it has been feasible to avoid language dependency altogether. Linguistic context can be exploited through data-directed filtering algorithms in a uniform and modular manner, so that preexisting tools developed by computational linguists can readily be applied. These principles are illustrated by trials on English, Swedish, Tibetan, and special technical texts.

Keywords

computational linguistics; document image processing; optical character recognition; OCR; automatic learning; data-directed filtering algorithms; imaging defects; language-specific rules; linguistic contexts; page layouts; page reader; symbol sets; table-driven methods; typefaces; Anatomy; Character recognition; Dictionaries; Encoding; Filtering algorithms; Optical character recognition software; Optical filters; Optical imaging; Shape control; Space technology

fLanguage

English

Journal_Title

Proceedings of the IEEE

Publisher

IEEE

ISSN

0018-9219

Type

jour

DOI

10.1109/5.156469

Filename

156469