Author_Institution :
AT&T Bell Lab., Murray Hill, NJ, USA
Abstract :
An experimental printed-page reader that is easy to adapt to various languages is described. Changing the target language may involve simultaneous changes in symbol sets, typefaces, sizes of text, page layouts, linguistic contexts, and imaging defects. The strategy has been to isolate the effects of these sources of variation within separate, independent engineering subsystems. In this way, it has been possible to construct, with a minimum of manual effort, classifiers for arbitrary combinations of symbols, typefaces, sizes, and imaging defects. An attempt has been made to rid the algorithms of all language-specific rules, relying instead on automatic learning from examples and generalized table-driven methods. For some tasks it has been feasible to avoid language dependency altogether. Linguistic context can be exploited through data-directed filtering algorithms in a uniform and modular manner, so that preexisting tools developed by computational linguists can readily be applied. These principles are illustrated by trials on English, Swedish, Tibetan, and special technical texts.
Keywords :
computational linguistics; document image processing; optical character recognition; OCR; automatic learning; data-directed filtering algorithms; imaging defects; language-specific rules; linguistic contexts; page layouts; page reader; symbol sets; table-driven methods; typefaces; Anatomy; Character recognition; Dictionaries; Encoding; Filtering algorithms; Optical character recognition software; Optical filters; Optical imaging; Shape control; Space technology