Evaluating OCR and non-OCR text representations for learning document classifiers

Author

Junker, Markus ; Hoch, Rainer

Author_Institution

Res. Center for Artificial Intelligence, Kaiserslautern, Germany

Volume

2

fYear

1997

fDate

18-20 Aug 1997

Firstpage

1060

Abstract

In the literature, many feature types and learning algorithms have been proposed for document classification. However, an extensive and systematic evaluation of the various approaches has not been done yet. In order to investigate different text representations for document classification, we have developed a tool which transforms documents into feature-value representations that are suitable for standard learning algorithms. In this paper, we investigate seven document representations for German texts based on n-grams and single words. We compare their effectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that the use of n-grams is an attractive technique which can even compare to techniques relying on a morphological analysis. This holds for OCR texts as well as for correct ASCII texts.
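
The contrast between n-gram and single-word representations can be illustrated with a minimal sketch. This is not the tool described in the abstract: the function names `char_ngrams` and `word_features`, the choice of n = 3, the lowercasing step, and the sample OCR error are all assumptions made for illustration only.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased text (hypothetical helper)."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def word_features(text):
    """Count whitespace-separated words in a lowercased text (hypothetical helper)."""
    return Counter(text.lower().split())


# Assumed example: a German word and a made-up OCR misreading of it.
clean = "Rechnung"
ocr = "Rechnnng"

# Counter intersection keeps the features present in both representations.
shared_ngrams = char_ngrams(clean) & char_ngrams(ocr)
shared_words = word_features(clean) & word_features(ocr)

print(sorted(shared_ngrams))  # trigrams that survive the OCR error
print(sorted(shared_words))   # whole words that survive the OCR error
```

In this toy case a single character error removes the only word feature, while some of the trigrams are still shared, which is one intuition for why n-grams might suit OCR text. The feature counts produced this way could be passed to any standard learning algorithm as a feature-value representation.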

Keywords

abstracting; business forms; data structures; document image processing; image classification; learning systems; nomograms; optical character recognition; ASCII texts; German texts; OCR text representations; abstracts; business letters; document classification; feature types; feature-value representations; learning document classifiers; morphological analysis; n-grams; nonOCR text representations; single words; technical reports; Artificial intelligence; Business communication; Character recognition; Electronic mail; Frequency; Image recognition; Learning; Optical character recognition software; Standards development; Text analysis

fLanguage

English

Publisher

IEEE

Conference_Titel

Document Analysis and Recognition, 1997., Proceedings of the Fourth International Conference on

Conference_Location

Ulm

Print_ISBN

0-8186-7898-4

Type

conf

DOI

10.1109/ICDAR.1997.620671

Filename

620671