Title :
Evaluating OCR and non-OCR text representations for learning document classifiers
Author :
Junker, Markus ; Hoch, Rainer
Author_Institution :
Res. Center for Artificial Intelligence, Kaiserslautern, Germany
Abstract :
In the literature, many feature types and learning algorithms have been proposed for document classification. However, an extensive and systematic evaluation of the various approaches has not yet been carried out. In order to investigate different text representations for document classification, we have developed a tool that transforms documents into feature-value representations suitable for standard learning algorithms. In this paper, we investigate seven document representations for German texts based on n-grams and single words. We compare their effectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that the use of n-grams is an attractive technique that can even compete with techniques relying on morphological analysis. This holds for OCR texts as well as for correct ASCII texts.
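As a rough illustration of what an n-gram based feature-value representation can look like, the following minimal sketch maps a text to counts of character n-grams. It is not the authors' tool: the function names `char_ngrams` and `to_feature_values`, the choice of n = 3, the lowercasing, and the `_` word-boundary padding are all assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of each word in text.

    Assumption for this sketch: words are lowercased, split on whitespace,
    and padded with '_' on both sides to mark word boundaries.
    """
    grams = []
    for word in text.lower().split():
        padded = "_" + word + "_"
        # Words whose padded form is shorter than n contribute no n-grams.
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def to_feature_values(text, n=3):
    """Map a text to a feature-value dict: n-gram -> occurrence count."""
    return dict(Counter(char_ngrams(text, n)))


# A single OCR character error changes only some of a word's n-grams,
# so the clean and the noisy form still share features.
clean = set(char_ngrams("Rechnung"))
noisy = set(char_ngrams("Rechnuug"))
shared = clean & noisy
```

In this example the clean word and its misrecognized variant still have several trigrams in common, whereas a single-word representation would treat them as two unrelated features. This is one plausible reason why n-grams are of interest for OCR text.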
Keywords :
abstracting; business forms; data structures; document image processing; image classification; learning systems; nomograms; optical character recognition; ASCII texts; German texts; OCR text representations; abstracts; business letters; document classification; feature types; feature-value representations; learning document classifiers; morphological analysis; n-grams; non-OCR text representations; single words; technical reports; Artificial intelligence; Business communication; Character recognition; Electronic mail; Frequency; Image recognition; Learning; Optical character recognition software; Standards development; Text analysis
Conference_Title :
Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1997
Conference_Location :
Ulm
Print_ISBN :
0-8186-7898-4
DOI :
10.1109/ICDAR.1997.620671