Title :
Learning of Pattern-Based Rules for Document Classification
Author :
Dengel, Andreas R.
Author_Institution :
Univ. of Kaiserslautern, Kaiserslautern
Abstract :
Automatic processing of office documents, such as orders, invoices, or offers, holds significant potential for saving costs. Because such domains have a high percentage of special vocabulary, purely statistical approaches fail in automatic classification. The inherent structure and short text of these documents require specific approaches. We propose a rule-based method to classify mixed stacks of documents into a set of hierarchically organized classes. Rules are learned by extracting patterns of different types from a document sample. The paper focuses on the architecture and on the learning process, presents results in comparison with other techniques, and gives an outlook on how to further improve the system.
Keywords :
document image processing; image classification; knowledge based systems; learning (artificial intelligence); document classification; office documents; pattern-based rules; rule-based method; Cost function; Delay effects; Dispatching; Filtering; Optical character recognition software; Postal services; Routing; Text analysis; Vocabulary; Voting;
Conference_Title :
Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
Conference_Location :
Parana
Print_ISBN :
978-0-7695-2822-9
DOI :
10.1109/ICDAR.2007.4378688