DocumentCode :
397296
Title :
Indexing and retrieval of words in old documents
Author :
Marinai, Simone ; Marino, Emanuele ; Soda, Giovanni
Author_Institution :
Florence Univ., Italy
fYear :
2003
fDate :
3-6 Aug. 2003
Firstpage :
223
Abstract :
This paper describes a system for efficient indexing and retrieval of words in collections of document images. The proposed method is based on two main principles: unsupervised prototype clustering, and string encoding for efficient string matching. During indexing, a self-organizing map (SOM) is trained to cluster similar symbols (character-like objects) in a subset of the documents to be stored. Using the trained SOM, each word in the whole collection is stored as a fixed-length description that can be easily compared, so that the most similar words can be ranked in response to a user query. The system can be automatically adapted to different languages and font styles. It is best suited to processing old documents (18th and 19th centuries), with which current OCR systems have more difficulty. Experimental results cover three application scenarios with various levels of difficulty for current OCR systems.
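The pipeline outlined in the abstract (cluster symbols with a SOM, encode each word as a fixed-length description, rank stored words against a query) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the one-dimensional SOM, the 2-D synthetic "symbol features", the nearest-index resampling used to reach a fixed length, and the function names `train_som`, `encode_word`, `code_distance` and `rank_words`. The paper's actual features and encoding may differ.

```python
import math
import random


def nearest_unit(protos, x):
    """Index of the prototype closest to feature vector x (squared Euclidean)."""
    return min(range(len(protos)),
               key=lambda j: sum((p - v) ** 2 for p, v in zip(protos[j], x)))


def train_som(samples, n_units, epochs=40, seed=0):
    """Train a 1-D SOM (an assumed topology) and return its prototype vectors."""
    rng = random.Random(seed)
    protos = [list(rng.choice(samples)) for _ in range(n_units)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = 0.5 * decay                           # learning rate shrinks over time
        sigma = max(0.5, (n_units / 2.0) * decay)  # neighbourhood width shrinks too
        for x in rng.sample(samples, len(samples)):
            bmu = nearest_unit(protos, x)
            for j, proto in enumerate(protos):
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                for d in range(len(proto)):
                    proto[d] += lr * h * (x[d] - proto[d])
    return protos


def encode_word(symbols, protos, length):
    """Map each symbol to its SOM unit, then resample to a fixed length.

    Assumes a non-empty symbol list; the resampling scheme is hypothetical.
    """
    units = [nearest_unit(protos, s) for s in symbols]
    return [units[i * len(units) // length] for i in range(length)]


def code_distance(a, b):
    """Sum of unit-index gaps; neighbouring SOM units count as similar symbols."""
    return sum(abs(u - v) for u, v in zip(a, b))


def rank_words(query_code, index):
    """Return (word, distance) pairs, most similar first."""
    return sorted(((w, code_distance(query_code, c)) for w, c in index.items()),
                  key=lambda pair: pair[1])


# Synthetic data: three "symbol classes" as noisy points in a 2-D feature space.
rng = random.Random(1)
centres = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 1.0)}


def glyph(ch):
    cx, cy = centres[ch]
    return (cx + rng.gauss(0, 0.05), cy + rng.gauss(0, 0.05))


words = {w: [glyph(ch) for ch in w] for w in ["abc", "cab", "abcab"]}
training = [g for symbols in words.values() for g in symbols]
protos = train_som(training, n_units=4)
index = {w: encode_word(symbols, protos, 6) for w, symbols in words.items()}
results = rank_words(index["cab"], index)
```

Because every word is reduced to a code of the same length, comparing a query with a stored word costs a single pass over that code, which is the property that makes ranking over a large collection inexpensive in a scheme of this kind.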
Keywords :
character recognition; document image processing; feature extraction; image matching; indexing; information retrieval; OCR; SOM; optical character recognition; prototype clustering; self organizing map; string encoding; string matching; text recognition; word indexing; word retrieval; Content based retrieval; Data mining; Encoding; Image retrieval; Indexing; Internet; Optical character recognition software; Organizing; Prototypes; Software libraries;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on
Print_ISBN :
0-7695-1960-1
Type :
conf
DOI :
10.1109/ICDAR.2003.1227663
Filename :
1227663