Title :
Word Retrieval in Historical Document Using Character-Primitives
Author :
Roy, Partha Pratim ; Ramel, Jean-Yves ; Ragot, Nicolas
Author_Institution :
Lab. d'Inf., Univ. Francois Rabelais, Tours, France
Abstract :
Word searching and indexing in historical document collections is a challenging problem because characters in these documents are often touching or broken due to degradation and ageing effects. For efficient searching in such historical documents, this paper presents a novel approach to word spotting using string matching of character primitives. We describe a text string as a sequence of primitives, each of which consists of a single character or a part of a character. Primitive segmentation is performed by analyzing text background information obtained with the water reservoir technique. Next, the primitives are clustered using template matching, and a codebook of representative primitives is built. Using this primitive codebook, the text information in the document images is encoded and stored. A query word is segmented into primitives and encoded as a string of representative primitives from the codebook. Finally, approximate string matching is applied to find similar words, and the matching similarity is used to rank the retrieved words. The proposed method is tested on historical books written in the French alphabet, and the experiments give encouraging results.
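The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that every word has already been encoded as a list of integer codebook indices, and it uses plain Levenshtein distance with unit costs as the approximate string matching. The abstract does not specify the actual cost function, so the unit costs, the length normalization, and the names `edit_distance`, `similarity`, and `rank_words` are illustrative assumptions rather than the authors' method.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance between two sequences of codebook indices.

    Assumption: unit cost for insertion, deletion, and substitution.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x from a
                            curr[j - 1] + 1,      # insert y into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Map the distance to [0, 1]; 1.0 means identical primitive strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def rank_words(query: Sequence[int],
               index: Dict[str, List[int]],
               top_k: int = 5) -> List[Tuple[str, float]]:
    """Rank indexed words by similarity to the encoded query word."""
    scored = [(word_id, similarity(query, codes))
              for word_id, codes in index.items()]
    # Highest similarity first; ties broken by word id for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical index: word id -> primitive codes from the codebook.
index = {"w1": [3, 7, 7, 1], "w2": [3, 7, 1], "w3": [9, 9]}
ranking = rank_words([3, 7, 7, 1], index)
```

Because a touching or broken character changes only a few primitives in the encoded string, an edit-distance style score of this kind tolerates small segmentation differences between the query and the indexed word.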
Keywords :
document handling; information retrieval; natural language processing; French alphabets; Primitive segmentation; character primitives; document images; historical document collections; primitive codebook; template matching; text string; water reservoir technique; word retrieval; word searching; Image segmentation; Indexing; Layout; Reservoirs; Shape; Text analysis;
Conference_Title :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4577-1350-7
Electronic_ISBN :
1520-5363
DOI :
10.1109/ICDAR.2011.142