Title :
Word segmentation of handwritten dates in historical documents by combining semantic a-priori-knowledge with local features
Author :
Feldbach, Markus ; Tönnies, Klaus D.
Author_Institution :
Dept. of Simulation & Graphics, Otto-von-Guericke Univ., Magdeburg, Germany
Abstract :
The recognition of script in historical documents requires suitable techniques for identifying single words. Segmentation of lines and words is a challenging task because lines are not straight and words may intersect within and between lines. For correct word segmentation, the conventional analysis of distances between text objects needs to be supplemented by a second component that predicts possible word boundaries based on semantic information. For date entries, hypotheses about potential boundaries are generated from knowledge about the different ways in which dates are written in the documents. This knowledge is modeled by distribution curves for potential boundary locations. Word boundaries are detected by classifying local features, such as distances between adjacent text objects, together with the location-based boundary distribution curves as a-priori knowledge. We applied the technique to date entries in historical church registers. Documents from the 18th and 19th centuries were used for training and testing. The data set consisted of 674 word boundaries in 298 date entries. In 97% of all cases in the test data set, our algorithm found the correct separation among the four best hypotheses for a word sequence.
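The combination described in the abstract, a local gap feature weighted by a location-based prior, with the best few hypotheses retained, might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the logistic gap score, the Gaussian-shaped prior curve, the Bayes-style combination, and every name and parameter value (`gap_likelihood`, `location_prior`, `peaks`, `midpoint`, `sigma`, and so on) are hypothetical.

```python
import math
from itertools import product


def gap_likelihood(gap_width, midpoint=12.0, steepness=0.4):
    """Hypothetical local-feature score: wider gaps look more like word boundaries."""
    return 1.0 / (1.0 + math.exp(-steepness * (gap_width - midpoint)))


def location_prior(position, peaks, sigma=0.08):
    """Hypothetical boundary distribution curve over a normalized position in [0, 1].

    `peaks` stands in for the positions where boundaries are expected for a
    given way of writing a date (assumed values, not taken from the paper).
    """
    best = max(math.exp(-0.5 * ((position - p) / sigma) ** 2) for p in peaks)
    return 0.05 + 0.9 * best  # keep the prior strictly inside (0, 1)


def boundary_posterior(gap_width, position, peaks):
    """Combine the local score and the location prior with a Bayes-style update."""
    like = gap_likelihood(gap_width)
    prior = location_prior(position, peaks)
    num = like * prior
    return num / (num + (1.0 - like) * (1.0 - prior))


def rank_hypotheses(gaps, peaks, top_k=4):
    """Score every boundary/no-boundary labeling of the gaps and keep the best top_k.

    `gaps` is a list of (gap_width, normalized_position) pairs. Gaps are treated
    as independent, which is a simplifying assumption of this sketch.
    """
    probs = [boundary_posterior(w, x, peaks) for w, x in gaps]
    scored = []
    for labels in product([False, True], repeat=len(gaps)):
        score = 1.0
        for p, is_boundary in zip(probs, labels):
            score *= p if is_boundary else (1.0 - p)
        scored.append((score, labels))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Made-up gaps for one date entry: (width in pixels, normalized position).
example_gaps = [(4.0, 0.10), (20.0, 0.33), (6.0, 0.50), (18.0, 0.66)]
example_peaks = [0.33, 0.66]
top_hypotheses = rank_hypotheses(example_gaps, example_peaks)
```

Keeping several ranked hypotheses rather than a single decision mirrors the evaluation reported in the abstract, where the correct separation is sought among the four best hypotheses. Exhaustive enumeration is only practical for the small number of gaps found in a single date entry.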
Keywords :
document handling; handwritten character recognition; image segmentation; records management; text analysis; church registers; date entries; handwritten dates; historical documents; line segmentation; location-based boundary distribution curves; old documents; potential boundary locations; script recognition; semantic a priori knowledge; semantical information; single word identification; text object distance analysis; word boundaries; word segmentation; word sequence; Computational modeling; Computer graphics; Computer simulation; Computer vision; Data mining; Information analysis; Object detection; Registers; Testing; Text analysis;
Conference_Title :
Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on
Print_ISBN :
0-7695-1960-1
DOI :
10.1109/ICDAR.2003.1227684