A scale space approach for automatically segmenting words from historical handwritten documents

Author

Manmatha, R. ; Rothfeder, Jamie L.

Author_Institution

Dept. of Comput. Sci., Massachusetts Univ., Amherst, MA, USA

Volume

Issue

fYear

2005

Firstpage

1212

Lastpage

1225

Abstract

Many libraries, museums, and other organizations hold large collections of handwritten historical documents, for example, the papers of early presidents like George Washington at the Library of Congress. The first step in providing recognition/retrieval tools is to automatically segment handwritten pages into words. State-of-the-art segmentation techniques like the gap metrics algorithm have mostly been developed and tested on highly constrained documents like bank checks and postal addresses. There has been little work on full handwritten pages, and this work has usually involved testing on clean artificial documents created for the purpose of research. Historical manuscript images, on the other hand, contain a great deal of noise and are much more challenging. Here, a novel scale space algorithm for automatically segmenting handwritten (historical) documents into words is described. First, the page is cleaned to remove margins. This is followed by a gray-level projection profile algorithm for finding lines in images. Each line image is then filtered with an anisotropic Laplacian at several scales. This procedure produces blobs which correspond to portions of characters at small scales and to words at larger scales. Crucial to the algorithm is scale selection, that is, finding the optimum scale at which blobs correspond to words. This is done by finding the maximum over scale of the extent or area of the blobs. This scale maximum is estimated using three different approaches. The blobs recovered at the optimum scale are then bounded with a rectangular box to recover the words. A postprocessing filtering step is performed to eliminate boxes of unusual size which are unlikely to correspond to words. The approach is tested on a number of different data sets and it is shown that, on 100 sampled documents from the George Washington corpus of handwritten document images, a total error rate of 17 percent is observed.
The technique outperforms a state-of-the-art gap metrics word-segmentation algorithm on this collection.
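
The filtering and scale-selection steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the aspect ratio `eta` between the horizontal and vertical Gaussian widths, the rule that a blob is a connected region of negative filter response, and the use of total blob area as the quantity maximized over scale are all assumptions made for illustration. The abstract does not give these details, and the three scale-estimation approaches it mentions are not reproduced. Margin removal, line finding, and the final size-based filtering of boxes are omitted. The input is assumed to be a single line image stored as a list of rows, with ink pixels high and background zero.

```python
import math

def gaussian_kernel(sigma, order):
    """Sampled 1-D Gaussian (order 0) or its second derivative (order 2)."""
    radius = max(1, int(math.ceil(3 * sigma)))
    kernel = []
    for t in range(-radius, radius + 1):
        g = math.exp(-t * t / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)
        if order == 2:
            g *= (t * t - sigma * sigma) / sigma ** 4
        kernel.append(g)
    return kernel

def convolve_1d(signal, kernel):
    """Same-size filtering with zero padding (the kernels here are symmetric)."""
    radius, n = len(kernel) // 2, len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < n:
                acc += w * signal[j]
        out.append(acc)
    return out

def filter_separable(image, kx, ky):
    """Apply kx along each row, then ky along each column."""
    rows = [convolve_1d(row, kx) for row in image]
    cols = [convolve_1d(list(col), ky) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def anisotropic_log(image, sigma_y, eta):
    """Gxx + Gyy with sigma_x = eta * sigma_y (eta is an assumed parameter)."""
    sigma_x = eta * sigma_y
    gxx = filter_separable(image, gaussian_kernel(sigma_x, 2), gaussian_kernel(sigma_y, 0))
    gyy = filter_separable(image, gaussian_kernel(sigma_x, 0), gaussian_kernel(sigma_y, 2))
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(gxx, gyy)]

def blob_boxes(mask):
    """Boxes (top, left, bottom, right) and total area of 4-connected True regions."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes, area = [], 0
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            stack = [(r, c)]
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while stack:
                y, x = stack.pop()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes, area

def segment_line(ink, scales, eta=3.0):
    """Return (chosen sigma_y, boxes) for the candidate scale with the largest blob area."""
    best = None
    for sigma_y in scales:
        response = anisotropic_log(ink, sigma_y, eta)
        # Assumed blob rule: negative response marks ink-like regions.
        mask = [[v < 0 for v in row] for row in response]
        boxes, area = blob_boxes(mask)
        if best is None or area > best[0]:
            best = (area, sigma_y, boxes)
    return best[1], best[2]
```

Calling `segment_line(ink, [1.0, 2.0, 3.0])` on a line image returns the selected vertical scale and one bounding box per blob at that scale. The horizontal width is made larger than the vertical one so that neighboring characters merge into word-sized blobs before adjacent text lines would. A library routine for Gaussian derivative filtering would normally replace the hand-written convolution.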

Keywords

document image processing; handwritten character recognition; image segmentation; gray-level projection profile algorithm; anisotropic Laplacian; state of the art segmentation techniques; automatically segmenting words; gap metrics algorithm; historical handwritten documents; post processing filtering step; retrieval tools; scale space approach; Anisotropic magnetoresistance; Computer Society; Error analysis; Filtering; Handwriting recognition; Image segmentation; Laplace equations; Libraries; Testing; Text analysis; Segmentation; document analysis; document and text processing; document indexing; handwriting analysis; optical character recognition; smoothing; Abstracting and Indexing as Topic; Algorithms; Archaeology; Artificial Intelligence; Automatic Data Processing; Computer Graphics; Databases, Factual; Handwriting; Image Enhancement; Image Interpretation, Computer-Assisted; Information Storage and Retrieval; Models, Statistical; Natural Language Processing; Numerical Analysis, Computer-Assisted; Pattern Recognition, Automated; Reading; Reproducibility of Results; Sensitivity and Specificity; Signal Processing, Computer-Assisted; Subtraction Technique; User-Computer Interface

fLanguage

English

Journal_Title

Pattern Analysis and Machine Intelligence, IEEE Transactions on

Publisher

IEEE

ISSN

0162-8828

Type

jour

DOI

10.1109/TPAMI.2005.150

Filename

1453510

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=939110