DocumentCode :
969580
Title :
Font adaptive word indexing of modern printed documents
Author :
Marinai, S. ; Marino, E. ; Soda, G.
Author_Institution :
Dipt. di Sistemi e Inf., Univ. di Firenze
Volume :
28
Issue :
8
fYear :
2006
Firstpage :
1187
Lastpage :
1199
Abstract :
We propose an approach for the word-level indexing of modern printed documents which are difficult to recognize using current OCR engines. By means of word-level indexing, it is possible to retrieve the position of words in a document, enabling queries involving proximity of terms. Web search engines implement this kind of indexing, allowing users to retrieve Web pages on the basis of their textual content. Nowadays, digital libraries hold collections of digitized documents that can be retrieved either by browsing the document images or relying on appropriate metadata assembled by domain experts. Word indexing tools would therefore increase the access to these collections. The proposed system is designed to index homogeneous document collections by automatically adapting to different languages and font styles without relying on OCR engines for character recognition. The approach is based on three main ideas: the use of self organizing maps (SOM) to perform unsupervised character clustering, the definition of one suitable vector-based word representation whose size depends on the word aspect-ratio, and the run-time alignment of the query word with indexed words to deal with broken and touching characters. The most appropriate applications are for processing modern printed documents (17th to 19th centuries) where current OCR engines are less accurate. Our experimental analysis addresses six data sets containing documents ranging from books of the 17th century to contemporary journals.
Keywords :
character recognition; character sets; digital libraries; document image processing; image retrieval; indexing; pattern clustering; self-organising feature maps; word processing; OCR engines; Web pages; Web search engines; character recognition; digital libraries; digitized documents; document images; domain experts; font adaptive word-level indexing; font styles; index homogeneous document collections; modern printed documents; query word; run-time alignment; self organizing maps; textual content; unsupervised character clustering; vector-based word representation; word aspect-ratio; word position retrieval; Assembly; Character recognition; Content based retrieval; Image retrieval; Indexing; Optical character recognition software; Search engines; Software libraries; Web pages; Web search; Clustering; digital libraries; document image retrieval; heuristic oversegmentation; holistic word representation; modern documents; self organizing map; Abstracting and Indexing as Topic; Algorithms; Artificial Intelligence; Automatic Data Processing; Computer Graphics; Documentation; Image Enhancement; Image Interpretation, Computer-Assisted; Information Storage and Retrieval; Libraries, Digital; Natural Language Processing; Pattern Recognition, Automated; Publishing; Reproducibility of Results; Semantics; Sensitivity and Specificity; Signal Processing, Computer-Assisted; Subtraction Technique; User-Computer Interface; Vocabulary, Controlled
fLanguage :
English
Journal_Title :
Pattern Analysis and Machine Intelligence, IEEE Transactions on
Publisher :
IEEE
ISSN :
0162-8828
Type :
jour
DOI :
10.1109/TPAMI.2006.162
Filename :
1642655