Title :
A statistical refinement method for word shape token querying of document images
Author :
O'Connor, Jerh ; Smeaton, Alan F.
Author_Institution :
Sch. of Comput. Applications, Dublin City Univ., Ireland
Abstract :
Word Shape Tokens (WSTs) are tokens used to represent words based on the overall shape or contour of a word as it appears in printed text. A character shape code (CSC) mapping function is used to aggregate similarly shaped letters such as “g” and “y” into a single code that represents those letters. The rationale is that it is far easier and more accurate to map a scanned image of a word or letter into its WST representation than into its full ASCII representation. In previous work we showed that user-mediated selection of WSTs for querying document images improved system performance. In the work reported here we use a statistically derived dataset to help determine whether or not a particular WST from a scanned document image actually matches a query term WST. We do this by comparing the WSTs that precede and follow each WST in a document against previously collected frequency data for a large set of WST occurrences.
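The two ideas in the abstract, a CSC mapping and a neighbour-based frequency check, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the shape classes (ascender, descender, x-height), the code letters "A", "g" and "x", the boundary markers, and the function names csc, wst, context_counts and context_score are all assumptions made for illustration. The only detail taken from the abstract is that "g" and "y" share one code.

```python
from collections import Counter

# Hypothetical shape classes; the paper's actual CSC alphabet may differ.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def csc(ch):
    """Map one character to an illustrative character shape code."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"  # tall glyphs
    if ch in DESCENDERS:
        return "g"  # glyphs that drop below the baseline ("g" and "y" merge)
    if ch in "ij":
        return ch   # dotted glyphs keep their own code in this sketch
    if ch.isalpha():
        return "x"  # remaining x-height glyphs
    return ch       # punctuation and anything else passes through


def wst(word):
    """Build the word shape token for a word, one code per character."""
    return "".join(csc(c) for c in word)


def context_counts(words):
    """Count (previous WST, WST, next WST) triples over a reference text."""
    tokens = [wst(w) for w in words]
    counts = Counter()
    for i, tok in enumerate(tokens):
        prev_tok = tokens[i - 1] if i > 0 else "<s>"
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
        counts[(prev_tok, tok, next_tok)] += 1
    return counts


def context_score(counts, prev_tok, tok, next_tok):
    """Fraction of a WST's recorded occurrences that had these neighbours."""
    total = sum(n for (_, t, _), n in counts.items() if t == tok)
    return counts[(prev_tok, tok, next_tok)] / total if total else 0.0
```

In this sketch "dog" and "day" collapse to the same token, which is the kind of ambiguity the neighbouring WSTs are meant to resolve: a candidate whose recorded contexts rarely include the observed neighbours would receive a low score.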
Keywords :
document image processing; optical character recognition; visual databases; ASCII representation; character shape code mapping function; contour; document images; similarly shaped letters; statistical refinement method; statistically derived dataset; system performance; user-mediated selection; word shape token querying; Aggregates; Frequency; Optical character recognition software; Search engines; Shape;
Conference_Title :
Database and Expert Systems Applications, 1999. Proceedings. Tenth International Workshop on
Conference_Location :
Florence
Print_ISBN :
0-7695-0281-4
DOI :
10.1109/DEXA.1999.795248