DocumentCode :
3022237
Title :
Fast optical character recognition through glyph hashing for document conversion
Author :
Chellapilla, Kumar ; Simard, Patrice ; Nickolov, Radoslav
Author_Institution :
Microsoft Res., Redmond, WA, USA
fYear :
2005
fDate :
29 Aug.-1 Sept. 2005
Firstpage :
829
Abstract :
This paper proposes a glyph hashing approach to optical character recognition with applications in document conversion. The viability and efficiency of the approach are tested through its implementation in a print driver on 68,987 PDF documents containing 1.15 billion characters. Results indicate that (a) a hash table with 3.2 million hashes is sufficient to represent all characters from these documents, and (b) 480 fonts are sufficient to cover over 90% of these documents. Glyph recognition experiments indicate that 80% of unique character glyphs and over 96% of all characters from unseen documents can be found in a hash table built using all 68,987 documents. The hashing approach is used to recognize not only the character codes but also the size, style (bold, italic, etc.), and font name. We found that the hashing approach can scale to hundreds of fonts and thousands of characters per font. Further, it is extremely fast and can recognize over 100,000 characters per second. Owing to its speed, such a hashing approach can complement any existing OCR system by acting as a pre-filter to produce a 4- to 5-fold speedup during document conversion.
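The pre-filter idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the hash function or the table layout, so everything below is an assumption for illustration: the `Bitmap` tuple, the use of SHA-1 over the raw pixels, and the names `GlyphInfo`, `GlyphTable`, `glyph_key`, and `recognize` are all hypothetical. Exact-match hashing of this kind presumes that the same glyph always yields the same bitmap, which is plausible for glyphs rendered by a print driver but not for scanned images.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical glyph representation: (width, height, raw pixel bytes).
Bitmap = Tuple[int, int, bytes]


@dataclass(frozen=True)
class GlyphInfo:
    """What a table hit returns: character code plus font attributes."""
    char: str
    font: str
    size_pt: float
    style: str


def glyph_key(bitmap: Bitmap) -> bytes:
    """Hash the glyph dimensions and pixels into a fixed-size key.

    SHA-1 is an assumption made for this sketch, not the paper's hash.
    """
    width, height, pixels = bitmap
    h = hashlib.sha1()
    h.update(width.to_bytes(2, "big"))
    h.update(height.to_bytes(2, "big"))
    h.update(pixels)
    return h.digest()


class GlyphTable:
    """Maps glyph hashes to the character and font they were rendered from."""

    def __init__(self) -> None:
        self._table: Dict[bytes, GlyphInfo] = {}

    def add(self, bitmap: Bitmap, info: GlyphInfo) -> None:
        self._table[glyph_key(bitmap)] = info

    def lookup(self, bitmap: Bitmap) -> Optional[GlyphInfo]:
        return self._table.get(glyph_key(bitmap))


def recognize(
    bitmap: Bitmap,
    table: GlyphTable,
    fallback_ocr: Callable[[Bitmap], str],
) -> str:
    """Try the hash table first; call the slower OCR engine only on a miss."""
    hit = table.lookup(bitmap)
    if hit is not None:
        return hit.char
    return fallback_ocr(bitmap)
```

In this arrangement the conventional OCR engine runs only on glyphs absent from the table, which is the sense in which the hash lookup acts as a pre-filter.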
Keywords :
character sets; document image processing; optical character recognition; table lookup; PDF document; character glyph; document conversion; glyph hashing; hash table; Character recognition; Displays; Image converters; Image databases; Image retrieval; Information retrieval; Optical character recognition software; Optical distortion; Testing; Text processing
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on
ISSN :
1520-5263
Print_ISBN :
0-7695-2420-6
Type :
conf
DOI :
10.1109/ICDAR.2005.110
Filename :
1575661