DocumentCode :
2148446
Title :
Searching OCR'ed Text: An LDA Based Approach
Author :
Hassan, Ehtesham ; Garg, Vikram ; Haque, S. K Mirajul ; Chaudhury, Santanu ; Gopal, M.
Author_Institution :
Dept. of Electr. Eng., Indian Inst. of Technol. Delhi, New Delhi, India
fYear :
2011
fDate :
18-21 Sept. 2011
Firstpage :
1210
Lastpage :
1214
Abstract :
Indexing and retrieval performance over a digitized document collection depends significantly on the performance of the available Optical Character Recognition (OCR). This paper presents a novel document indexing framework that accounts for document digitization errors during indexing in order to improve overall retrieval accuracy. The proposed framework is based on topic modeling using Latent Dirichlet Allocation (LDA). The OCR's confidence in correctly recognizing a symbol is propagated into the topic learning process so that the semantic grouping of word examples carefully distinguishes between commonly confused words. We present a novel application of Lucene with topic modeling for document indexing. The proposed framework is evaluated experimentally on a document collection in Devanagari script.
Keywords :
document image processing; information retrieval; learning (artificial intelligence); optical character recognition; Devanagari script; LDA based approach; Lucene; OCR text searching; digitized document collection; document indexing framework; latent Dirichlet allocation; retrieval performance; semantic word grouping; topic learning process; Character recognition; Indexing; Optical character recognition software; Resource management; Semantics; Vectors; Vocabulary; Document Retrieval; Latent Dirichlet Allocation; Optical Character Recognition
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
ISSN :
1520-5363
Print_ISBN :
978-1-4577-1350-7
Electronic_ISBN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2011.244
Filename :
6065502