DocumentCode :
2144935
Title :
An Impact of OCR Errors on Automated Classification of OCR Japanese Texts with Parts-of-Speech Analysis
Author :
Kokawa, Akihiro ; Busagala, Lazaro S P ; Ohyama, Wataru ; Wakabayashi, Tetsushi ; Kimura, Fumitaka
Author_Institution :
Grad. Sch. of Eng., Mie Univ., Tsu, Japan
fYear :
2011
fDate :
18-21 Sept. 2011
Firstpage :
543
Lastpage :
547
Abstract :
The technology of Optical Character Recognition (OCR) is used to generate texts in the process of digitizing print documents. Usually these texts need to be indexed and organized to simplify their access and retrieval. One of the powerful approaches in accomplishing this task is the use of Automated Text Classification. However, it is currently impossible for OCR technology to recognize all characters with an accuracy of 100%. We therefore propose the use of combined linguistic features in automated classification of OCR texts to formulate an informative feature set. The proposed method was experimentally evaluated using Japanese OCR texts. Empirical results indicate that the combination of linguistic features improved classification performance of OCR texts.
Keywords :
information retrieval; optical character recognition; pattern classification; set theory; speech processing; text analysis; Japanese OCR text; OCR error; OCR technology; OCR text classification performance; automated text classification; informative feature set; linguistic feature; optical character recognition; parts-of-speech analysis; print document digitization; Equations; Feature extraction; Optical character recognition software; Pragmatics; Support vector machines; Text categorization; Vectors; Combined Linguistic features; Feature generation; Feature transformation; OCR Japanese text classification or categorization; Parts of speech analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
ISSN :
1520-5363
Print_ISBN :
978-1-4577-1350-7
Electronic_ISBN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2011.115
Filename :
6065370