DocumentCode
3486471
Title
Extraction of Spelling Variations from Language Structure for Noisy Text Correction
Author
Gerdjikov, Stefan ; Mihov, Stoyan ; Nenchev, Vladislav
Author_Institution
Fac. of Math. & Inf., Sofia Univ., Sofia, Bulgaria
fYear
2013
fDate
25-28 Aug. 2013
Firstpage
324
Lastpage
328
Abstract
We describe a novel approach for the extraction of spelling variations from a list of instances. It relates distinctive infixes to distinctive infixes of referenced words. The distinctive infixes are extracted automatically from a (multi)set of instances and a referenced dictionary without any additional expert knowledge. Based on the spelling variations retrieved during a learning (training) phase, we develop a correction algorithm which suggests and ranks candidates for a particular noisy word. The main advantage of our approach is that it provides good corrections for unobserved noisy words while remaining almost perfect on words observed during learning. Our experimental results on the normalisation of a typical reference corpus of Early Modern English letters [1] significantly improve over previous results of VARD2 [2]. We also achieve better results than those reported in [3] and [4] on the OCR correction of the TREC-5 Confusion Track corpus [5].
Keywords
document image processing; image denoising; natural language processing; spelling aids; text analysis; Early Modern English letters; OCR-correction; TREC-5 confusion track corpus; VARD2; automatic distinctive infix extraction; correction algorithm; expert knowledge; language structure; learning phase; noisy text correction; reference corpus; referenced dictionary; spelling variation extraction; Approximation methods; Dictionaries; Educational institutions; Hidden Markov models; Noise measurement; Training; Upper bound; finite state automata; noisy texts correction; spelling variations;
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location
Washington, DC
ISSN
1520-5363
Type
conf
DOI
10.1109/ICDAR.2013.72
Filename
6628637