• DocumentCode
    3486471
  • Title

    Extraction of Spelling Variations from Language Structure for Noisy Text Correction

  • Author

    Gerdjikov, Stefan ; Mihov, Stoyan ; Nenchev, Vladislav

  • Author_Institution
    Fac. of Math. & Inf., Sofia Univ., Sofia, Bulgaria
  • fYear
    2013
  • fDate
    25-28 Aug. 2013
  • Firstpage
    324
  • Lastpage
    328
  • Abstract
    We describe a novel approach for the extraction of spelling variations from a list of instances. It relates distinctive infixes to distinctive infixes of referenced words. The distinctive infixes are extracted automatically from a (multi)set of instances and a referenced dictionary without any additional expert knowledge. Based on the spelling variations retrieved during a learning (training) phase, we develop a correction algorithm which suggests and ranks candidates for a particular noisy word. The main advantage of our approach is that it provides good corrections for unobserved noisy words while being almost perfect on words observed during learning. Our experimental results on the normalisation of a typical reference corpus of Early Modern English letters [1] significantly improve over previous results of VARD2 [2]. We also achieve better results than those reported in [3] and [4] on the OCR-correction of the TREC-5 Confusion Track corpus [5].
  • Keywords
    document image processing; image denoising; natural language processing; spelling aids; text analysis; Early Modern English letters; OCR-correction; TREC-5 confusion track corpus; VARD2; automatic distinctive infix extraction; correction algorithm; expert knowledge; language structure; learning phase; noisy text correction; reference corpus; referenced dictionary; spelling variation extraction; Approximation methods; Dictionaries; Educational institutions; Hidden Markov models; Noise measurement; Training; Upper bound; finite state automata; noisy texts correction; spelling variations;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
  • Conference_Location
    Washington, DC
  • ISSN
    1520-5363
  • Type

    conf

  • DOI
    10.1109/ICDAR.2013.72
  • Filename
    6628637