  • DocumentCode
    1995514
  • Title
    Text normalization in code-mixed social media text
  • Author
    Dutta, Sukanya ; Saha, Tista ; Banerjee, Somnath ; Naskar, Sudip Kumar
  • Author_Institution
    Dept. of Comput. Sci. & Eng., Jadavpur Univ., Kolkata, India
  • fYear
    2015
  • fDate
    9-11 July 2015
  • Firstpage
    378
  • Lastpage
    382
  • Abstract
    This paper addresses text normalization, an often overlooked problem in natural language processing, in code-mixed social media text. The objective of the work is to correct English spelling errors in code-mixed social media text that contains English words as well as Romanized transliterations of words from another language, in this case Bangla. This also requires solving a second problem: word-level language identification in code-mixed social media text. We employ a CRF-based machine learning approach followed by post-processing heuristics for the word-level language identification task. For spelling correction, we use the noisy channel model. In addition, the spell checker presented here handles wordplay, contracted words and phonetic variations. Overall, word-level language identification achieved 90.5% accuracy, and the spell checker achieved 69.43% accuracy on the detected English words.
  • Keywords
    learning (artificial intelligence); natural language processing; social networking (online); text analysis; CRF based machine learning; English spelling errors; code-mixed social media text; romanized transliteration; text normalization; word-level language identification; Accuracy; Channel models; Dictionaries; Electronic mail; Matrices; Media; Noise measurement; code-mixed text; language identification; spell checking
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Recent Trends in Information Systems (ReTIS), 2015 IEEE 2nd International Conference on
  • Conference_Location
    Kolkata
  • Type
    conf
  • DOI
    10.1109/ReTIS.2015.7232908
  • Filename
    7232908
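
The abstract names the noisy channel model as the basis for spelling correction. As a rough illustration only, the idea can be sketched as picking the candidate c that maximizes P(c) * P(w | c) for an observed word w. Everything below is an assumption for the sake of the sketch and is not taken from the paper: the toy word counts, the flat channel probabilities, the restriction to candidates one edit away, and the names WORD_COUNTS, edits1, channel_prob and correct.

```python
from collections import Counter

# Hypothetical toy unigram counts standing in for a real English corpus.
WORD_COUNTS = Counter({"good": 50, "night": 30, "god": 10, "food": 20, "the": 100})
TOTAL = sum(WORD_COUNTS.values())
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return all strings one deletion, transposition, substitution or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def channel_prob(observed, candidate):
    """Assumed flat channel model P(observed | candidate); a real system would
    estimate per-edit probabilities from error data."""
    return 0.95 if observed == candidate else 0.05


def correct(word):
    """Pick the known candidate maximizing P(candidate) * P(word | candidate)."""
    candidates = {c for c in edits1(word) | {word} if c in WORD_COUNTS}
    if not candidates:
        return word  # nothing known nearby: leave the token unchanged
    return max(
        candidates,
        key=lambda c: (WORD_COUNTS[c] / TOTAL) * channel_prob(word, c),
    )


print(correct("nigt"))   # night
print(correct("goood"))  # good
print(correct("god"))    # god
```

In this sketch, a word that is already in the vocabulary is kept unless a neighbouring word is frequent enough to outweigh the assumed 0.95 versus 0.05 channel penalty. In a code-mixed setting, a function like correct would presumably be applied only to tokens that the language identification step tagged as English.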