DocumentCode :
1995514
Title :
Text normalization in code-mixed social media text
Author :
Dutta, Sukanya ; Saha, Tista ; Banerjee, Somnath ; Naskar, Sudip Kumar
Author_Institution :
Dept. of Comput. Sci. & Eng., Jadavpur Univ., Kolkata, India
fYear :
2015
fDate :
9-11 July 2015
Firstpage :
378
Lastpage :
382
Abstract :
This paper addresses text normalization, an often overlooked problem in natural language processing, in code-mixed social media text. The objective of the work presented here is to correct English spelling errors in code-mixed social media text that contains English words as well as Romanized transliterations of words from another language, in this case Bangla. This task also requires solving a second problem: word-level language identification in code-mixed social media text. We employ a CRF-based machine learning approach followed by post-processing heuristics for the word-level language identification task. For spelling correction, we use the noisy channel model. In addition, the spell checker presented here handles wordplay, contracted words and phonetic variations. Overall, word-level language identification achieved 90.5% accuracy, and the spell checker achieved 69.43% accuracy on the detected English words.
Keywords :
learning (artificial intelligence); natural language processing; social networking (online); text analysis; CRF based machine learning; English spelling errors; code-mixed social media text; natural language processing; romanized transliteration; text normalization; word-level language identification; Accuracy; Channel models; Dictionaries; Electronic mail; Matrices; Media; Noise measurement; code-mixed text; language identification; spell checking; text normalization;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Recent Trends in Information Systems (ReTIS), 2015 IEEE 2nd International Conference on
Conference_Location :
Kolkata
Type :
conf
DOI :
10.1109/ReTIS.2015.7232908
Filename :
7232908