DocumentCode
1995514
Title
Text normalization in code-mixed social media text
Author
Dutta, Sukanya ; Saha, Tista ; Banerjee, Somnath ; Naskar, Sudip Kumar
Author_Institution
Dept. of Comput. Sci. & Eng., Jadavpur Univ., Kolkata, India
fYear
2015
fDate
9-11 July 2015
Firstpage
378
Lastpage
382
Abstract
This paper addresses text normalization, an often overlooked problem in natural language processing, in code-mixed social media text. The objective of the work presented here is to correct English spelling errors in code-mixed social media text that contains English words as well as Romanized transliterations of words from another language, in this case Bangla. This task also requires solving a second problem: word-level language identification in code-mixed social media text. We employ a CRF-based machine learning approach followed by post-processing heuristics for the word-level language identification task. For spelling correction, we use the noisy channel model. In addition, the spell checker presented here handles wordplay, contracted words and phonetic variations. Overall, the word-level language identification achieved 90.5% accuracy and the spell checker achieved 69.43% accuracy on the detected English words.
Keywords
learning (artificial intelligence); natural language processing; social networking (online); text analysis; CRF based machine learning; English spelling errors; code-mixed social media text; romanized transliteration; text normalization; word-level language identification; Accuracy; Channel models; Dictionaries; Electronic mail; Matrices; Media; Noise measurement; code-mixed text; language identification; spell checking
fLanguage
English
Publisher
ieee
Conference_Titel
Recent Trends in Information Systems (ReTIS), 2015 IEEE 2nd International Conference on
Conference_Location
Kolkata
Type
conf
DOI
10.1109/ReTIS.2015.7232908
Filename
7232908