DocumentCode :
266929
Title :
Bangla word clustering based on N-gram language model
Author :
Ismail, Sabir ; Rahman, Md Saifur
Author_Institution :
Dept. of Comput. Sci. & Eng., Shahjalal Univ. of Sci. & Technol., Sylhet, Bangladesh
fYear :
2014
fDate :
10-12 April 2014
Firstpage :
1
Lastpage :
5
Abstract :
In this paper, we describe a method for producing Bangla word clusters based on semantic and contextual similarity. Word clustering is important for part-of-speech (POS) tagging, word sense disambiguation, text classification, recommender systems, spell checkers, grammar checkers, knowledge discovery, and many other Natural Language Processing (NLP) applications. Computerization of Bangla language processing began long ago, but it is still at an early stage and suffers from resource scarcity. We propose an unsupervised machine learning technique that builds Bangla word clusters from their semantic and contextual similarity using an N-gram language model. In an N-gram model, a word can be predicted from the sequence of words that precede and follow it. N-gram models have been applied successfully to word clustering in English and some other languages. Because word clustering is a new direction in Bangla language processing research, we consider this approach a good starting point, and our results support that assumption. We produced 456 clusters using a locally available large Bangla corpus. Subjective scores derived from the clusters reveal strong similarity among the words in the same cluster.
Keywords :
grammars; natural language processing; pattern classification; pattern clustering; recommender systems; text analysis; unsupervised learning; Bangla corpus; Bangla language processing; Bangla word clustering; English; N-gram language model; NLP; contextual similarity; grammar checker; information retrieval; knowledge discover; natural language processing applications; parts-of-speech tagging; recommender system; semantic similarity; spell checker; text classification; unsupervised machine learning technique; word sense disambiguation; words sequence; Computational modeling; Context; Educational institutions; Mathematical model; Natural language processing; Semantics; Speech processing; information retrieval; machine learning; n-gram model; natural language processing; word cluster;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Electrical Engineering and Information & Communication Technology (ICEEICT), 2014 International Conference on
Conference_Location :
Dhaka
Print_ISBN :
978-1-4799-4820-8
Type :
conf
DOI :
10.1109/ICEEICT.2014.6919083
Filename :
6919083