DocumentCode :
1994281
Title :
Sentiment analysis using cosine similarity measure
Author :
Bhattacharjee, Saprativa ; Das, Anirban ; Bhattacharya, Ujjwal ; Parui, Swapan K. ; Roy, Sudipta
Author_Institution :
Dept. of Inf. Technol., Sikkim Manipal Inst. of Technol., Majitar, India
fYear :
2015
fDate :
9-11 July 2015
Firstpage :
27
Lastpage :
32
Abstract :
The opinion of other people is often a major factor influencing our decisions. For a consumer it affects purchase decisions, and for a producer or a service provider it helps in making business decisions. Companies spend a lot of money and time on surveys for gathering public opinion on products and services. Nowadays the web has become a hotspot for finding user opinions on almost anything under the sun. Both money and time can be saved by mining opinions from the web. Moreover, no survey can have a sample size that matches that of the web. Each opinion generally expresses positive, negative or neutral sentiment. The task of identifying these sentiments is called Sentiment Analysis. This work deals with the analysis of user sentiments in the Telecom domain. Since no related standard database of users' opinions could be found, we developed one by mining the WWW. A major issue with these sample comments is that they are usually extremely noisy, containing numerous spelling and grammatical errors, acronyms, abbreviations, shortened or slang words etc. Such data cannot be used directly for analyzing sentiments. Hence, a lexicon-based preprocessing algorithm is proposed for noise reduction. A novel idea based on the Cosine Similarity measure is proposed for classifying the sentiment expressed by a user's comment on a five-point scale of -2 (highly negative) to +2 (highly positive). The performance of the proposed strategy is compared with some of the well-known machine learning algorithms, namely Naive Bayes, Maximum Entropy and SVM. The proposed Cosine Similarity-based classifier gives 82.09% accuracy for the 2-class problem of identifying positive and negative sentiments. It outperforms all other classifiers by a considerable margin in the 5-class sentiment classification problem, with an accuracy of 71.5%. The same strategy is also used for categorizing each user comment into six different Telecom-specific categories.
Keywords :
Bayes methods; Internet; data mining; learning (artificial intelligence); maximum entropy methods; natural language processing; pattern classification; support vector machines; telecommunication services; text analysis; SVM; abbreviations; acronyms; business decision making; cosine similarity measure; grammatical errors; lexicon based preprocessing algorithm; machine learning algorithms; maximum entropy; naive Bayes; noise reduction; opinion mining; public opinion; purchase decision; sentiment analysis; sentiment classification; sentiment identification; service provider; shortened words; slang words; spelling errors; telecom domain; user comment; user opinion; Accuracy; Dictionaries; Frequency conversion; Sentiment analysis; Support vector machines; Telecommunications; Training;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Recent Trends in Information Systems (ReTIS), 2015 IEEE 2nd International Conference on
Conference_Location :
Kolkata
Type :
conf
DOI :
10.1109/ReTIS.2015.7232847
Filename :
7232847
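Note :
The abstract does not spell out how the Cosine Similarity classifier is built, so the following is only a minimal sketch of one common way such a classifier might look, not the authors' method. It assumes plain term-frequency vectors, one aggregated vector per sentiment label, and assignment to the label whose vector has the highest cosine similarity with the comment. The function names and the five toy comments are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for the
    # paper's lexicon-based preprocessing, which the abstract does not detail.
    return text.lower().split()


def cosine_similarity(a, b):
    # Cosine of the angle between two sparse term-frequency vectors.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_class_vectors(labeled_comments):
    # Sum the term counts of all comments that share a label.
    vectors = {}
    for text, label in labeled_comments:
        vectors.setdefault(label, Counter()).update(tokenize(text))
    return vectors


def classify(text, class_vectors):
    # Pick the label whose aggregated vector is most similar to the comment.
    query = Counter(tokenize(text))
    return max(
        class_vectors,
        key=lambda label: cosine_similarity(query, class_vectors[label]),
    )


# Hypothetical toy data on the five-point scale from -2 to +2.
TOY_COMMENTS = [
    ("worst network ever totally useless", -2),
    ("call drops are annoying", -1),
    ("recharged my plan today", 0),
    ("good data speed", 1),
    ("excellent service absolutely love it", 2),
]

if __name__ == "__main__":
    class_vectors = build_class_vectors(TOY_COMMENTS)
    print(classify("excellent service", class_vectors))
```

In this sketch a comment that shares no terms with any class scores 0.0 everywhere and falls back to the first label encountered, so a real system would need an explicit rule for that case.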