Sentiment analysis using cosine similarity measure

Author

Bhattacharjee, Saprativa ; Das, Anirban ; Bhattacharya, Ujjwal ; Parui, Swapan K. ; Roy, Sudipta

Author_Institution

Dept. of Inf. Technol., Sikkim Manipal Inst. of Technol., Majitar, India

fYear

2015

fDate

9-11 July 2015

Firstpage

Lastpage

Abstract

The opinion of other people is often a major factor influencing our decisions. For a consumer it affects purchase decisions, and for a producer or a service provider it helps in making business decisions. Companies spend a lot of money and time on surveys to gather public opinion on products and services. Nowadays the web has become a hotspot for finding user opinions on almost anything under the sun. Both money and time can be saved by mining opinions from the web. Moreover, no survey can have a sample size that matches that of the web. Each opinion generally expresses a positive, negative or neutral sentiment. The task of identifying these sentiments is called Sentiment Analysis. This work deals with the analysis of user sentiments in the Telecom domain. Since no standard database of users' opinions in this domain could be found, we developed one by mining the WWW. A major issue with these sample comments is that they are usually extremely noisy, containing numerous spelling and grammatical errors, acronyms, abbreviations, shortened or slang words, etc. Such data cannot be used directly for analyzing sentiments. Hence, a lexicon-based preprocessing algorithm is proposed for noise reduction. A novel idea based on the Cosine Similarity measure is proposed for classifying the sentiment expressed by a user's comment on a five-point scale from -2 (highly negative) to +2 (highly positive). The performance of the proposed strategy is compared with some well-known machine learning algorithms, namely Naive Bayes, Maximum Entropy and SVM. The proposed Cosine Similarity based classifier gives 82.09% accuracy for the 2-class problem of identifying positive and negative sentiments. It outperforms all other classifiers by a considerable margin in the 5-class sentiment classification problem, with an accuracy of 71.5%. The same strategy is also used for categorizing each user comment into six different Telecom-specific categories.
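The abstract does not say how comments are vectorised or what each comment is compared against, so the following is only a minimal sketch of one way a cosine-similarity classifier with lexicon-based noise reduction could look, not the authors' method. It assumes raw term-frequency vectors and one aggregated "profile" vector per sentiment class. The names `NOISE_LEXICON`, `preprocess`, `build_profiles` and `classify`, and the toy training comments, are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical noise lexicon: maps slang or shortened tokens to standard words.
NOISE_LEXICON = {"gud": "good", "gr8": "great", "nt": "not", "plz": "please"}


def preprocess(text):
    """Lowercase, split on whitespace, strip punctuation, expand lexicon entries."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(".,!?;:\"'()")
        if token:
            tokens.append(NOISE_LEXICON.get(token, token))
    return tokens


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profiles(labelled_comments):
    """Sum the term counts of all training comments that share a label."""
    profiles = {}
    for text, label in labelled_comments:
        profiles.setdefault(label, Counter()).update(preprocess(text))
    return profiles


def classify(text, profiles):
    """Return the label whose profile is most similar to the comment."""
    vector = Counter(preprocess(text))
    return max(profiles, key=lambda label: cosine_similarity(vector, profiles[label]))


# Toy training data on the five-point scale from -2 to +2 (invented for illustration).
TRAIN = [
    ("worst network ever, totally useless", -2),
    ("call drops are annoying", -1),
    ("service is okay", 0),
    ("good coverage in my area", 1),
    ("excellent speed, great support", 2),
]

PROFILES = build_profiles(TRAIN)
label_for_slang_comment = classify("gr8 speed!", PROFILES)
```

In this sketch, a comment that shares no terms with any profile scores 0.0 against every class, and `max` then returns the first label in the dictionary; a real system would need an explicit rule for that case.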

Keywords

Bayes methods; Internet; data mining; learning (artificial intelligence); maximum entropy methods; natural language processing; pattern classification; support vector machines; telecommunication services; text analysis; SVM; abbreviations; acronyms; business decision making; cosine similarity measure; grammatical errors; lexicon based preprocessing algorithm; machine learning algorithms; maximum entropy; naive Bayes; noise reduction; opinion mining; public opinion; purchase decision; sentiment analysis; sentiment classification; sentiment identification; service provider; shortened words; slang words; spelling errors; telecom domain; user comment; user opinion; Accuracy; Dictionaries; Frequency conversion; Sentiment analysis; Support vector machines; Telecommunications; Training;

fLanguage

English

Publisher

IEEE

Conference_Titel

Recent Trends in Information Systems (ReTIS), 2015 IEEE 2nd International Conference on

Conference_Location

Kolkata

Type

conf

DOI

10.1109/ReTIS.2015.7232847

Filename

7232847

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=1994281