• DocumentCode
    1994281
  • Title
    Sentiment analysis using cosine similarity measure
  • Author
    Bhattacharjee, Saprativa ; Das, Anirban ; Bhattacharya, Ujjwal ; Parui, Swapan K. ; Roy, Sudipta
  • Author_Institution
    Dept. of Inf. Technol., Sikkim Manipal Inst. of Technol., Majitar, India
  • fYear
    2015
  • fDate
    9-11 July 2015
  • Firstpage
    27
  • Lastpage
    32
  • Abstract
    The opinion of other people is often a major factor influencing our decisions. For a consumer it affects purchase decisions, and for a producer or a service provider it helps in making business decisions. Companies spend a lot of money and time on surveys to gather public opinion on products and services. Nowadays the web has become a hotspot for finding user opinions on almost anything under the sun. Both money and time can be saved by mining opinions from the web. Moreover, no survey can have a sample size that matches that of the web. Each opinion generally expresses a positive, negative or neutral sentiment. The task of identifying these sentiments is called Sentiment Analysis. This work deals with the analysis of user sentiments in the Telecom domain. Since no standard database of users' opinions in this domain could be found, we developed one by mining the WWW. A major issue with these sample comments is that they are usually extremely noisy, containing numerous spelling and grammatical errors, acronyms, abbreviations, shortened or slang words, etc. Such data cannot be used directly for analyzing sentiments. Hence, a lexicon-based preprocessing algorithm is proposed for noise reduction. A novel idea based on the Cosine Similarity measure is proposed for classifying the sentiment expressed by a user's comment on a five-point scale from -2 (highly negative) to +2 (highly positive). The performance of the proposed strategy is compared with some well-known machine learning algorithms, namely Naive Bayes, Maximum Entropy and SVM. The proposed Cosine Similarity based classifier gives 82.09% accuracy for the 2-class problem of identifying positive and negative sentiments. It outperforms all other classifiers by a considerable margin in the 5-class sentiment classification problem, with an accuracy of 71.5%. The same strategy is also used for categorizing each user comment into six different Telecom-specific categories.
  • Keywords
    Bayes methods; Internet; data mining; learning (artificial intelligence); maximum entropy methods; natural language processing; pattern classification; support vector machines; telecommunication services; text analysis; SVM; abbreviations; acronyms; business decision making; cosine similarity measure; grammatical errors; lexicon based preprocessing algorithm; machine learning algorithms; maximum entropy; naive Bayes; noise reduction; opinion mining; public opinion; purchase decision; sentiment analysis; sentiment classification; sentiment identification; service provider; shortened words; slang words; spelling errors; telecom domain; user comment; user opinion; Accuracy; Dictionaries; Frequency conversion; Sentiment analysis; Support vector machines; Telecommunications; Training;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Recent Trends in Information Systems (ReTIS), 2015 IEEE 2nd International Conference on
  • Conference_Location
    Kolkata
  • Type
    conf
  • DOI
    10.1109/ReTIS.2015.7232847
  • Filename
    7232847
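
The abstract describes classifying a comment by a cosine similarity measure on a five-point scale from -2 to +2, but it does not spell out how the class representations are built. The following is only a minimal sketch of one plausible reading: each class is represented by a term-frequency vector, and a comment is assigned to the class whose vector it is most similar to. The function names (`vectorize`, `cosine_similarity`, `classify`) and the toy `PROTOTYPES` data are hypothetical and are not taken from the paper; the paper's lexicon-based preprocessing step is not reproduced here.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words term-frequency vector (hypothetical helper)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def classify(comment, prototypes):
    """Return the label whose prototype vector is most similar to the comment."""
    vec = vectorize(comment)
    return max(prototypes, key=lambda label: cosine_similarity(vec, prototypes[label]))


# Toy class prototypes on the -2..+2 scale; invented purely for illustration.
PROTOTYPES = {
    -2: vectorize("worst terrible network useless pathetic service"),
    -1: vectorize("slow network poor signal"),
    0: vectorize("recharged plan yesterday"),
    1: vectorize("good network decent speed"),
    2: vectorize("excellent network awesome speed love service"),
}

if __name__ == "__main__":
    print(classify("awesome speed and excellent service", PROTOTYPES))
    print(classify("slow network", PROTOTYPES))
```

In this sketch the same `classify` function could be reused for the topic categorization mentioned in the abstract by swapping the sentiment prototypes for one prototype per category; whether the paper does it this way is an assumption.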