DocumentCode :
3570860
Title :
Text classification for automatic detection of alcohol use-related tweets: A feasibility study
Author :
Aphinyanaphongs, Yin ; Ray, Bisakha ; Statnikov, Alexander ; Krebs, Paul
Author_Institution :
NYU Langone Med. Center, New York, NY, USA
fYear :
2014
Firstpage :
93
Lastpage :
97
Abstract :
We present a feasibility study using text classification to classify tweets about alcohol use. Alcohol is the most widely used substance in the US, and its use is the leading risk factor for premature morbidity and mortality globally. Understanding use patterns and locations is an important step toward prevention, moderation, and control of alcohol outlets. Social media may provide an alternate way to measure alcohol use in real time. This feasibility study explores text classification methodologies for identifying alcohol use tweets. We labeled 34,563 geo-located New York City tweets collected in a 24-hour period over New Year's Day 2012. We preprocessed the tweets into stemmed/unstemmed and unigram/bigram representations. We then applied multinomial naïve Bayes, a linear SVM, Bayesian logistic regression, and random forests to the classification task. Using 10-fold cross-validation, the algorithms achieved areas under the receiver operating characteristic curve of 0.66, 0.91, 0.93, and 0.94, respectively. We also compare against a human-constructed Boolean search over the same tweets; the text classification method is competitive with this hand-crafted search. In conclusion, we show that automatically identifying alcohol-related tweets is highly feasible, paving the way for future research to improve these classifiers.
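As a rough illustration of two pieces the abstract mentions, the sketch below builds a combined unigram/bigram representation of a tweet and computes the area under the ROC curve from classifier scores. It is a minimal standard-library sketch, not the authors' pipeline: the whitespace tokenizer, the function names `ngram_features` and `roc_auc`, and the example inputs are all assumptions made for illustration, and stemming and the classifiers themselves are omitted.

```python
from typing import List, Sequence


def ngram_features(text: str, n_max: int = 2) -> List[str]:
    """Return all 1..n_max word n-grams of a lowercased, whitespace-split text.

    Assumption: simple whitespace tokenization; the paper's actual
    tokenizer and stemmer are not specified in this record.
    """
    tokens = text.lower().split()
    features = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one,
    with ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

In a cross-validated setting, a function like `roc_auc` would typically be applied to the held-out scores of each fold and the results averaged; the pairwise form shown here is quadratic in the number of examples and is meant for clarity rather than speed.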
Keywords :
belief networks; learning (artificial intelligence); pattern classification; regression analysis; social networking (online); social sciences computing; support vector machines; text analysis; Bayesian logistic regression; Boolean search; New York City; alcohol outlet control; alcohol outlet moderation; alcohol outlet prevention; alcohol use-related tweet detection; area under the receiver operating curve; bigram representation; cross-validation; linear SVM; linear support vector machines; multinomial naive Bayes; not stemmed representation; premature morbidity; premature mortality; random forests; stem representation; text classification; tweet classification; unigram representation; Bayes methods; Classification algorithms; Logistics; Media; Support vector machines; Text categorization; Twitter; alcohol use; social media; text categorization; text classification; twitter;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Information Reuse and Integration (IRI), 2014 IEEE 15th International Conference on
Type :
conf
DOI :
10.1109/IRI.2014.7051877
Filename :
7051877