Title :
Topic-specific post identification in microblog streams
Author :
Karunasekera, Shanika ; Harwood, Aaron ; Samarawickrama, Sameendra ; Ramamohanarao, Kotagiri ; Robins, Garry
Author_Institution :
Dept. of Comput. & Inf. Syst., Univ. of Melbourne, Melbourne, VIC, Australia
Abstract :
Tracking microblog discussion on a given topic is useful for a wide range of higher-level applications. Microblog services like Twitter provide a simple keyword-based tracking capability, where any tweet containing a keyword is returned. Because microblog posts are short, tracking with a small number of topic-specific query words reduces recall. A larger number of keywords (compared to regular document retrieval) is generally required to obtain good recall, but this returns a large number of off-topic posts, resulting in low precision. In our work, we consider the scenario of using a large number of query terms to maintain high recall for automated tracking of a microblog stream. The challenge we address is how to score each returned microblog post with respect to the query, on-line and in an unsupervised manner, so as to identify those that are on topic. To this end, we propose a new term-scoring expression, which we call Adjusted Information Gain (AIG), and compare it to other term-scoring expressions: inverse document frequency, Dice, Jaccard and keyword frequency. Our comparisons consider a selection of document-scoring functions applied to roughly 40 million tweets collected over a 20-day period for each of two topics. Our results show significant improvements (from 8% to 40% of the area under the ROC curves) over existing term-scoring expressions, depending on topic and specificity, and provide insight into further work in query expansion techniques.
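As a rough illustration of the baseline term-scoring expressions named in the abstract, the Python sketch below computes inverse document frequency, Dice and Jaccard scores for query terms and then scores posts by summing the scores of the query terms they contain. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not define AIG, so it is not reproduced here; measuring Dice and Jaccard as co-occurrence with a single seed term, the summation used as the document-scoring function, and all function names and sample posts are hypothetical choices made for illustration.

```python
import math

# Assumption: each post is represented as a set of lowercase tokens.

def doc_freq(posts, term):
    """Number of posts that contain the term."""
    return sum(1 for p in posts if term in p)

def co_freq(posts, a, b):
    """Number of posts that contain both terms."""
    return sum(1 for p in posts if a in p and b in p)

def idf(posts, term):
    """Inverse document frequency: log(N / df). Returns 0.0 for unseen terms."""
    df = doc_freq(posts, term)
    return math.log(len(posts) / df) if df else 0.0

def dice(posts, term, seed):
    """Dice coefficient between a query term and a seed term (assumed setup)."""
    denom = doc_freq(posts, term) + doc_freq(posts, seed)
    return 2 * co_freq(posts, term, seed) / denom if denom else 0.0

def jaccard(posts, term, seed):
    """Jaccard coefficient between a query term and a seed term (assumed setup)."""
    both = co_freq(posts, term, seed)
    union = doc_freq(posts, term) + doc_freq(posts, seed) - both
    return both / union if union else 0.0

def score_post(post, term_scores):
    """Assumed document-scoring function: sum of scores of matching query terms."""
    return sum(s for t, s in term_scores.items() if t in post)

# Hypothetical toy stream: three flood-related posts and one off-topic post.
posts = [
    {"flood", "rescue", "river"},
    {"flood", "warning", "river"},
    {"music", "river", "festival"},
    {"flood", "rescue"},
]
query = ["rescue", "river", "warning"]
seed = "flood"

idf_scores = {t: idf(posts, t) for t in query}
dice_scores = {t: dice(posts, t, seed) for t in query}
jaccard_scores = {t: jaccard(posts, t, seed) for t in query}

for post in posts:
    print(sorted(post), round(score_post(post, dice_scores), 3))
```

In this toy stream the off-topic post matches only the ambiguous term "river", so under the Dice-based scores it ranks below the posts that also contain "rescue" or "warning". Thresholding such scores is one way to trade precision against recall.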
Keywords :
document handling; query processing; social networking (online); AIG; Twitter; adjusted information gain; inverse document frequency; keyword based tracking capability; keyword frequency; microblog discussion; microblog posts; microblog streams; off-topic posts; query expansion techniques; query terms; term-scoring expression; term-scoring expressions; topic specific query words; topic-specific post identification; unsupervised manner; Australia; Broadband communication; Context; Twitter; Vectors; Vocabulary; document; keyword; microblog; query; term; topic;
Conference_Title :
Big Data (Big Data), 2014 IEEE International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/BigData.2014.7004416