6 million spam tweets: A large ground truth for timely Twitter spam detection

Author

Chen, Chao ; Zhang, Jun ; Chen, Xiao ; Xiang, Yang ; Zhou, Wanlei

Author_Institution

School of Information Technology, Deakin University, Victoria 3125, Australia

fYear

2015

fDate

8-12 June 2015

Firstpage

7065

Lastpage

7070

Abstract

In recent years, Twitter has changed the way people communicate and get news in their daily lives. Because of its popularity, Twitter has also become a major target for spamming activities. To stop spammers, Twitter uses Google SafeBrowsing to detect and block spam links. Although blacklists can block malicious URLs embedded in tweets, their lag time hinders their ability to protect users in real time. Researchers have therefore begun to apply different machine learning algorithms to detect Twitter spam. However, there has been no comprehensive evaluation of each algorithm's performance for real-time Twitter spam detection, owing to the lack of a large ground truth. To carry out a thorough evaluation, we collected a large dataset of over 600 million public tweets. We further labelled around 6.5 million spam tweets and extracted 12 lightweight features that can be used for online detection. In addition, we conducted a number of experiments on six machine learning algorithms under various conditions to better understand their effectiveness and weaknesses for timely Twitter spam detection. We will make our labelled dataset available to researchers who are interested in validating or extending our work.

Keywords

Chaos; Gold; Support vector machines

fLanguage

English

Publisher

IEEE

Conference_Titel

Communications (ICC), 2015 IEEE International Conference on

Conference_Location

London, United Kingdom

Type

conf

DOI

10.1109/ICC.2015.7249453

Filename

7249453