• DocumentCode
    2161687
  • Title
    6 million spam tweets: A large ground truth for timely Twitter spam detection
  • Author
    Chen, Chao ; Zhang, Jun ; Chen, Xiao ; Xiang, Yang ; Zhou, Wanlei
  • Author_Institution
    School of Information Technology, Deakin University, Victoria 3125, Australia
  • fYear
    2015
  • fDate
    8-12 June 2015
  • Firstpage
    7065
  • Lastpage
    7070
  • Abstract
    Twitter has changed the way people communicate and get news in their daily lives in recent years. Meanwhile, owing to its popularity, Twitter has also become a major target for spamming activities. To stop spammers, Twitter uses Google SafeBrowsing to detect and block spam links. Although blacklists can block malicious URLs embedded in tweets, their lag time hinders their ability to protect users in real time. Researchers have therefore begun to apply different machine learning algorithms to detect Twitter spam. However, there has been no comprehensive evaluation of each algorithm's performance for real-time Twitter spam detection, owing to the lack of a large ground truth. To carry out a thorough evaluation, we collected a large dataset of over 600 million public tweets. We further labelled around 6.5 million spam tweets and extracted 12 lightweight features, which can be used for online detection. In addition, we conducted a number of experiments on six machine learning algorithms under various conditions to better understand their effectiveness and weaknesses for timely Twitter spam detection. We will make our labelled dataset available to researchers who are interested in validating or extending our work.
  • Keywords
    Chaos; Gold; Support vector machines;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Communications (ICC), 2015 IEEE International Conference on
  • Conference_Location
    London, United Kingdom
  • Type
    conf
  • DOI
    10.1109/ICC.2015.7249453
  • Filename
    7249453