• DocumentCode
    731520
  • Title
    Mining StackOverflow to Filter Out Off-Topic IRC Discussion
  • Author
    Chowdhury, Shaiful Alam; Hindle, Abram

  • Author_Institution
    Dept. of Comput. Sci., Univ. of Alberta, Edmonton, AB, Canada
  • fYear
    2015
  • fDate
    16-17 May 2015
  • Firstpage
    422
  • Lastpage
    425
  • Abstract
    Internet Relay Chat (IRC) is a tool commonly used by Open Source developers. Developers use IRC channels to discuss programming-related problems, but much of the discussion is irrelevant and off-topic. Essentially, if we treat IRC discussions like email messages and apply spam filtering, we can try to filter out the spam (the off-topic discussions) from the ham (the programming discussions). Yet we need labelled data, which unfortunately takes time to curate. To avoid costly curation in order to filter out off-topic discussions, we need positive and negative data sources. Online discussion forums, such as Stack Overflow, are very effective for solving programming problems. By engaging in open data, Stack Overflow data becomes a powerful source of labelled text regarding programming. This work shows that we can train classifiers using Stack Overflow posts as positive examples of on-topic programming discussion. YouTube video comments, notorious for their lack of quality, serve as a training set of off-topic discussion. By exploiting these datasets, accurate classifiers can be built, tested and evaluated that require very little effort for end-users to deploy and exploit.
  • Keywords
    Internet; data mining; e-mail filters; information filtering; social networking (online); software tools; IRC channels; Internet Relay Chat; OpenSource developers; YouTube video comments; email messages; labelled text source; off-topic IRC discussion; programming related problems; spam filtering; stackoverflow mining; Accuracy; Mathematical model; Programming; Support vector machines; Training; YouTube; IRC message filtering; Naive Bayes; SVM; Text classification
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Mining Software Repositories (MSR), 2015 IEEE/ACM 12th Working Conference on
  • Conference_Location
    Florence
  • Type
    conf
  • DOI
    10.1109/MSR.2015.54
  • Filename
    7180108
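
The approach summarised in the abstract (Stack Overflow posts as on-topic training examples, YouTube comments as off-topic ones, then a spam-filter-style classifier applied to IRC messages) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name NaiveBayesFilter, the tokenizer, and the six toy training sentences are all hypothetical stand-ins, and a real setup would train on the actual Stack Overflow and YouTube datasets. Naive Bayes is used here only because it is listed among the record's keywords.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9_+#]+", text.lower())


class NaiveBayesFilter:
    """Hypothetical multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"on": Counter(), "off": Counter()}
        self.doc_counts = {"on": 0, "off": 0}
        self.vocab = set()

    def train(self, texts, label):
        """Add labelled documents; label is 'on' (topic) or 'off' (topic)."""
        for text in texts:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
            self.doc_counts[label] += 1

    def log_score(self, text, label):
        """Log prior plus smoothed log likelihood of the known tokens."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for tok in tokenize(text):
            if tok in self.vocab:  # tokens never seen in training are skipped
                score += math.log((self.word_counts[label][tok] + 1) / denom)
        return score

    def is_on_topic(self, text):
        return self.log_score(text, "on") > self.log_score(text, "off")


# Toy stand-ins for the two data sources (assumed examples, not real data).
stack_overflow_like = [
    "How do I fix a null pointer exception in my Java function?",
    "Python list index out of range error when looping over an array",
    "Why does my compiler report an undefined reference to this function?",
]
youtube_like = [
    "lol this video is so funny",
    "first comment love this song",
    "who is watching this video in 2015 lol",
]

clf = NaiveBayesFilter()
clf.train(stack_overflow_like, "on")
clf.train(youtube_like, "off")

# Classify two made-up IRC-style messages.
for message in ["anyone know why this function throws an exception?",
                "lol that video was so funny"]:
    print(message, "->", "on-topic" if clf.is_on_topic(message) else "off-topic")
```

The point of the design is that no IRC message has to be labelled by hand: both training classes come from sources whose topic is known in advance, which is the labelling shortcut the abstract describes.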