• DocumentCode
    2205789
  • Title
    FLOSS as a Source for Profanity and Insults: Collecting the Data
  • Author
    Squire, Megan; Gazda, Rebecca
  • fYear
    2015
  • fDate
    5-8 Jan. 2015
  • Firstpage
    5290
  • Lastpage
    5298
  • Abstract
    An important task in machine learning and natural language processing is to learn to recognize different types of human speech, including humor, sarcasm, insults, and profanity. In this paper we describe our method to produce test and training data sets to assist in this task. Our test data sets are taken from the domain of free, libre, and open source software (FLOSS) development communities. We describe our process in constructing helper sets of relevant data, such as profanity lists, lists of insults, and lists of projects with their codes of conduct. Contributions of this paper are to describe the background literature on computer-aided methods of recognizing insulting or profane speech, to describe the parameters of data sets that are useful in this work, and to outline how FLOSS communities are such a rich source of insulting or profane speech data. We then describe our data sets in detail, including how we created these data sets, and provide some initial guidelines for usage.
  • Keywords
    learning (artificial intelligence); natural language processing; public domain software; speech recognition; FLOSS development; computer-aided method; free libre and open source software; insult list; machine learning; natural language processing; profanity list; speech recognition; Communities; Electronic mail; Kernel; Linux; Media; Speech; Speech recognition; data set; dialogue; free software; insults; irc; linux; mailing list; open source; profanity
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    System Sciences (HICSS), 2015 48th Hawaii International Conference on
  • Conference_Location
    Kauai, HI
  • ISSN
    1530-1605
  • Type
    conf
  • DOI
    10.1109/HICSS.2015.623
  • Filename
    7070451