• DocumentCode
    2869589
  • Title

    Handling Language Variations in Open Source Bug Reporting Systems

  • Author

    Banerjee, Sean ; Musgrove, Jesse ; Cukic, Bojan

  • Author_Institution
    Lane Dept. of Comput. Sci. & Electr. Eng., West Virginia Univ., Morgantown, WV, USA
  • fYear
    2012
  • fDate
    27-30 Nov. 2012
  • Firstpage
    325
  • Lastpage
    330
  • Abstract
    Natural language plays a critical role in the design, development and maintenance of software systems. For example, bug reporting systems allow users to submit reports describing observed anomalies in free-form English. However, the free-form aspect makes the detection of duplicate reports a challenge due to the breadth and diversity of language used by individual reporters. Tokenization, stemming and stop word removal are commonly used techniques to normalize and reduce the language space. However, the impact of typographical errors and alternate spellings has not been analyzed in the research literature. Our research indicates that handling language problems during automated bug triage analysis can lead to a boost in performance. We show that the language used in software problem reporting is too specialized to benefit from domain-independent spell checkers or lexical databases. Therefore, we present a novel approach using word distance and neighbor word likelihood measures for detecting and resolving language-based issues in open-source software problem reporting. We evaluate our approach using the complete Firefox repository until March 2012. Our results indicate measurable improvements in duplicate detection results, while reducing the language space for the most frequently used words by 30%. Moreover, our method is language-agnostic and does not require a pre-built dictionary, thus making it suitable for use in a variety of systems.
  • Keywords
    computational linguistics; natural language processing; program verification; public domain software; software development management; software maintenance; spelling aids; automated bug triage analysis; language variation handling; lexical database; natural language; neighbor word likelihood measure; open source bug reporting system; open source software; resolving language-based issue detection; software development; software maintenance; software system design; spelling checker; stemming technique; stop word removal; tokenization technique; word distance; Color; Context; Databases; Dictionaries; Frequency measurement; Image color analysis; Software; Alternate Spellings; Duplicate Bug Reports; Software Maintenance; Software Reliability; String Algorithms; Typographical Errors;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Software Reliability Engineering Workshops (ISSREW), 2012 IEEE 23rd International Symposium on
  • Conference_Location
    Dallas, TX
  • Print_ISBN
    978-1-4673-5048-8
  • Type

    conf

  • DOI
    10.1109/ISSREW.2012.85
  • Filename
    6405465
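
The abstract mentions a word distance measure for resolving typographical errors and alternate spellings without a pre-built dictionary. The record does not describe the paper's actual algorithm, so the following is only a minimal sketch of the general idea, assuming plain Levenshtein distance and a corpus-frequency heuristic: a rare token is mapped to a more frequent token that is within a small edit distance. The names `edit_distance` and `map_variants`, and the parameters `max_distance` and `min_length`, are hypothetical and chosen for illustration. The neighbor word likelihood measure named in the abstract is not modeled here.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def map_variants(tokens, max_distance=1, min_length=5):
    """Map each token to a strictly more frequent token within max_distance.

    Assumption for this sketch: the more frequent spelling is treated as
    canonical. Tokens shorter than min_length are left alone, because short
    words have many unrelated close neighbors.
    """
    counts = Counter(tokens)
    # Most frequent first; ties broken alphabetically for determinism.
    vocab = sorted(counts, key=lambda w: (-counts[w], w))
    canonical = {}
    for word in vocab:
        canonical[word] = word
        if len(word) < min_length:
            continue
        for cand in vocab:
            if counts[cand] <= counts[word]:
                break  # only strictly more frequent candidates qualify
            if canonical[cand] == cand and edit_distance(word, cand) <= max_distance:
                canonical[word] = cand
                break
    return canonical


if __name__ == "__main__":
    sample = ["color"] * 4 + ["colour"] + ["crash"] * 3 + ["crashs"]
    for word, target in sorted(map_variants(sample).items()):
        print(word, "->", target)
```

Because the vocabulary is built from the corpus itself, this sketch needs no external dictionary, which matches the language-agnostic property the abstract claims. Frequency alone is a weak signal, though, and a context measure such as the one the abstract names would be needed to avoid merging distinct words that happen to be close in spelling.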