Title :
Detecting Duplicate Bug Report Using Character N-Gram-Based Features
Author :
Sureka, Ashish ; Jalote, Pankaj
Author_Institution :
Indraprastha Inst. of Inf. Technol. (IIIT), New Delhi, India
Date :
Nov. 30 2010-Dec. 3 2010
Abstract :
We present an approach to identify duplicate bug reports expressed in free-form text. Duplicate reports need to be identified to avoid a situation where they get assigned to multiple developers. Also, duplicate reports can contain complementary information that can be useful for bug fixing. Automatic identification of duplicate reports (from thousands of existing reports in a bug repository) can increase the productivity of a Triager by reducing the time the Triager spends searching for duplicates of an incoming report. The proposed method uses a character N-gram-based model for the task of duplicate bug report detection. Previous approaches are word-based, whereas this study investigates the usefulness of low-level character-based features, which have certain inherent advantages over word-based features for this problem (such as natural-language independence, robustness to noisy data, and effective handling of domain-specific term variations). The proposed solution is evaluated on a publicly available dataset consisting of more than 200 thousand bug reports from the open-source Eclipse project. The dataset includes ground truth (bug reports pre-annotated as duplicates by the Triager). Empirical results and evaluation metrics quantifying retrieval performance indicate that the approach is effective.
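To make the idea of character N-gram features concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the N-gram lengths, the text normalization, or the similarity function, so the choices here (lengths 3 to 5, lower-casing, cosine similarity over N-gram counts) and the function names `char_ngrams`, `cosine_similarity`, and `rank_candidates` are all illustrative assumptions.

```python
from collections import Counter
import math


def char_ngrams(text, n_values=(3, 4, 5)):
    """Count character n-grams of a lower-cased, whitespace-normalized text.

    The n-gram lengths and the normalization are assumptions made for this
    sketch; the abstract does not state which ones the paper uses.
    """
    normalized = " ".join(text.lower().split())
    grams = Counter()
    for n in n_values:
        for i in range(len(normalized) - n + 1):
            grams[normalized[i:i + n]] += 1
    return grams


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors (assumed measure)."""
    if not a or not b:
        return 0.0
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


def rank_candidates(query, reports, top_k=3):
    """Rank existing reports (dict of id -> text) against an incoming report.

    Returns the top_k (report_id, score) pairs, highest score first, which a
    Triager could inspect as likely duplicates.
    """
    query_grams = char_ngrams(query)
    scored = [
        (report_id, cosine_similarity(query_grams, char_ngrams(text)))
        for report_id, text in reports.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical report texts, invented for illustration only.
    existing = {
        "A": "NullPointerException when opening the editor",
        "B": "Toolbar icons are rendered too small on Linux",
    }
    print(rank_candidates("null pointer exception on opening editors", existing))
```

Because the features are substrings and not whole words, "NullPointerException" and "null pointer exception" still share many N-grams, which illustrates the robustness to term variations that the abstract attributes to character-level features.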
Keywords :
information retrieval; program debugging; program testing; public domain software; software maintenance; text analysis; Triager; automatic identification; character n-gram based features; duplicate bug report detection; evaluation metrics; free form text; ground truth; low-level features; open source Eclipse project; publicly available dataset; retrieval performance; Bug Report Analysis; Duplicate Bug Detection; Maintenance; Software Engineering Task Automation; Software Testing; Text Classification;
Conference_Title :
Software Engineering Conference (APSEC), 2010 17th Asia Pacific
Conference_Location :
Sydney, NSW
Print_ISBN :
978-1-4244-8831-5
Electronic_ISBN :
1530-1362
DOI :
10.1109/APSEC.2010.49