Title :
Large-Scale Heterogeneous Program Retrieval through Frequent Pattern Discovery and Feature Correlation Analysis
Author :
Bo Liu ; Liang Wu ; Qiuxiang Dong ; Yuanchun Zhou
Author_Institution :
NEC Labs. China, Beijing, China
Date :
June 27 2014-July 2 2014
Abstract :
In the era of big data, information retrieval has become even more challenging: data volume is growing fast, and it is difficult to find the right information in huge collections of heterogeneous datasets. In the software engineering domain especially, it is harder still to retrieve the right program from projects that are written in different languages and are not well developed. Prior work addressed this problem by extracting words from programs, which cannot fully exploit the information in source code. In this paper, we propose a novel program retrieval method that extracts frequent patterns and analyzes their correlations with the accompanying text information. Experimental results on large-scale, heterogeneous datasets validate the effectiveness of the proposed approach. The inferred semantics of programs can significantly improve the accuracy of code artifact retrieval.
Keywords :
Big Data; correlation methods; data mining; distributed databases; feature extraction; information retrieval; pattern recognition; software engineering; source code (software); text analysis; code artifact retrieval; data volume size; feature correlation analysis; frequent pattern discovery; frequent pattern extraction; heterogeneous datasets; large-scale heterogeneous program retrieval; program retrieval method; software engineering domain; source code; text information; word extraction; Computational modeling; Correlation; Java; Semantics
Conference_Title :
Big Data (BigData Congress), 2014 IEEE International Congress on
Conference_Location :
Anchorage, AK
Print_ISBN :
978-1-4799-5056-0
DOI :
10.1109/BigData.Congress.2014.120