DocumentCode :
1659009
Title :
A compilation framework for distributed memory parallelization of data mining algorithms
Author :
Li, Xiaogang ; Jin, Ruoming ; Agrawal, Gagan
Author_Institution :
Dept. of Comput. & Inf. Sci., Ohio State Univ., Columbus, OH, USA
fYear :
2003
Abstract :
With the availability of large datasets in a variety of scientific and commercial domains, data mining has emerged as an important area within the last decade. Data mining techniques focus on finding novel and useful patterns or models from large datasets. Because of the volume of the data to be analyzed, the amount of computation involved, and the need for rapid or even interactive analysis, data mining applications require the use of parallel machines. We believe that parallel compilation technology can be used for providing high-level language support for carrying out data mining implementations. Our study of a variety of popular data mining techniques has shown that they can be parallelized in a similar fashion. In our previous work, we have developed a middleware system that exploits this similarity to support distributed memory parallelization and execution on disk-resident datasets. This paper focuses on developing a data parallel language interface for using our middleware's functionality. We use a data parallel dialect of Java and show that it is well suited for data mining algorithms. Compiler techniques for translating this dialect to a middleware specification are presented. The most significant of these is a new technique for extracting a global reduction function from a data parallel loop. We present a detailed experimental evaluation of our compiler using apriori association mining, k-means clustering, and k-nearest neighbor classifiers. Our experimental results show that: 1) compiler-generated parallel data mining codes achieve high speedups in a cluster environment, 2) the performance of compiler-generated codes is quite close to the performance of manually written codes, and 3) simple additional optimizations like inlining can further reduce the gap between compiled and manual codes.
Keywords :
Java; data mining; distributed memory systems; middleware; parallel languages; parallelising compilers; pattern clustering; program control structures; software performance evaluation; very large databases; Java; apriori association mining; compilation framework; data mining algorithms; data parallel language interface; data parallel loop; disk-resident datasets; distributed memory parallelization; global reduction function; high-level language support; inlining; k-means clustering; k-nearest neighbor classifiers; large datasets; middleware system; parallel compilation; parallel machines; performance; speedups; Clustering algorithms; Concurrent computing; Data analysis; Data mining; High level languages; Java; Middleware; Optimizing compilers; Parallel languages; Parallel machines
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium, 2003. Proceedings. International
ISSN :
1530-2075
Print_ISBN :
0-7695-1926-1
Type :
conf
DOI :
10.1109/IPDPS.2003.1213080
Filename :
1213080