Author :
Glimcher, Leonid ; Jin, Ruoming ; Agrawal, Gagan
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH
Abstract :
Analysis of large geographically distributed scientific datasets, also referred to as distributed data-intensive science, has emerged as an important area in recent years. An application that processes data from a remote repository needs to be broken into several stages, including a data retrieval task at the data repository, a data movement task, and a data processing task at a computing site. Because of the volume of data involved and the amount of processing required, it is desirable for both the data repository and the computing site to be clusters. This can further complicate the development of such data processing applications. In this paper, we present a middleware, FREERIDE-G (framework for rapid implementation of datamining engines in grid), which supports a high-level interface for developing data mining and scientific data processing applications that involve data stored in remote repositories. In particular, we had the following goals in designing the FREERIDE-G middleware: 1) support high-end processing, i.e., use parallel configurations for both hosting the data and processing the data, 2) ease the use of parallel configurations, i.e., support a high-level API for specifying the processing, and 3) hide details of data movement and caching. We have evaluated our system using three popular data mining algorithms and two scientific data analysis applications. The main observations from our experiments are as follows. First, FREERIDE-G is able to scale the processing extremely well when the numbers of data server and compute nodes are scaled evenly. Second, when only the number of compute nodes is scaled, our target class of applications achieves modest additional speedups. Finally, for applications that involve multiple passes on the dataset, caching remote data provides significant improvement.
Keywords :
data mining; grid computing; middleware; natural sciences computing; very large databases; FREERIDE-G middleware; high-end processing; large geographically distributed scientific datasets; parallel configurations; remote data repositories mining; scientific data analysis; scientific data processing; Application software; Computational modeling; Computer science; Data analysis; Data engineering; Data mining; Data processing; Engines; Information retrieval; Middleware