• DocumentCode
    2503075
  • Title

    FREERIDE-G: Supporting Applications that Mine Remote

  • Author

    Glimcher, Leonid ; Jin, Ruoming ; Agrawal, Gagan

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH
  • fYear
    2006
  • fDate
    14-18 Aug. 2006
  • Firstpage
    109
  • Lastpage
    118
  • Abstract
    Analysis of large geographically distributed scientific datasets, also referred to as distributed data-intensive science, has emerged as an important area in recent years. An application that processes data from a remote repository needs to be broken into several stages, including a data retrieval task at the data repository, a data movement task, and a data processing task at a computing site. Because of the volume of data involved and the amount of processing required, it is desirable for both the data repository and the computing site to be clusters. This can further complicate the development of such data processing applications. In this paper, we present a middleware, FREERIDE-G (framework for rapid implementation of datamining engines in grid), which supports a high-level interface for developing data mining and scientific data processing applications that involve data stored in remote repositories. In particular, we had the following goals in designing the FREERIDE-G middleware: 1) support high-end processing, i.e., use parallel configurations for both hosting the data and processing the data, 2) ease the use of parallel configurations, i.e., support a high-level API for specifying the processing, and 3) hide the details of data movement and caching. We have evaluated our system using three popular data mining algorithms and two scientific data analysis applications. The main observations from our experiments are as follows. First, FREERIDE-G is able to scale the processing extremely well when the numbers of data server and compute nodes are scaled evenly. Second, when only the number of compute nodes is scaled, our target class of applications achieves modest additional speedups. Finally, for applications that involve multiple passes on the dataset, caching remote data provides significant improvement.
  • Keywords
    data mining; grid computing; middleware; natural sciences computing; very large databases; FREERIDE-G middleware; high-end processing; large geographically distributed scientific datasets; parallel configurations; remote data repositories mining; scientific data analysis; scientific data processing; Application software; Computational modeling; Computer science; Data analysis; Data engineering; Data mining; Data processing; Engines; Information retrieval; Middleware
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel Processing, 2006. ICPP 2006. International Conference on
  • Conference_Location
    Columbus, OH
  • ISSN
    0190-3918
  • Print_ISBN
    0-7695-2636-5
  • Type

    conf

  • DOI
    10.1109/ICPP.2006.44
  • Filename
    1690611