Title :
Distributed Replica-Exchange Simulations on Production Environments Using SAGA and Migol
Author :
Luckow, André ; Jha, Shantenu ; Kim, Joohyun ; Merzky, Andre ; Schnor, Bettina
Author_Institution :
Inst. of Comput. Sci., Potsdam Univ., Potsdam, Germany
Abstract :
For a class of scientific applications, utilizing distributed resources is critical for reducing the time-to-solution. In this paper, we discuss one such class of applications - Replica-Exchange simulations - where the orchestration of many distributed jobs in a dynamic and inherently unreliable distributed environment is essential for successful completion. We describe the design, development and deployment of a unique framework for constructing fault-tolerant distributed simulations. The framework consists of two primary components - SAGA and Migol. SAGA is a high-level programmatic abstraction layer that provides a standardised interface for the primary distributed functionality required for application development. We present details of newly developed functionality in SAGA - the Checkpoint and Recovery (CPR) API. Migol is an adaptive middleware that supports fault tolerance in distributed applications by transparently recovering them from checkpoint files. In addition to describing the integration of SAGA-CPR with the Migol infrastructure, we outline our experiences with running a large-scale, general-purpose Replica-Exchange application in a production distributed environment.
Keywords :
checkpointing; data structures; middleware; software fault tolerance; API; Migol; SAGA; adaptive middleware; distributed replica-exchange simulations; fault-tolerant distributed simulations; high-level programmatic abstraction; production environments; Application software; Computational modeling; Computer science; Computer simulation; Distributed computing; Fault tolerance; Middleware; Packaging; Production; USA Councils; Fault-Tolerance; Grid Computing; Replica-Exchange
Conference_Title :
eScience, 2008. eScience '08. IEEE Fourth International Conference on
Conference_Location :
Indianapolis, IN
Print_ISBN :
978-1-4244-3380-3
Electronic_ISBN :
978-0-7695-3535-7
DOI :
10.1109/eScience.2008.20