• DocumentCode
    2194252
  • Title
    Distributed Replica-Exchange Simulations on Production Environments Using SAGA and Migol
  • Author
    Luckow, André ; Jha, Shantenu ; Kim, Joohyun ; Merzky, Andre ; Schnor, Bettina
  • Author_Institution
    Inst. of Comput. Sci., Potsdam Univ., Potsdam, Germany
  • fYear
    2008
  • fDate
    7-12 Dec. 2008
  • Firstpage
    253
  • Lastpage
    260
  • Abstract
    There exists a class of scientific applications for which utilizing distributed resources is critical for reducing the time-to-solution. In this paper, we discuss a specific class of applications - Replica-Exchange simulations - where the orchestration of many distributed jobs in a dynamic and inherently unreliable distributed environment is essential for successful completion. We describe the design, development and deployment of a unique framework for constructing fault-tolerant distributed simulations. The framework consists of two primary components - SAGA and Migol. SAGA is a high-level programmatic abstraction layer that provides a standardised interface for the primary distributed functionality required for application development. We present details of newly developed functionality in SAGA - the Checkpoint and Recovery (CPR) API. Migol is adaptive middleware that supports the fault-tolerance of distributed applications by providing the capability to recover applications from checkpoint files transparently. In addition to describing the integration of SAGA-CPR with the Migol infrastructure, we outline our experiences with running a large-scale, general-purpose Replica-Exchange application in a production distributed environment.
  • Keywords
    checkpointing; data structures; middleware; software fault tolerance; API; Migol; SAGA; adaptive middleware; distributed replica-exchange simulations; fault-tolerant distributed simulations; high-level programmatic abstraction; production environments; Application software; Computational modeling; Computer science; Computer simulation; Distributed computing; Fault tolerance; Middleware; Packaging; Production; USA Councils; Fault-Tolerance; Grid Computing; Migol; Replica-Exchange; SAGA;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    eScience, 2008. eScience '08. IEEE Fourth International Conference on
  • Conference_Location
    Indianapolis, IN
  • Print_ISBN
    978-1-4244-3380-3
  • Electronic_ISBN
    978-0-7695-3535-7
  • Type
    conf
  • DOI
    10.1109/eScience.2008.20
  • Filename
    4736765