• DocumentCode
    1998938
  • Title
    Performance Driven Partial Checkpoint/Migrate for LAM-MPI
  • Author
    Singh, Rajendra; Graham, Peter
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Manitoba, Winnipeg, MB
  • fYear
    2008
  • fDate
    9-11 June 2008
  • Firstpage
    110
  • Lastpage
    116
  • Abstract
    Using idle compute resources is cost-effective, and systems like Condor have successfully exploited such resources in limited contexts (e.g., bag-of-tasks problems). Increasingly, networks in large organizations are becoming more capable and, when combined with latency tolerance mechanisms, can now provide an attractive platform for running some cluster-based parallel programs. In environments where machines are shared, however, load guarantees cannot be made. If one or more machines running an application become overloaded, the performance of the entire application may suffer. This provides a strong motivation to be able to checkpoint and migrate processes to new machines. Such performance-driven migration normally involves the entire set of application processes. This, however, is wasteful both in terms of lost progress (if other processes can still execute) and overhead (since moving unnecessary processes is costly). To address these issues, we describe an extension of LAM/MPI that provides a partial checkpoint and migrate ability. Our system checkpoints only the subset of MPI processes that need to migrate. For long-running applications exhibiting moderate communication, this can enhance the usefulness of shared machines for "cluster" computing.
  • Keywords
    checkpointing; message passing; parallel programming; Condor; LAM-MPI; MPI processes; cluster based parallel programs; cluster computing; latency tolerance mechanisms; performance driven checkpoint; performance driven migration; Application software; Availability; Checkpointing; Computer networks; Computer science; Delay; High performance computing; Resource management; Signal processing; Throughput; Checkpoint; Cluster Computing; Grids; MPI; Migration;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing Systems and Applications, 2008. HPCS 2008. 22nd International Symposium on
  • Conference_Location
    Quebec City, Que.
  • ISSN
    1550-5243
  • Print_ISBN
    978-0-7695-3250-9
  • Type
    conf
  • DOI
    10.1109/HPCS.2008.16
  • Filename
    4556085