• DocumentCode
    442932
  • Title
    Process migration for MPI applications based on coordinated checkpoint
  • Author
    Cao, Jiannong ; Li, Yinghao ; Guo, Minyi
  • Author_Institution
    Dept. of Comput., Hong Kong Polytech. Univ., Hong Kong, China
  • Volume
    1
  • fYear
    2005
  • fDate
    20-22 July 2005
  • Firstpage
    306
  • Abstract
    Much research has addressed fault tolerance for MPI applications, some of it on checkpoint/restart and some on network fault tolerance. Process migration, however, has not gained widespread use, because every other process in the application must learn the new location of a migrated process, which adds complexity. Here we present a simple yet effective method of process migration based on coordinated checkpointing of MPI applications. Migration is achieved by checkpointing the application, modifying the process location information in the checkpoint files, and restarting the application. Checkpoint/restart and migration are transparent to MPI applications. Performance evaluation results show that the additional checkpoint/restart capability has little impact on application performance and that the migration method scales well to a large number of nodes.
  • Keywords
    application program interfaces; checkpointing; message passing; software fault tolerance; software performance evaluation; MPI applications; checkpoint files; coordinated checkpoint; network fault-tolerance; performance evaluation; process location information modification; process migration; Application software; Buildings; Checkpointing; Cities and towns; Computer architecture; Computer networks; Libraries; Message passing; Parallel processing; Prototypes; MPI; checkpoint/restart; coordinated checkpoint; process migration
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems, 2005. Proceedings. 11th International Conference on
  • ISSN
    1521-9097
  • Print_ISBN
    0-7695-2281-5
  • Type
    conf
  • DOI
    10.1109/ICPADS.2005.241
  • Filename
    1531143