• DocumentCode
    2322045
  • Title

    Enabling Application Resilience with and without the MPI Standard

  • Author

    Bland, Wesley

  • Author_Institution
    Innovative Comput. Lab., Univ. of Tennessee, Knoxville, TN, USA
  • fYear
    2012
  • fDate
    13-16 May 2012
  • Firstpage
    746
  • Lastpage
    751
  • Abstract
    As recent research has demonstrated, it is becoming a necessity for large-scale applications to have the ability to tolerate process failure during execution. As the number of processes increases, checkpoint/restart fault tolerance approaches requiring large concurrent state checkpointing become untenable, and radically new methods to address fault tolerance are needed. This work addresses these challenges by proposing a novel approach to a minimalistic fault discovery and management model. Such a model allows applications to run to completion despite fail-stop failures. As a proof of concept, in addition to the proposed fault tolerance model, an implementation in the context of the Open MPI library is provided, evaluated, and analyzed.
  • Keywords
    application program interfaces; checkpointing; fault tolerant computing; message passing; MPI standard; Open MPI library; application resilience; concurrent state checkpointing; fail-stop failures; fault tolerance approach; message passing interface; minimalistic fault discovery; proof of concept; runtime process failure; Fault tolerance; Fault tolerant systems; Libraries; Routing; Runtime; Standards; Topology; Distributed Runtime; Fault Tolerance; Message Passing Interface;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on
  • Conference_Location
    Ottawa, ON
  • Print_ISBN
    978-1-4673-1395-7
  • Type

    conf

  • DOI
    10.1109/CCGrid.2012.25
  • Filename
    6217505