• DocumentCode
    107068
  • Title
    Using Migratable Objects to Enhance Fault Tolerance Schemes in Supercomputers
  • Author
    Meneses, Esteban; Ni, Xiang; Zheng, Gengbin; Mendes, Celso L.; Kale, Laxmikant V.
  • Author_Institution
    Center for Simulation & Modeling, Univ. of Pittsburgh, Pittsburgh, PA, USA
  • Volume
    26
  • Issue
    7
  • fYear
    2015
  • fDate
    July 1, 2015
  • Firstpage
    2061
  • Lastpage
    2074
  • Abstract
    Supercomputers have grown exponentially in size over the last two decades. This growth rate is expected to bring us to exascale in the 2018-2022 timeframe. Achieving a productive exascale environment, however, requires addressing several key challenges, one of which is fault tolerance. Machines at extreme scale will experience frequent failures, and the system will have to avoid or overcome them. Various techniques have recently been developed to tolerate failures. The impact and scalability of these techniques can be substantially enhanced by a parallel programming model called migratable objects. In this paper, we demonstrate how the migratable-objects model facilitates and improves several fault tolerance approaches. Our experimental results on thousands of cores suggest that fault tolerance schemes based on migratable objects have low performance overhead and high scalability. Additionally, we present a performance model that predicts a significant benefit from using migratable objects to provide fault tolerance at extreme scale.
  • Keywords
    parallel programming; software fault tolerance; fault tolerance schemes; migratable objects; parallel programming model; supercomputers; Computational modeling; Fault tolerance; Fault tolerant systems; Protocols; Runtime; Sockets; Supercomputers; Migratable objects; checkpoint/restart; fault tolerance; message logging; resilience
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type
    jour
  • DOI
    10.1109/TPDS.2014.2342228
  • Filename
    6862914