• DocumentCode
    560144
  • Title
    Evaluating the viability of process replication reliability for exascale systems
  • Author
    Ferreira, Kurt ; Stearley, Jon ; Laros, James H., III ; Oldfield, Ron ; Pedretti, Kevin ; Brightwell, Ron ; Riesen, Rolf ; Bridges, Patrick G. ; Arnold, Dorian
  • Author_Institution
    Scalable Syst. Software Dept., Sandia Nat. Labs., Albuquerque, NM, USA
  • fYear
    2011
  • fDate
    12-18 Nov. 2011
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    As high-end computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are increasingly problematic at these scales due to excessive overheads predicted to more than double an application's time to solution. Replicated computing techniques, particularly state machine replication, long used in distributed and mission critical systems, have been suggested as an alternative to checkpoint-restart. In this paper, we evaluate the viability of using state machine replication as the primary fault tolerance mechanism for upcoming exascale systems. We use a combination of modeling, empirical analysis, and simulation to study the costs and benefits of this approach in comparison to checkpoint/restart on a wide range of system parameters. These results, which cover different failure distributions, hardware mean time to failures, and I/O bandwidths, show that state machine replication is a potentially useful technique for meeting the fault tolerance demands of HPC applications on future exascale platforms.
  • Keywords
    checkpointing; distributed processing; fault tolerant computing; finite state machines; HPC applications; checkpoint-restart; distributed systems; exascale systems; failure distribution; fault tolerance mechanism; high-end computing machines; mission critical systems; process replication reliability; replicated computing techniques; state machine replication; Bandwidth; Computer crashes; Fault tolerance; Fault tolerant systems; Hardware; Protocols; Sockets;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for
  • Conference_Location
    Seattle, WA
  • Electronic_ISBN
    978-1-4503-0771-0
  • Type
    conf
  • Filename
    6114406