• DocumentCode
    2570122
  • Title

    Evaluation of checkpoint mechanisms for massively parallel machines

  • Author

    Chiueh, Tzi-cker ; Deng, Peitao

  • Author_Institution
    Dept. of Comput. Sci., State Univ. of New York, Stony Brook, NY, USA
  • fYear
    1996
  • fDate
    25-27 Jun 1996
  • Firstpage
    370
  • Lastpage
    379
  • Abstract
    Massively parallel machines typically contain thousands of processor units and are therefore more likely to suffer system breakdown because of component failures. This paper studies efficient diskless checkpointing mechanisms for SIMD massively parallel machines. Three checkpointing schemes (mirror checkpointing, parity checkpointing, and partial parity checkpointing) are compared in terms of their checkpoint performance and storage overheads, based on empirical measurements. The mirror checkpointing and parity checkpointing schemes have been successfully implemented and tested on a DECmpp 12000 machine, without hardware or OS modifications. It is shown that mirror checkpointing is an order of magnitude faster than parity checkpointing but incurs twice as much storage overhead. Partial parity checkpointing, although it significantly reduces the storage overhead, could lead to unpredictable execution performance. This paper also examines the detailed storage/performance tradeoffs for partial parity checkpointing through manual instrumentation, and describes the implementation experience from these experiments.
  • Keywords
    DEC computers; fault tolerant computing; parallel algorithms; parallel machines; performance evaluation; system recovery; DECmpp 12000 machine; SIMD; checkpoint performance; checkpointing schemes; component failure; diskless checkpointing; massively parallel machines; mirror checkpointing; parity checkpointing; partial parity checkpointing; storage overhead; storage performance tradeoffs; system breakdown; unpredictable execution performance; Batteries; Checkpointing; Computer science; Concurrent computing; Electric breakdown; Hardware; Instruments; Mirrors; Parallel machines; Testing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fault Tolerant Computing, 1996., Proceedings of Annual Symposium on
  • Conference_Location
    Sendai
  • ISSN
    0731-3071
  • Print_ISBN
    0-8186-7262-5
  • Type

    conf

  • DOI
    10.1109/FTCS.1996.534622
  • Filename
    534622