• DocumentCode
    3588936
  • Title

    Checksumming Strategies for Data in Volatile Memories

  • Author

    Arafat, Humayun ; Krishnamoorthy, Sriram ; Sadayappan, P.

  • Author_Institution
    Dept. Comp. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
  • fYear
    2014
  • Firstpage
    245
  • Lastpage
    254
  • Abstract
    The increase in the number of processors needed to build exascale systems implies that the mean time between failures will further decrease, making it increasingly important to develop scalable techniques for fault tolerance. In this paper we develop an efficient checksum-based approach to fault tolerance for data in volatile memory systems, i.e., an approach that does not need to save any data on stable persistent storage. The developed scheme is applicable in multiple scenarios, including: 1) online recovery of large read-only data structures from the memory of failed nodes, with very low storage overhead, 2) online recovery from soft errors in blocked data, and 3) online recovery of read/write data via in-memory checkpointing. The approach uses a logical multi-dimensional view of the data to be protected. Changing the dimensionality of the data view enables a trade-off between multiple factors, including the storage overhead, the checksum generation time, the failure recovery time, and the number of faults that can be tolerated. Experimental results demonstrating the effectiveness of the approach are presented on a Cray XE6 system.
  • Keywords
    checkpointing; data handling; data structures; fault tolerant computing; storage management; Cray XE6 system; blocked data; checksum generation time; data checksumming strategies; data protection; data view dimensionality; exascale system; failed node memory; failure recovery time; fault tolerance; in-memory checkpointing; large read-only data structure recovery; logical multidimensional view; online soft error recovery; read-write data; storage overhead; volatile memory system; Computational modeling; Data structures; Fault tolerance; Fault tolerant systems; Indexes; Mathematical model; Program processors; Data; Fault Tolerance; In memory checkpoint;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel Processing Workshops (ICPPW), 2014 43rd International Conference on
  • ISSN
    1530-2016
  • Type

    conf

  • DOI
    10.1109/ICPPW.2014.41
  • Filename
    7103459
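
The abstract's idea of protecting data through a logical multi-dimensional view can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a two-dimensional view, equal-sized byte blocks, and plain XOR parity as the checksum, and the names `xor_blocks`, `build_checksums`, and `recover` are hypothetical helpers chosen for illustration only.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks into one block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def build_checksums(grid):
    """Return one XOR checksum per row and per column of a 2-D block grid.

    Assumption: XOR parity stands in for whatever checksum the paper uses.
    """
    row_sums = [xor_blocks(row) for row in grid]
    col_sums = [xor_blocks(col) for col in zip(*grid)]
    return row_sums, col_sums

def recover(line, checksum, missing):
    """Rebuild the block at index `missing` from the rest of its row or column."""
    survivors = [blk for i, blk in enumerate(line) if i != missing]
    return xor_blocks(survivors + [checksum])

# A 2 x 3 logical view of six data blocks (think: one block per node).
grid = [[b"aaaa", b"bbbb", b"cccc"],
        [b"dddd", b"eeee", b"ffff"]]
row_sums, col_sums = build_checksums(grid)

# Simulate losing the block at row 1, column 2.
grid[1][2] = None

# The lost block can be rebuilt along either dimension of the view.
from_row = recover(grid[1], row_sums[1], 2)
from_col = recover([row[2] for row in grid], col_sums[2], 1)
```

In this sketch an r x c view stores r + c checksum blocks for r * c data blocks, and each lost block has two independent recovery paths. Reshaping the view changes both the checksum storage and how many blocks must be read to rebuild one, which is the kind of trade-off the abstract describes.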