• DocumentCode
    580105
  • Title
    Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems
  • Author
    Dong, Xiangyu ; Muralimanohar, Naveen ; Jouppi, N. ; Kaufmann, Richard ; Xie, Yuan
  • fYear
    2009
  • fDate
    14-20 Nov. 2009
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    The scalability of future massively parallel processing (MPP) systems is challenged by high failure rates. Current hard disk drive (HDD) checkpointing results in an overhead of 25% or more at the petascale. Because checkpoint frequency correlates directly with node count, novel techniques that can take more frequent checkpoints with minimal overhead are critical to implementing a reliable exascale system. In this work, we leverage the upcoming Phase-Change Random Access Memory (PCRAM) technology and propose a hybrid local/global checkpointing mechanism after a thorough analysis of MPP system failure rates and failure sources. We propose three variants of PCRAM-based hybrid checkpointing schemes (DIMM+HDD, DIMM+DIMM, and 3D+3D) to reduce the checkpoint overhead and offer a smooth transition from the conventional pure HDD checkpoint to the ideal 3D PCRAM mechanism. The proposed pure 3D PCRAM-based mechanism can ultimately take checkpoints with an overhead of less than 4% on a projected exascale system.
  • Keywords
    checkpointing; disc drives; hard discs; parallel processing; phase change memories; 3D PCRAM technology; 3D+3D; DIMM+DIMM; DIMM+HDD; HDD checkpointing; MPP system; PCRAM-based hybrid checkpointing; checkpoint frequency; checkpoint overhead; exascale system; failure rate; hard disk drive; massively parallel processing; node counts; phase-change random access memory
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing Networking, Storage and Analysis, Proceedings of the Conference on
  • Conference_Location
    Portland, OR
  • Type
    conf
  • DOI
    10.1145/1654059.1654117
  • Filename
    6375553