• DocumentCode
    2515271
  • Title

    Low-overhead diskless checkpoint for hybrid computing systems

  • Author

    Gomez, Leonardo Bautista ; Nukada, Akira ; Maruyama, Naoya ; Cappello, Franck ; Matsuoka, Satoshi

  • Author_Institution
    Tokyo Inst. of Technol., Tokyo, Japan
  • fYear
    2010
  • fDate
    19-22 Dec. 2010
  • Firstpage
    1
  • Lastpage
    10
  • Abstract
    As the size of new supercomputers scales to tens of thousands of sockets, the mean time between failures (MTBF) is decreasing to just several hours and long executions need some kind of fault tolerance method to survive failures. Checkpoint-Restart is a popular technique used for this purpose, but writing the state of a big scientific application to remote storage will become prohibitively expensive in the near future. Diskless checkpoint was proposed as a solution to avoid the I/O bottleneck of disk-based checkpoint. However, the complex time-consuming encoding techniques hinder its scalability. At the same time, heterogeneous computing is becoming more and more popular in high performance computing (HPC), with new clusters combining CPUs and graphics processing units (GPUs). However, hybrid applications cannot always use all the resources available on the nodes, leaving some idle resources such as GPUs or CPU cores. In this work, we propose a hybrid diskless checkpoint (HDC) technique for GPU-accelerated clusters that can checkpoint CPU/GPU applications, does not require spare nodes and can tolerate up to 50% of process failures with a low, sometimes negligible, checkpoint overhead.
  • Keywords
    checkpointing; computer graphic equipment; coprocessors; fault tolerant computing; software fault tolerance; CPU; GPU-accelerated cluster; HDC technique; MTBF; checkpoint-restart; fault tolerance method; hybrid computing system; low-overhead diskless checkpoint; mean time between failure; supercomputer; Computer architecture; Encoding; Fault tolerance; Fault tolerant systems; Graphics processing unit; Reed-Solomon codes;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing (HiPC), 2010 International Conference on
  • Conference_Location
    Dona Paula
  • Print_ISBN
    978-1-4244-8518-5
  • Electronic_ISBN
    978-1-4244-8519-2
  • Type

    conf

  • DOI
    10.1109/HIPC.2010.5713163
  • Filename
    5713163