• DocumentCode
    2455801
  • Title

    HiAL-Ckpt: A hierarchical application-level checkpointing for CPU-GPU hybrid systems

  • Author

    Xu, Xinhai ; Lin, Yufei ; Tang, Tao ; Lin, Yisong

  • Author_Institution
    Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
  • fYear
    2010
  • fDate
    24-27 Aug. 2010
  • Firstpage
    1895
  • Lastpage
    1899
  • Abstract
    In light of its powerful computing capacity and high energy efficiency, the GPU (graphics processing unit) has become a focus of research in HPC (high performance computing). CPU-GPU heterogeneous parallel systems have become a new development trend for supercomputers. However, the inherent unreliability of GPU hardware degrades the reliability of supercomputers. We have studied fault-tolerance (FT) techniques for CPU-GPU heterogeneous parallel systems and introduce a new checkpointing mechanism for such systems: hierarchical application-level checkpointing. The basic idea of this mechanism is to checkpoint at two independent levels, the CPU level and the GPU level, to tolerate CPU and GPU faults respectively. Based on this idea, we have also designed and implemented a hierarchical application-level checkpointing tool, "HiAL-Ckpt". Using this tool, programmers can insert two kinds of directives, CPU directives and GPU directives, into a program, and the compiler transforms the directives into CPU or GPU checkpointing code according to their kind. Through a case study of SWIM, a benchmark from the SPEC2000 benchmark suite, we demonstrate the validity of the hierarchical application-level checkpointing technique. The experimental results show that the fault-tolerance time overhead of HiAL-Ckpt for SWIM is only 2.25% relative to the execution time of SWIM without any FT work.
  • Keywords
    benchmark testing; checkpointing; computer graphic equipment; coprocessors; fault tolerance; fault tolerant computing; multiprocessing systems; parallel processing; program compilers; CPU directive; CPU-GPU heterogeneous parallel system; GPU directive; HiAL-Ckpt; SWIM; checkpointing code; fault tolerance technique; graphics processing unit; hierarchical application level checkpointing; high performance computing; program compiler; spec2000 benchmark suite; supercomputer reliability; Checkpointing; Computational modeling; Fault tolerance; Fault tolerant systems; Graphics processing unit; Hardware; GPU; checkpointing; fault-tolerance; heterogeneous systems;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Science and Education (ICCSE), 2010 5th International Conference on
  • Conference_Location
    Hefei
  • Print_ISBN
    978-1-4244-6002-1
  • Type

    conf

  • DOI
    10.1109/ICCSE.2010.5593819
  • Filename
    5593819