• DocumentCode
    2705271
  • Title
    Coarse grain computation-communication overlap for efficient application-level checkpointing for GPUs
  • Author
    Solano-Quinde, Lizandro D. ; Bode, Brett M. ; Somani, Arun K.
  • Author_Institution
    Dept. of Electr. & Comput. Eng., Iowa State Univ., Ames, IA, USA
  • fYear
    2010
  • fDate
    20-22 May 2010
  • Firstpage
    1
  • Lastpage
    5
  • Abstract
    Graphics Processing Units (GPUs) are increasingly used to solve non-graphical scientific problems. However, it has been shown that the reliability of GPUs is a concern because of the occurrence of soft and hard errors. Checkpoint/restart is the most commonly used technique for achieving fault tolerance in the presence of failures. This work presents an application-level checkpoint scheme for systems composed of GPUs. Our scheme exploits the benefits of the divide-and-conquer technique and of communication-computation overlap to improve the execution time and checkpoint overhead. By dividing the problem and the checkpointing into n subprocesses, we show that our scheme improves the checkpoint overhead by a factor of n. We also show that dividing the problem with finer granularity is not beneficial.
  • Keywords
    computer graphic equipment; coprocessors; fault tolerant computing; GPU; coarse grain computation communication overlap; divide-and-conquer technique; efficient application level checkpointing; fault tolerance; graphics processing units; Checkpointing; Computer languages; Fault tolerant systems; Graphics processing unit; Instruction sets; Memory management; CUDA; Checkpoint; Tesla
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Electro/Information Technology (EIT), 2010 IEEE International Conference on
  • Conference_Location
    Normal, IL
  • ISSN
    2154-0357
  • Print_ISBN
    978-1-4244-6873-7
  • Type
    conf
  • DOI
    10.1109/EIT.2010.5612125
  • Filename
    5612125