• DocumentCode
    1925329
  • Title
    Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm
  • Author
    Ni, Xiang ; Meneses, Esteban ; Kalé, Laxmikant V.
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
  • fYear
    2012
  • fDate
    24-28 Sept. 2012
  • Firstpage
    364
  • Lastpage
    372
  • Abstract
    The HPC community has seen a steady increase in the number of components in every generation of supercomputers. Assembling a large number of components into a single cluster makes a machine more powerful, but also much more prone to failures. Therefore, fault tolerance has become a major concern in HPC. To deal with node crashes in large systems, checkpoint/restart is by far the preferred method. A typical way to implement checkpoints is by using a blocking algorithm, which suspends the execution of the application while the checkpoint is safely stored. One limitation of the blocking algorithm is that it saturates the network bandwidth at the time of the checkpoint. This problem will become even more critical because the projected increase in network bandwidth will not match the increase in memory per node. To alleviate this problem, we have developed a semi-blocking checkpoint algorithm that overlaps execution of the application with transmission of checkpoints. Our implementation decomposes a checkpoint into small messages that are interleaved with application messages. The experimental results show a dramatic reduction in the checkpoint overhead for various applications. We present a model for our approach and use this model to compute the benefit of the semi-blocking algorithm for different failure rates predicted at Exascale. We estimate that our method can reduce the total execution time of an iterative scientific application by up to 22%.
  • Keywords
    checkpointing; distributed processing; fault tolerant computing; natural sciences computing; HPC applications; checkpoint overhead hiding; checkpoint-restart method; exascale; fault tolerance; iterative scientific application; node crash; semiblocking checkpointing algorithm; supercomputers; Bandwidth; Checkpointing; Computational modeling; Interference; Message systems; Protocols; Synchronization; SSD; adaptive runtime system; checkpoint/restart; fault tolerance; semi-blocking algorithm;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing (CLUSTER), 2012 IEEE International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4673-2422-9
  • Type
    conf
  • DOI
    10.1109/CLUSTER.2012.82
  • Filename
    6337799