• DocumentCode
    2244029
  • Title
    Hybrid Checkpointing for MPI Jobs in HPC Environments
  • Author
    Wang, Chao ; Mueller, Frank ; Engelmann, Christian ; Scott, Stephen L.
  • Author_Institution
    Dept. of Comput. Sci., North Carolina State Univ., Raleigh, NC, USA
  • fYear
    2010
  • fDate
    8-10 Dec. 2010
  • Firstpage
    524
  • Lastpage
    533
  • Abstract
    As the core count in high-performance computing systems keeps increasing, faults are becoming commonplace. Checkpointing addresses such faults but captures full process images even though only a subset of the process image changes between checkpoints. We have designed a hybrid checkpointing technique for MPI tasks of high-performance applications. This technique alternates between full and incremental checkpoints: at incremental checkpoints, only data changed since the last checkpoint is captured. Our implementation integrates new BLCR and LAM/MPI features that complement traditional full checkpoints. This results in significantly reduced checkpoint sizes and overheads with only moderate increases in restart overhead. After accounting for cost and savings, benefits due to incremental checkpoints are an order of magnitude larger than overheads on restarts. We further derive qualitative results indicating an optimal balance between full and incremental checkpoints of our novel approach at a ratio of 1:9, which outperforms both always-full and always-incremental checkpointing.
  • Keywords
    application program interfaces; checkpointing; BLCR; LAM-MPI; MPI jobs; high performance computing systems; hybrid checkpointing; Checkpoint/Restart; Fault Tolerance; High-Performance Computing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems (ICPADS), 2010 IEEE 16th International Conference on
  • Conference_Location
    Shanghai
  • ISSN
    1521-9097
  • Print_ISBN
    978-1-4244-9727-0
  • Electronic_ISBN
    1521-9097
  • Type
    conf
  • DOI
    10.1109/ICPADS.2010.48
  • Filename
    5695644
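
The alternation between full and incremental checkpoints described in the abstract can be illustrated with a minimal sketch. This is not the authors' BLCR/LAM-MPI implementation: the function names `checkpoint_schedule` and `restart_chain` are hypothetical, and the reading of the 1:9 ratio as "one full checkpoint followed by nine incremental ones, then repeat" is an assumption. The restart helper reflects the general property of incremental checkpointing that a restore needs the latest full image plus every incremental taken after it, which is one plausible source of the moderate restart overhead the abstract mentions.

```python
from typing import List


def checkpoint_schedule(total: int, incrementals_per_full: int = 9) -> List[str]:
    """Return the kind of each checkpoint in a hybrid schedule.

    Assumption: a 1:9 ratio means one full checkpoint followed by nine
    incremental ones, after which the cycle repeats.
    """
    if total < 0 or incrementals_per_full < 0:
        raise ValueError("arguments must be non-negative")
    cycle = incrementals_per_full + 1
    return ["full" if i % cycle == 0 else "incremental" for i in range(total)]


def restart_chain(schedule: List[str], index: int) -> List[int]:
    """Indices of the checkpoints needed to restore the state at `index`:
    the most recent full checkpoint plus every incremental after it."""
    start = index
    # Walk back to the nearest full checkpoint; index 0 is always full.
    while schedule[start] != "full":
        start -= 1
    return list(range(start, index + 1))


if __name__ == "__main__":
    schedule = checkpoint_schedule(20)
    print(schedule[:11])
    print(restart_chain(schedule, 13))
```

In this sketch, restoring from checkpoint 13 would read the full image at index 10 and then apply the incrementals at 11, 12, and 13 in order, so a longer run of incrementals shrinks checkpoint sizes but lengthens the restart chain.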