• DocumentCode
    228773
  • Title

    Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales

  • Author

    Sheng Di ; Bautista-Gomez, Leonardo ; Cappello, Franck

  • Author_Institution
    INRIA, Sophia-Antipolis, France
  • fYear
    2014
  • fDate
    16-21 Nov. 2014
  • Firstpage
    907
  • Lastpage
    918
  • Abstract
    Future extreme-scale systems are expected to experience different types of failures affecting applications at different failure scales, from transient uncorrectable memory errors in processes to massive system outages. In this paper, we propose a multilevel checkpoint model that takes into account uncertain execution scales (different numbers of processes/cores). The contribution is threefold: (1) we provide an in-depth analysis of why it is difficult to derive the optimal checkpoint intervals for different checkpoint levels and optimize the number of cores simultaneously; (2) we devise a novel method that can quickly obtain an optimized solution, the first successful attempt in multilevel checkpoint models with uncertain scales; and (3) we perform both large-scale real experiments and extreme-scale numerical simulation to validate the effectiveness of our design. The experiments confirm that our optimized solution outperforms other state-of-the-art solutions by 4.3-88% on wall-clock length.
  • Keywords
    checkpointing; multiprocessing systems; numerical analysis; optimisation; checkpoint levels; extreme-scale numerical simulation; extreme-scale systems; failure scales; massive system outages; multilevel checkpoint model; optimal checkpoint intervals; optimization; processes/cores; transient uncorrectable memory errors; uncertain execution scales; wall-clock length; Analytical models; Approximation algorithms; Computational modeling; Equations; Heating; Mathematical model; Optimization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for
  • Conference_Location
    New Orleans, LA
  • Print_ISBN
    978-1-4799-5499-5
  • Type

    conf

  • DOI
    10.1109/SC.2014.79
  • Filename
    7013061