• DocumentCode
    2537320
  • Title
    Optimizing HPC Fault-Tolerant Environment: An Analytical Approach
  • Author
    Jin, Hui ; Chen, Yong ; Zhu, Huaiyu ; Sun, Xian-He
  • Author_Institution
    Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL, USA
  • fYear
    2010
  • fDate
    13-16 Sept. 2010
  • Firstpage
    525
  • Lastpage
    534
  • Abstract
    The increasingly large ensemble size of modern High-Performance Computing (HPC) systems has drastically increased the possibility of failures. Performance under failures, and its optimization, have become timely and important issues facing the HPC community. In this study, we propose an analytical model to predict application performance. The model characterizes the impact of coordinated checkpointing and system failures on application performance, considering factors including workload, the number of nodes, failure arrival rate, recovery cost, and checkpointing interval and overhead. Based on the model, we examine three parameters (the number of compute nodes, the checkpointing interval, and the number of spare nodes) to conduct a comprehensive study of performance optimization under failures. Performance scalability under failures is also studied to explore the room for performance improvement across different parameters. Experimental results from both synthetic and actual system failure logs confirm that the proposed model and optimization methodologies are effective and feasible.
  • Keywords
    checkpointing; fault tolerant computing; optimisation; HPC fault-tolerant environment; checkpointing interval parameter; compute nodes parameter; high-performance computing; optimization; performance optimization; performance scalability; spare nodes parameter; Checkpointing; Computational modeling; Equations; Estimation; Maintenance engineering; Optimization; Random variables; Checkpointing; Fault Tolerance; High-Performance Computing; Performance Optimization; Scalability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel Processing (ICPP), 2010 39th International Conference on
  • Conference_Location
    San Diego, CA
  • ISSN
    0190-3918
  • Print_ISBN
    978-1-4244-7913-9
  • Electronic_ISBN
    0190-3918
  • Type
    conf
  • DOI
    10.1109/ICPP.2010.80
  • Filename
    5599253