Title :
Utilizing the Multi-threading Techniques to Improve the Two-Level Checkpoint/Rollback System for MPI Applications
Author :
Tang, Yuan ; Zhang, Yunquan
Author_Institution :
Software Sch., Fudan Univ., Shanghai
Abstract :
With the increasing number of processors in modern HPC (high performance computing) systems, two problems have become pressing: scalability and fault tolerance. In our previous work, we proposed an MPI operation level checkpoint/rollback system. The main benefits of the system are that it offers the opportunity to employ in-memory (disk-less) checkpoint/rollback techniques, which have demonstrated much better performance than their on-disk counterparts, and the opportunity to build a concurrent two-level recover-and-continue MPI system, which has been proven to be highly efficient. To the best of our knowledge, this is the first concurrent two-level checkpoint/recovery system in use. With the arrival of the multi-core era, it is time to use multi-threading techniques to improve the performance of the in-memory checkpointing algorithm. In this paper, we present two versions of the MPI operation level checkpoint/rollback system, one single-threaded and the other multi-threaded. We also provide an in-depth performance comparison of the two approaches to illustrate the benefits of multi-threading techniques on multi-core platforms. As our work progresses, a picture of the hierarchy of future-generation fault-tolerant HPC systems is gradually emerging.
Keywords :
application program interfaces; checkpointing; message passing; multi-threading; software fault tolerance; MPI system; fault tolerance; high performance computing systems; in-memory checkpointing algorithm; multicore platform; multithreading techniques; scalability; two-level checkpoint-rollback system; Checkpointing; Communication system software; Costs; Fault tolerance; Fault tolerant systems; High performance computing; Parallel processing; Parallel programming; Scalability; Software performance; FT-MPI; Recover-and-continue; checkpoint; multi-core; multi-threading programming model; rollback; stop-and-restart;
Conference_Title :
High Performance Computing and Communications, 2008. HPCC '08. 10th IEEE International Conference on
Conference_Location :
Dalian
Print_ISBN :
978-0-7695-3352-0
DOI :
10.1109/HPCC.2008.58