DocumentCode :
169061
Title :
Improving an MPI Application-Level Migration Approach through Checkpoint File Splitting
Author :
Rodriguez, M. ; Cores, Ivan ; Gonzalez, P. ; Martin, Maria J.
Author_Institution :
Comput. Archit. Group, Univ. of A Coruna, A Coruna, Spain
fYear :
2014
fDate :
22-24 Oct. 2014
Firstpage :
33
Lastpage :
40
Abstract :
Traditionally used for load balancing, process migration has been gaining popularity in the fault tolerance context. Recently, checkpoint-based migration has been proposed to implement failure avoidance in MPI applications through the proactive migration of processes when impending failures are notified. However, the main drawback of checkpoint-based migration in these scenarios is its high I/O cost, which may make it unfeasible if the migration operation is not completed before the failure arises. To overcome this issue, this work proposes to split the checkpoint files of an application-level migration approach into multiple smaller files to overlap the different phases of the migration operation: checkpoint file writing in the terminating process, data transfer through the network, and state file read and restart operations in the newly spawned processes. The proposal has been tested using the MPI NAS Parallel Benchmarks. The experimental results show a significant reduction in the migration time.
Keywords :
checkpointing; message passing; program verification; resource allocation; software fault tolerance; MPI NAS parallel benchmarks; MPI application-level migration; MPI applications; checkpoint file splitting; checkpoint file writing; checkpoint-based migration; data transferring; failure avoidance; fault tolerance context; high I/O cost; load balancing; migration operation; migration time reduction; proactive migration; process migration; terminating process; Benchmark testing; Checkpointing; Computer architecture; Fault tolerance; Fault tolerant systems; Proposals; Writing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Architecture and High Performance Computing (SBAC-PAD), 2014 IEEE 26th International Symposium on
Conference_Location :
Jussieu
ISSN :
1550-6533
Type :
conf
DOI :
10.1109/SBAC-PAD.2014.25
Filename :
6970644