DocumentCode :
2294072
Title :
Adaptive and Fault Tolerant Simulation of Relativistic Particle Transport with Data-Level Checkpointing
Author :
Li, Ruipeng ; Jiang, Hai ; Su, Hung-Chi ; Zhang, Bin ; Jenness, Jeff
Author_Institution :
Dept. of Comput. Sci., Arkansas State Univ., Jonesboro, AR
fYear :
2008
fDate :
16-18 July 2008
Firstpage :
345
Lastpage :
352
Abstract :
Many scientific applications place high demands on memory storage and computing capability. Improvements in commodity processors and networks have provided an opportunity to support such scientific applications within an everyday computing infrastructure. Good applications need the ability to work in constantly changing environments. Adaptability and fault tolerance are essential. Based on simulation of relativistic particle transport, this paper proposes a data-level checkpointing scheme for common scientific applications. This scheme takes advantage of the regular program layout, dominant computing loops, and fine-grained iterations. Without handling stack and heap segments directly, only application data is saved and restored as the computation state. The checkpointing interval can be dynamically adjusted to satisfy sensitivity and efficiency requirements for feasible fault tolerance. With this periodic but fixed-location checkpointing scheme, the MPI-based simulation system can be reconfigured by being shut down first and then restarted on the same or different computer clusters. Application data can be redistributed for the new configuration. Experimental results have demonstrated this scheme's efficiency and effectiveness.
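The idea in the abstract can be illustrated with a minimal single-process sketch. It is not the paper's MPI implementation: the function names (`save_checkpoint`, `load_checkpoint`, `run`, `partition`), the toy particle update, and the use of `pickle` are all assumptions made for illustration. It shows only application data being saved at one fixed point in the main loop, an adjustable interval, a restart from the last checkpoint, and a simple redistribution of data for a different worker count.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Save only application data; write to a temp file, then rename atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved application data, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, stop_after=None):
    """Toy dominant loop with a fixed-location checkpoint at the end of an iteration.

    `interval` is the (adjustable) number of iterations between checkpoints.
    `stop_after` is a hypothetical knob that simulates a shutdown after that
    many iterations of this invocation.
    """
    state = load_checkpoint(path) or {"step": 0, "positions": [0.0, 1.0, 2.0]}
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed == stop_after:
            return state  # simulated shutdown; in-memory state is discarded by the caller
        # Stand-in for one fine-grained iteration of the real transport computation.
        state["positions"] = [x + 0.5 for x in state["positions"]]
        state["step"] += 1
        executed += 1
        # Fixed checkpoint location: always here, never mid-iteration.
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state


def partition(items, n_workers):
    """Round-robin redistribution of application data for a new configuration."""
    return [items[rank::n_workers] for rank in range(n_workers)]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    run(ckpt, total_steps=10, interval=3, stop_after=7)  # stops after step 7
    print(load_checkpoint(ckpt)["step"])                 # last saved step
    final = run(ckpt, total_steps=10, interval=3)        # restart from checkpoint
    print(final["step"], partition(final["positions"], 2))
```

Because the checkpoint holds plain application data rather than stack or heap images, the restart does not depend on the process layout that produced it, which is what makes redistribution across a different number of workers straightforward in this sketch.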
Keywords :
application program interfaces; checkpointing; digital simulation; fault tolerant computing; iterative methods; message passing; natural sciences computing; MPI-based simulation system; adaptive simulation; commodity processors; computer clusters; computing capability; data-level checkpointing; dominant computing loops; fault tolerant simulation; fine-grained iterations; fixed-location checkpointing scheme; memory storage; regular program layout; relativistic particle transport; scientific applications; Application software; Checkpointing; Computational modeling; Computer crashes; Computer networks; Computer simulation; Distributed computing; Fault tolerance; Physics computing; Plasma simulation; Checkpointing; Fault Tolerance; Reconfiguration; Relativistic Particle Transport; Simulation;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Science and Engineering, 2008. CSE '08. 11th IEEE International Conference on
Conference_Location :
Sao Paulo
Print_ISBN :
978-0-7695-3193-9
Type :
conf
DOI :
10.1109/CSE.2008.54
Filename :
4578252