DocumentCode :
3042565
Title :
On the feasibility of incremental checkpointing for scientific computing
Author :
Sancho, J.C. ; Petrini, Fabrizio ; Johnson, Garth ; Frachtenberg, E.
Author_Institution :
Performance & Archit. Lab., Los Alamos Nat. Lab., NM, USA
fYear :
2004
fDate :
26-30 April 2004
Firstpage :
58
Abstract :
Summary form only given. In the near future, large-scale parallel computers will feature hundreds of thousands of processing nodes. In such systems, fault tolerance is critical, as failures will occur very often. Checkpointing and rollback recovery have been extensively studied as an attempt to provide fault tolerance. However, current implementations do not provide the total transparency and full flexibility that are necessary to support the new paradigm of autonomic computing - systems able to self-heal and self-repair. We provide an in-depth evaluation of incremental checkpointing for scientific computing. The experimental results, obtained on a state-of-the-art cluster running several scientific applications, show that efficient, scalable, automatic and user-transparent incremental checkpointing is within reach with current technology.
Keywords :
fault tolerance; parallel machines; system recovery; autonomic computing system; fault tolerance; in-depth evaluation; incremental checkpointing; large-scale parallel computer; rollback recovery; scientific computing; self-heal; self-repair; state-of-the-art cluster; Application software; Checkpointing; Computer networks; Concurrent computing; Costs; Fault tolerance; Hardware; High performance computing; Laboratories; Large-scale systems
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International
Print_ISBN :
0-7695-2132-0
Type :
conf
DOI :
10.1109/IPDPS.2004.1302982
Filename :
1302982