Title :
Hierarchical replication techniques to ensure checkpoint storage reliability in grid environment
Author :
Bouabache, Fatiha ; Herault, Thomas ; Fedak, Gilles ; Cappello, Franck
Author_Institution :
Univ. Paris Sud-XI, Orsay
Date :
March 31 2008-April 4 2008
Abstract :
High performance computing plays an important role in scientific and engineering research. As the size of high performance systems increases continuously, the average time between failures keeps shrinking, so fault tolerance becomes a critical property for parallel applications running on these systems. The MPI (Message Passing Interface) paradigm is currently the most widely used for writing parallel applications. However, in traditional implementations, when a failure occurs, the whole distributed application is shut down and restarted. To avoid this, many solutions have been proposed, of which the most widely used is rollback recovery. Rollback recovery is based on the concept of a checkpoint. A checkpoint describes the state of one or more components of the system at a given point in its execution.
Keywords :
application program interfaces; checkpointing; fault tolerance; grid computing; message passing; checkpoint storage reliability; grid environment; hierarchical replication technique; high performance computing; message passing interface; rollback recovery; Communication channels; Fault tolerant systems; Frequency; High performance computing; Image storage; Message passing; Protocols; Reliability engineering; Switches; Topology
Conference_Title :
Computer Systems and Applications, 2008. AICCSA 2008. IEEE/ACS International Conference on
Conference_Location :
Doha
Print_ISBN :
978-1-4244-1967-8
Electronic_ISBN :
978-1-4244-1968-5
DOI :
10.1109/AICCSA.2008.4493654