DocumentCode :
2313262
Title :
High Performance Computational Grids Fault Tolerance at System Level
Author :
Mujumdar, Manik ; Bheevgade, Meenakshi ; Malik, Latesh ; Patrikar, Rajendra
Author_Institution :
G.H. Raisoni Coll. of Eng., Nagpur
fYear :
2008
fDate :
16-18 July 2008
Firstpage :
379
Lastpage :
383
Abstract :
Many complex scientific and mathematical applications take a long time to complete. Parallelization is widely used to address this, and distributing an application across several machines is one of the key aspects of grid computing. This paper focuses on a checkpoint/restart mechanism used to overcome the problem of job suspension at a failed node in a computational grid. The ability to checkpoint a running application and restart it later provides many benefits: fault recovery by rolling back an application to a previous checkpoint, advanced resource sharing, better application response time by restarting applications from checkpoints instead of from scratch, improved system utilization, efficient high-performance computing, and improved service availability.
Keywords :
fault tolerant computing; grid computing; check point-restart mechanism; fault recovery; fault tolerance; high performance computational grids; job suspension; parallelization; Access protocols; Application software; Concurrent computing; Distributed computing; Fault tolerant systems; Grid computing; High energy physics instrumentation computing; High performance computing; Pervasive computing; Resource management; Checkpoint/Restart; Cluster; Computational Grid; Fault Tolerance; High performance Computing
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Emerging Trends in Engineering and Technology, 2008. ICETET '08. First International Conference on
Conference_Location :
Nagpur, Maharashtra
Print_ISBN :
978-0-7695-3267-7
Electronic_ISBN :
978-0-7695-3267-7
Type :
conf
DOI :
10.1109/ICETET.2008.21
Filename :
4579928