DocumentCode
2194898
Title
Scientific Computing Autonomic Reliability Framework
Author
Dubey, Abhishek ; Neema, Sandeep ; Kowalkowski, Jim ; Singh, Amitoj
Author_Institution
Inst. for Software Integrated Syst., Vanderbilt Univ., Nashville, TN
fYear
2008
fDate
7-12 Dec. 2008
Firstpage
352
Lastpage
353
Abstract
To improve the reliability of scientific workflows, large scientific computing clusters require a distributed dependability subsystem that provides fault isolation and recovery and can learn from and predict failures. In this paper, we outline the key ideas in the design of a Scientific Computing Autonomic Reliability Framework (SCARF) for the large computing clusters used in the Lattice Quantum Chromodynamics project at Fermilab.
Keywords
fault diagnosis; fault tolerant computing; natural sciences computing; reliability; workflow management software; distributed dependability subsystem; fault isolation and recovery; scientific computing autonomic reliability framework; scientific workflows; Centralized control; Computer architecture; Condition monitoring; Engines; Environmental management; Fault diagnosis; Quantum computing; Resource management; Scientific computing; Software systems; Cluster Computing; Reliability; Software fault-tolerance; Workflows;
fLanguage
English
Publisher
IEEE
Conference_Titel
eScience, 2008. eScience '08. IEEE Fourth International Conference on
Conference_Location
Indianapolis, IN
Print_ISBN
978-1-4244-3380-3
Electronic_ISBN
978-0-7695-3535-7
Type
conf
DOI
10.1109/eScience.2008.113
Filename
4736792