• DocumentCode
    2194898
  • Title

    Scientific Computing Autonomic Reliability Framework

  • Author

    Dubey, Abhishek ; Neema, Sandeep ; Kowalkowski, Jim ; Singh, Amitoj

  • Author_Institution
    Inst. for Software Integrated Syst., Vanderbilt Univ., Nashville, TN
  • fYear
    2008
  • fDate
    7-12 Dec. 2008
  • Firstpage
    352
  • Lastpage
    353
  • Abstract
    Large scientific computing clusters require a distributed dependability subsystem that can provide fault isolation and recovery and is capable of learning and predicting failures, to improve the reliability of scientific workflows. In this paper, we outline the key ideas in the design of a Scientific Computing Autonomic Reliability Framework (SCARF) for large computing clusters used in the Lattice Quantum Chromodynamics project at Fermilab.
  • Keywords
    fault diagnosis; fault tolerant computing; natural sciences computing; reliability; workflow management software; distributed dependability subsystem; fault isolation and recovery; scientific computing autonomic reliability framework; scientific workflows; Centralized control; Computer architecture; Condition monitoring; Engines; Environmental management; Fault diagnosis; Quantum computing; Resource management; Scientific computing; Software systems; Cluster Computing; Reliability; Software fault-tolerance; Workflows;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    eScience, 2008. eScience '08. IEEE Fourth International Conference on
  • Conference_Location
    Indianapolis, IN
  • Print_ISBN
    978-1-4244-3380-3
  • Electronic_ISBN
    978-0-7695-3535-7
  • Type

    conf

  • DOI
    10.1109/eScience.2008.113
  • Filename
    4736792