• DocumentCode
    3322504
  • Title
    Proactive Fault Tolerance Using Preemptive Migration
  • Author
    Engelmann, C.; Vallee, G.R.; Naughton, T.; Scott, S.L.
  • Author_Institution
    Comput. Sci. & Math. Div., Oak Ridge Nat. Lab., Oak Ridge, TN
  • fYear
    2009
  • fDate
    18-20 Feb. 2009
  • Firstpage
    252
  • Lastpage
    257
  • Abstract
    Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. This paper further relates prior work to the presented architecture and classification, and discusses the challenges ahead for needed supporting technologies.
  • Keywords
    fault tolerant computing; parallel processing; system recovery; high-performance computing; parallel application; preemptive migration; proactive fault tolerance architecture; system failure; Application software; Computer architecture; Computer networks; Concurrent computing; Condition monitoring; Degradation; Distributed computing; Fault tolerance; Laboratories; Resource management
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel, Distributed and Network-based Processing, 2009 17th Euromicro International Conference on
  • Conference_Location
    Weimar
  • ISSN
    1066-6192
  • Print_ISBN
    978-0-7695-3544-9
  • Type
    conf
  • DOI
    10.1109/PDP.2009.31
  • Filename
    4912941