• DocumentCode
    1640774
  • Title
    Application Resilience: Making Progress in Spite of Failure
  • Author
    Jones, William M. ; Daly, John T. ; DeBardeleben, Nathan A.
  • Author_Institution
    Dept. of Electr. & Comput. Eng., United States Naval Acad., Annapolis, MD
  • fYear
    2008
  • Firstpage
    789
  • Lastpage
    794
  • Abstract
    While measures such as raw compute performance and system capacity continue to be important factors for evaluating cluster performance, such issues as system reliability and application resilience have become increasingly important as cluster sizes rapidly grow. Although efforts to directly improve fault-tolerance are important, it is also essential to accept that application failures will inevitably occur and to ensure that progress is made despite these failures. Application monitoring frameworks are central to providing application resilience. As such, the central theme of this paper is to address the impact that application monitoring detection latency has on the overall system performance. We find that immediate fault detection is not necessary in order to obtain substantial improvement in performance. This conclusion is significant because it implies that less complex, highly portable, and predominately less expensive failure detection schemes would provide adequate application resilience.
  • Keywords
    distributed programming; fault tolerant computing; system monitoring; workstation clusters; application monitoring detection latency; application resilience; cluster performance evaluation; Application software; Computer networks; Condition monitoring; Delay; Grid computing; High performance computing; Reliability; Resilience; System performance; USA Councils; application monitoring; cluster computing; error detection; fault tolerance
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on
  • Conference_Location
    Lyon
  • Print_ISBN
    978-0-7695-3156-4
  • Electronic_ISBN
    978-0-7695-3156-4
  • Type
    conf
  • DOI
    10.1109/CCGRID.2008.99
  • Filename
    4534305