Title of article :
Improving availability with recursive microreboots: a soft-state system case study
Author/Authors :
Candea, George and Cutler, James and Fox, Armando
Issue Information :
Journal issue with serial number, year 2004
Pages :
36
From page :
213
To page :
248
Abstract :
Even after decades of software engineering research, complex computer systems still fail. This paper makes the case for increasing research emphasis on dependability and, specifically, on improving availability by reducing time-to-recover. All software fails at some point, so systems must be able to recover from failures. Recovery itself can fail too, so systems must know how to intelligently retry their recovery. We present here a recursive approach, in which a minimal subset of components is recovered first; if that does not work, progressively larger subsets are recovered. Our domain of interest is Internet services; these systems experience primarily transient or intermittent failures that can typically be resolved by rebooting. Conceding that failure-free software will continue eluding us for years to come, we undertake a systematic investigation of fine-grained, component-level restarts (microreboots) as high-availability medicine. Building and maintaining an accurate model of large Internet systems is nearly impossible, due to their scale and constantly evolving nature, so we take an application-generic approach that relies on empirical observations to manage recovery. We apply recursive microreboots to Mercury, a commercial off-the-shelf (COTS)-based satellite ground station that is based on an Internet service platform. Mercury has been in successful operation for over 3 years. From our experience with Mercury, we draw design guidelines and lessons for the application of recursive microreboots to other software systems. We also present a set of guidelines for building systems amenable to recursive reboots, known as “crash-only software systems.”
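The recursive approach described in the abstract (recover a minimal subset of components first, then progressively larger subsets if that does not work) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not taken from the paper: the `Node` class, the `recursive_recover` function, the `restart` and `is_healthy` callbacks, and the tree-shaped grouping of components are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """Hypothetical restart-tree node: restarting a node restarts its whole subtree."""
    name: str
    children: List["Node"] = field(default_factory=list)
    # Excluded from repr/compare to avoid parent <-> child recursion.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def __post_init__(self) -> None:
        for child in self.children:
            child.parent = self

    def components(self) -> List[str]:
        """Names of every component covered by restarting this node."""
        names = [self.name]
        for child in self.children:
            names.extend(child.components())
        return names


def recursive_recover(
    failed: Node,
    restart: Callable[[List[str]], None],
    is_healthy: Callable[[], bool],
) -> Optional[Node]:
    """Restart the smallest subtree first, then widen until the system is healthy.

    Returns the node whose restart cured the failure, or None if even
    restarting the root subtree did not help.
    """
    node: Optional[Node] = failed
    while node is not None:
        restart(node.components())   # microreboot this subset of components
        if is_healthy():             # empirical check, no system model needed
            return node
        node = node.parent           # escalate to the next larger subset
    return None
```

A caller would supply `restart` (for example, a function that kills and relaunches the named processes) and `is_healthy` (for example, an end-to-end probe). The sketch escalates strictly upward through enclosing groups, which is only one possible policy.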
Keywords :
Recovery-oriented computing , High Availability , Microreboots
Journal title :
Performance Evaluation
Serial Year :
2004
Record number :
1569756