Title :
Failure data analysis of a large-scale heterogeneous server environment
Author :
Sahoo, Ramendra K. ; Squillante, Mark S. ; Sivasubramaniam, Anand ; Zhang, Yanyong
Author_Institution :
Dept. of Exploratory Server Syst., IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
Date :
28 June-1 July 2004
Abstract :
The growing complexity of hardware and software mandates the recognition of fault occurrence in system deployment and management. While there are several techniques to prevent and/or handle faults, there continues to be a growing need for an in-depth understanding of system errors and failures and their empirical and statistical properties. This understanding can help evaluate the effectiveness of different techniques for improving system availability, in addition to developing new solutions. In this paper, we analyze the empirical and statistical properties of system errors and failures from a network of nearly 400 heterogeneous servers running a diverse workload over a year. While improvements in system robustness continue to limit the number of actual failures to a very small fraction of the recorded errors, the failure rates are significant and highly variable. Our results also show that the system error and failure patterns consist of time-varying behavior containing long stationary intervals. These stationary intervals exhibit various strong correlation structures and periodic patterns, which affect performance but can also be exploited to address such performance issues.
Keywords :
failure analysis; fault tolerant computing; network servers; system recovery; failure data analysis; fault handling; fault occurrence recognition; fault prevention; hardware complexity; software complexity; system errors; system failures; system management; Circuits; Computer bugs; Computer science; Cosmic rays; Costs; Data analysis; Failure analysis; Hardware; Large-scale systems; Voltage
Conference_Title :
Dependable Systems and Networks, 2004 International Conference on
Print_ISBN :
0-7695-2052-9
DOI :
10.1109/DSN.2004.1311948