DocumentCode :
1759971
Title :
Soft Failures in Large Datacenters
Author :
Sankar, S. ; Gurumurthi, Sudhanva
Volume :
13
Issue :
2
fYear :
2014
fDate :
July-Dec. 11 2014
Firstpage :
105
Lastpage :
108
Abstract :
A major problem in managing large-scale datacenters is diagnosing and fixing machine failures. Most large datacenter deployments have a management infrastructure that can help diagnose failure causes, and manage assets that were fixed as part of the repair process. Previous studies identify only actual hardware replacements to calculate Annualized Failure Rate (AFR) and component reliability. In this paper, we show that service availability is significantly affected by soft failures and that this class of failures is becoming an important issue at large datacenters with minimum human intervention. Soft failures in the datacenter do not require actual hardware replacements, but still result in service downtime, and are equally important because they disrupt normal service operation. We show failure trends observed in a large datacenter deployment of commodity servers and motivate the need to modify conventional datacenter designs to help reduce soft failures and increase service availability.
Keywords :
computer centres; fault diagnosis; AFR; annualized failure rate; asset management; commodity servers; component reliability; datacenter deployments; datacenter designs; datacenter management; failure cause diagnosis; hardware replacements; machine failure diagnosis; machine failure fixing; management infrastructure; repair process; service availability; soft failures; Client-server systems; Data centers; Hard disks; Large-scale systems; Maintenance engineering; Market research; Transient analysis; C.4 Performance of Systems; C.5.5 Servers; Characterization; Datacenter; Management; Reliability;
fLanguage :
English
Journal_Title :
Computer Architecture Letters
Publisher :
IEEE
ISSN :
1556-6056
Type :
jour
DOI :
10.1109/L-CA.2013.25
Filename :
6585257