Title :
Evaluating Performance Impacts of Delayed Failure Repairing on Large-Scale Systems
Author :
Zhou, Zhou ; Tang, Wei ; Zheng, Ziming ; Lan, Zhiling ; Desai, Narayan
Author_Institution :
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL, USA
Abstract :
With the rapid advances in technology, computing is moving toward the exascale. Many experts predict that exascale computers will have millions of nodes, billions of threads of execution, hundreds of petabytes of inner memory, and exabytes of persistent storage. For systems of such scale, frequent failures are becoming a serious concern. One of the most important reasons is that failures are hard to detect in a large-scale system; as a result, failure repair may take substantial time. In this paper, we investigate the effect of delayed repairing on two popular types of high-performance computing systems: IBM Blue Gene/P and general clusters. We analyze how delayed failure repairing affects job performance when some computing units are at fault but not fixed in time. Our study is based on real workload traces and RAS logs collected from production supercomputing systems. Our trace-based simulations indicate that fast failure detection and recovery are essential for moving toward petascale computing and beyond.
Keywords :
IBM computers; large-scale systems; mainframes; parallel machines; random-access storage; IBM Blue Gene/P; RAS logs; delayed failure repairing; delayed repairing; exascale computing; fast failure detection; general cluster; high-performance computing systems; inner memory; job performance; performance impact evaluation; persistent storage; petascale computing; production supercomputing systems; trace-based simulations; Delay; Maintenance engineering; Processor scheduling; System performance; Time factors; Weibull distribution; performance impact; resource management
Conference_Title :
Cluster Computing (CLUSTER), 2011 IEEE International Conference on
Conference_Location :
Austin, TX
Print_ISBN :
978-1-4577-1355-2
Electronic_ISBN :
978-0-7695-4516-5
DOI :
10.1109/CLUSTER.2011.71