Title :
With great reliability comes great responsibility: tradeoffs of run-time policy on high reliability systems
Author :
Kleban, Stephen D. ; Johnston, Jeanette R. ; Ang, James A. ; Clearwater, S.H.
Author_Institution :
Sandia Nat. Labs., Albuquerque, NM, USA
Abstract :
In this paper we describe a simulation study to improve performance on a large, highly utilized cluster at Sandia National Laboratories. The unique characteristic of the cluster is that it places very few constraints on job size. In particular, the run-time is limited only by system times, which occur about every two weeks. The major contribution of this paper is that we quantify the difference in makespan between running a single long job and running its equivalent as many shorter jobs. We find that running longer jobs is beneficial to the facility as a whole when the cycle-weighted makespans are considered, and that running shorter jobs has an overall beneficial effect on the makespan when the jobs are taken unweighted, and for most users.
Keywords :
computer network reliability; performance evaluation; scheduling; workstation clusters; Sandia National Laboratories; cycle-weighted makespans; high reliability systems; highly utilized cluster; job size; performance; run-time policy; simulation study; Computerized monitoring; Delay; Failure analysis; Laboratories; Occupational stress; Performance analysis; Productivity; Runtime; Scalability; Supercomputers
Conference_Title :
Cluster Computing and the Grid, 2004. CCGrid 2004. IEEE International Symposium on
Print_ISBN :
0-7803-8430-X
DOI :
10.1109/CCGrid.2004.1336653