DocumentCode
3247890
Title
A trace-driven emulation framework to predict scalability of large clusters in presence of OS Jitter
Author
De, Pradipta ; Kothari, Ravi ; Mann, Vijay
Author_Institution
IBM India Res. Lab., New Delhi
fYear
2008
fDate
Sept. 29 2008-Oct. 1 2008
Firstpage
232
Lastpage
241
Abstract
Various studies have pointed out the debilitating effects of OS jitter on the performance of parallel applications on large clusters such as the ASCI Purple and the Mare Nostrum at Barcelona Supercomputing Center. These clusters use commodity OSes such as AIX and Linux respectively. The biggest hindrance in evaluating any technique to mitigate jitter is getting access to such large scale production HPC systems running a commodity OS. An earlier attempt aimed at solving this problem was to emulate the effects of OS jitter on more widely available and jitter-free systems such as BlueGene/L. In this paper, we point out the shortcomings of previous such approaches and present the design and implementation of an emulation framework that helps overcome those shortcomings by using innovative techniques. We collect jitter traces on a commodity OS with a given configuration, under which we want to study the scaling behavior. These traces are then replayed on a jitter-free system to predict scalability in presence of OS jitter. The application of this emulation framework to predict scalability is illustrated through a comparative scalability study of an off-the-shelf Linux distribution with a minimal configuration (runlevel 1) and a highly optimized embedded Linux distribution, running on the IO nodes of BlueGene/L. We validate the results of our emulation both on a single node as well as on a real cluster. Our results indicate that an optimized OS along with a technique to synchronize jitter can reduce the performance degradation due to jitter from 99% (in case of the off-the-shelf Linux without any synchronization) to a much more tolerable level of 6% (in case of highly optimized BlueGene/L IO node Linux with synchronization) at 2048 processors. Furthermore, perfect synchronization can give linear scaling with less than 1% slowdown, regardless of the type of OS used. However, as the jitter at different nodes starts getting desynchronized, even with a minor skew across nodes, the optimized OS starts outperforming the off-the-shelf OS.
Keywords
Linux; jitter; operating systems (computers); parallel processing; AIX; ASCI Purple; BlueGene/L; HPC systems; Linux; Mare Nostrum; OS jitter; debilitating effects; large cluster scalability; trace-driven emulation; Degradation; Emulation; Interference; Jitter; Laboratories; Large-scale systems; Linux; Operating systems; Production systems; Scalability;
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster Computing, 2008 IEEE International Conference on
Conference_Location
Tsukuba
ISSN
1552-5244
Print_ISBN
978-1-4244-2639-3
Electronic_ISBN
1552-5244
Type
conf
DOI
10.1109/CLUSTR.2008.4663776
Filename
4663776