Title :
On Timely Staging of HPC Job Input Data
Author :
Monti, H.M. ; Butt, Ali R. ; Vazhkudai, Sudharshan S.
Author_Institution :
Dept. of Comput. Sci., Virginia Tech, Blacksburg, VA, USA
Abstract :
Innovative scientific applications and emerging dense data sources are creating a data deluge for high-end supercomputing systems. Modern applications are often collaborative in nature, with a distributed user base for input and output data sets. Processing such large input data typically involves copying (or staging) the data onto the supercomputer's specialized high-speed storage, scratch space, for sustained high I/O throughput. This copying is crucial as remotely accessing the data while an application executes results in unnecessary delays and consequently performance degradation. However, the current practice of conservatively staging data as early as possible makes the data vulnerable to storage failures, which may entail restaging and reduced job throughput. To address this, we present a timely staging framework that uses a combination of job start-up time predictions, user-specified volunteer or cloud-based intermediate storage nodes, and decentralized data delivery to coincide input data staging with job start-up. Evaluation of our approach using both PlanetLab and Azure cloud services, as well as simulations based on three years of Jaguar supercomputer (No. 3 in Top500) job logs, shows as much as 91.0 percent reduction in staging times compared to direct transfers, 75.2 percent reduction in wait time on scratch, and 2.4 percent reduction in usage/hour. (An earlier version of this paper appears in [30].)
Keywords :
cloud computing; data handling; parallel processing; resource allocation; Azure cloud service; HPC job input data; Jaguar supercomputer; PlanetLab cloud service; cloud-based intermediate storage node; data copying; data processing; high performance computing; high-end supercomputing system; job start-up time prediction; job throughput; user-specified volunteer; Bandwidth; Cloud computing; Delay; Distributed databases; Materials; Supercomputers; HPC center serviceability; High performance data management; data-staging; end-user data delivery;
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
DOI :
10.1109/TPDS.2012.279