DocumentCode
46545
Title
On Timely Staging of HPC Job Input Data
Author
Monti, H.M. ; Butt, Ali R. ; Vazhkudai, Sudharshan S.
Author_Institution
Dept. of Comput. Sci., Virginia Tech, Blacksburg, VA, USA
Volume
24
Issue
9
fYear
2013
fDate
Sept. 2013
Firstpage
1841
Lastpage
1851
Abstract
Innovative scientific applications and emerging dense data sources are creating a data deluge for high-end supercomputing systems. Modern applications are often collaborative in nature, with a distributed user base for input and output data sets. Processing such large input data typically involves copying (or staging) the data onto the supercomputer's specialized high-speed storage, scratch space, for sustained high I/O throughput. This copying is crucial, as remotely accessing the data while an application executes results in unnecessary delays and consequently performance degradation. However, the current practice of conservatively staging data as early as possible leaves the data vulnerable to storage failures, which may entail restaging and reduced job throughput. To address this, we present a timely staging framework that uses a combination of job start-up time predictions, user-specified volunteer or cloud-based intermediate storage nodes, and decentralized data delivery to make input data staging coincide with job start-up. Evaluation of our approach using both PlanetLab and Azure cloud services, as well as simulations based on three years of Jaguar supercomputer (No. 3 in the Top500) job logs, shows as much as a 91.0 percent reduction in staging times compared to direct transfers, a 75.2 percent reduction in wait time on scratch, and a 2.4 percent reduction in usage/hour. (An earlier version of this paper appears in [30].)
Keywords
cloud computing; data handling; parallel processing; resource allocation; Azure cloud service; HPC job input data; Jaguar supercomputer; PlanetLab cloud service; cloud-based intermediate storage node; data copying; data processing; high performance computing; high-end supercomputing system; job start-up time prediction; job throughput; user-specified volunteer; Bandwidth; Cloud computing; Delay; Distributed databases; Materials; Supercomputers; HPC center serviceability; High performance data management; data-staging; end-user data delivery
fLanguage
English
Journal_Title
IEEE Transactions on Parallel and Distributed Systems
Publisher
IEEE
ISSN
1045-9219
Type
jour
DOI
10.1109/TPDS.2012.279
Filename
6311399
Link To Document