DocumentCode
46545
Title
On Timely Staging of HPC Job Input Data
Author
Monti, H.M. ; Butt, Ali R. ; Vazhkudai, Sudharshan S.
Author_Institution
Dept. of Comput. Sci., Virginia Tech, Blacksburg, VA, USA
Volume
24
Issue
9
fYear
2013
fDate
Sept. 2013
Firstpage
1841
Lastpage
1851
Abstract
Innovative scientific applications and emerging dense data sources are creating a data deluge for high-end supercomputing systems. Modern applications are often collaborative in nature, with a distributed user base for input and output data sets. Processing such large input data typically involves copying (or staging) the data onto the supercomputer's specialized high-speed storage, scratch space, for sustained high I/O throughput. This copying is crucial, as remotely accessing the data while an application executes results in unnecessary delays and consequently performance degradation. However, the current practice of conservatively staging data as early as possible leaves the data vulnerable to storage failures, which may entail restaging and reduced job throughput. To address this, we present a timely staging framework that uses a combination of job start-up time predictions, user-specified volunteer or cloud-based intermediate storage nodes, and decentralized data delivery to make input data staging coincide with job start-up. Evaluation of our approach using both PlanetLab and Azure cloud services, as well as simulations based on three years of Jaguar supercomputer (No. 3 in the Top500) job logs, shows as much as a 91.0 percent reduction in staging times compared to direct transfers, a 75.2 percent reduction in wait time on scratch, and a 2.4 percent reduction in usage/hour. (An earlier version of this paper appears in [30].)
Keywords
cloud computing; data handling; parallel processing; resource allocation; Azure cloud service; HPC job input data; Jaguar supercomputer; PlanetLab cloud service; cloud-based intermediate storage node; data copying; data processing; high performance computing; high-end supercomputing system; job start-up time prediction; job throughput; user-specified volunteer; Bandwidth; Cloud computing; Delay; Distributed databases; Materials; Supercomputers; HPC center serviceability; High performance data management; data-staging; end-user data delivery
fLanguage
English
Journal_Title
IEEE Transactions on Parallel and Distributed Systems
Publisher
IEEE
ISSN
1045-9219
Type
jour
DOI
10.1109/TPDS.2012.279
Filename
6311399
Link To Document