  • DocumentCode
    46545
  • Title
    On Timely Staging of HPC Job Input Data
  • Author
    Monti, H.M.; Butt, Ali R.; Vazhkudai, Sudharshan S.
  • Author_Institution
    Dept. of Comput. Sci., Virginia Tech, Blacksburg, VA, USA
  • Volume
    24
  • Issue
    9
  • fYear
    2013
  • fDate
    Sept. 2013
  • Firstpage
    1841
  • Lastpage
    1851
  • Abstract
    Innovative scientific applications and emerging dense data sources are creating a data deluge for high-end supercomputing systems. Modern applications are often collaborative in nature, with a distributed user base for input and output data sets. Processing such large input data typically involves copying (or staging) the data onto the supercomputer's specialized high-speed storage, scratch space, to sustain high I/O throughput. This copying is crucial, as remotely accessing the data while an application executes results in unnecessary delays and, consequently, performance degradation. However, the current practice of conservatively staging data as early as possible leaves the data vulnerable to storage failures, which may entail restaging and reduced job throughput. To address this, we present a timely staging framework that uses a combination of job start-up time predictions, user-specified volunteer or cloud-based intermediate storage nodes, and decentralized data delivery to make input data staging coincide with job start-up. Evaluation of our approach using both PlanetLab and Azure cloud services, as well as simulations based on three years of job logs from the Jaguar supercomputer (No. 3 in the Top500 list), shows as much as a 91.0 percent reduction in staging times compared to direct transfers, a 75.2 percent reduction in wait time on scratch, and a 2.4 percent reduction in usage/hour. (An earlier version of this paper appears in [30].)
  • Keywords
    cloud computing; data handling; parallel processing; resource allocation; Azure cloud service; HPC job input data; Jaguar supercomputer; PlanetLab cloud service; cloud-based intermediate storage node; data copying; data processing; high performance computing; high-end supercomputing system; job start-up time prediction; job throughput; user-specified volunteer; Bandwidth; Cloud computing; Delay; Distributed databases; Materials; Supercomputers; HPC center serviceability; High performance data management; data-staging; end-user data delivery
  • fLanguage
    English
  • Journal_Title
    IEEE Transactions on Parallel and Distributed Systems
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type
    jour
  • DOI
    10.1109/TPDS.2012.279
  • Filename
    6311399