• DocumentCode
    2966557
  • Title
    Improving the scalability of transparent checkpointing for GPU computing systems
  • Author
    Amrizal, A. ; Hirasawa, Shoichi ; Komatsu, Kazuhiko ; Takizawa, Hiroyuki ; Kobayashi, Hiroaki
  • Author_Institution
    Grad. Sch. of Inf. Sci., Tohoku Univ., Sendai, Japan
  • fYear
    2012
  • fDate
    19-22 Nov. 2012
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    As the number of nodes in a GPU computing system increases, checkpointing to a global file system becomes more time-consuming due to I/O bottlenecks and network congestion. To solve this problem, in this paper, we propose a transparent and scalable checkpoint/restart mechanism for OpenCL applications, named Two-level CheCL. As its name implies, Two-level CheCL consists of two different checkpoint implementations, Local CheCL and Global CheCL. Local CheCL avoids checkpointing to the global file system by utilizing each node's local storage. Our experimental results show that Local CheCL can make the checkpointing process up to four times faster than a conventional checkpointing mechanism. We also implement Global CheCL, which utilizes a global file system, to ensure that a global checkpoint file is always available even in the case of a catastrophic failure. We discuss the performance of our proposed mechanism through an analysis with a two-level checkpoint model.
  • Keywords
    application program interfaces; checkpointing; fault tolerant computing; file organisation; graphics processing units; input-output programs; GPU computing systems; I-O bottlenecks; OpenCL applications; catastrophic failure; fault tolerance technique; global CheCL; global checkpoint file; global file system; local CheCL; network congestion; node local storage utilization; restart mechanism; scalability improvement; scalable checkpoint mechanism; transparent checkpointing process; two-level CheCL; two-level checkpoint model; Benchmark testing; Checkpointing; Computational modeling; Graphics processing units; Mathematical model; Random access memory; Scalability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    TENCON 2012 - 2012 IEEE Region 10 Conference
  • Conference_Location
    Cebu
  • ISSN
    2159-3442
  • Print_ISBN
    978-1-4673-4823-2
  • Electronic_ISBN
    2159-3442
  • Type
    conf
  • DOI
    10.1109/TENCON.2012.6412343
  • Filename
    6412343