• DocumentCode
    2052227
  • Title

    Parallel I/O Performance for Application-Level Checkpointing on the Blue Gene/P System

  • Author

    Fu, Jing ; Min, Misun ; Latham, Robert ; Carothers, Christopher D.

  • Author_Institution
    Dept. of Comput. Sci., Rensselaer Polytech. Inst., Troy, NY, USA
  • fYear
    2011
  • fDate
    26-30 Sept. 2011
  • Firstpage
    465
  • Lastpage
    473
  • Abstract
    As the number of processors increases to hundreds of thousands in parallel computer architectures, the failure probability rises correspondingly, making fault tolerance a highly important and challenging task. Application-level checkpointing is one of the most popular techniques to proactively deal with unexpected failures because of its portability and flexibility. During the checkpoint phase, the local states of the computation spread across thousands of processors are saved to stable storage. Unfortunately, this approach results in heavy I/O load and can cause an I/O bottleneck in a massively parallel system. In this paper, we examine application-level checkpointing for a massively parallel electromagnetic solver system called NekCEM on the IBM Blue Gene/P at Argonne National Laboratory. We discuss an application-level, two-phase I/O approach, called "reduced-blocking I/O" (rbIO), and a tuned MPI-IO collective approach (coIO), and we demonstrate their performance advantage over the "1 POSIX file per processor" approach. Our study shows that rbIO and coIO result in 100× improvement over previous checkpointing approaches on up to 65,536 processors of the Blue Gene/P using the GPFS. Our study also demonstrates a 25× production performance improvement for NekCEM. We show how to optimize parameter settings for those parallel I/O approaches and to verify results by I/O profilings. In particular, we examine the performance advantage of rbIO and demonstrate the potential benefits of this approach over the traditional MPI-IO routine, coIO.
  • Keywords
    checkpointing; fault tolerance; parallel architectures; parallel machines; performance evaluation; probability; I/O bottleneck; IBM Blue Gene/P system; MPI-IO collective approach; NekCEM; POSIX file per processor; application-level checkpointing; failure probability; fault tolerance; heavy I/O load; massively parallel electromagnetic solver system; parallel I/O performance; parallel computer architecture; reduced-blocking I/O; Bandwidth; Checkpointing; Computer architecture; Operating systems; Production; Program processors; Semantics; Blue Gene/P; Parallel I/O; checkpointing; fault tolerance;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing (CLUSTER), 2011 IEEE International Conference on
  • Conference_Location
    Austin, TX
  • Print_ISBN
    978-1-4577-1355-2
  • Electronic_ISBN
    978-0-7695-4516-5
  • Type

    conf

  • DOI
    10.1109/CLUSTER.2011.81
  • Filename
    6061135