Parallel I/O Performance for Application-Level Checkpointing on the Blue Gene/P System

Author

Fu, Jing ; Min, Misun ; Latham, Robert ; Carothers, Christopher D.

Author_Institution

Dept. of Comput. Sci., Rensselaer Polytech. Inst., Troy, NY, USA

fYear

2011

fDate

26-30 Sept. 2011

Firstpage

465

Lastpage

473

Abstract

As the number of processors increases to hundreds of thousands in parallel computer architectures, the failure probability rises correspondingly, making fault tolerance a highly important and challenging task. Application-level checkpointing is one of the most popular techniques to proactively deal with unexpected failures because of its portability and flexibility. During the checkpoint phase, the local states of the computation spread across thousands of processors are saved to stable storage. Unfortunately, this approach results in heavy I/O load and can cause an I/O bottleneck in a massively parallel system. In this paper, we examine application-level checkpointing for a massively parallel electromagnetic solver system called NekCEM on the IBM Blue Gene/P at Argonne National Laboratory. We discuss an application-level, two-phase I/O approach, called "reduced-blocking I/O" (rbIO), and a tuned MPI-IO collective approach (coIO), and we demonstrate their performance advantage over the "1 POSIX file per processor" approach. Our study shows that rbIO and coIO result in a 100× improvement over previous checkpointing approaches on up to 65,536 processors of the Blue Gene/P using GPFS. Our study also demonstrates a 25× production performance improvement for NekCEM. We show how to optimize parameter settings for these parallel I/O approaches and verify the results through I/O profiling. In particular, we examine the performance advantage of rbIO and demonstrate the potential benefits of this approach over the traditional MPI-IO routine, coIO.

Keywords

checkpointing; fault tolerance; parallel architectures; parallel machines; performance evaluation; probability; I/O bottleneck; IBM Blue Gene/P system; MPI-IO collective approach; NekCEM; POSIX file per processor; application-level checkpointing; failure probability; heavy I/O load; massively parallel electromagnetic solver system; parallel I/O performance; parallel computer architecture; reduced-blocking I/O; Bandwidth; Computer architecture; Operating systems; Production; Program processors; Semantics; Blue Gene/P; Parallel I/O

fLanguage

English

Publisher

IEEE

Conference_Titel

Cluster Computing (CLUSTER), 2011 IEEE International Conference on

Conference_Location

Austin, TX

Print_ISBN

978-1-4577-1355-2

Electronic_ISBN

978-0-7695-4516-5

Type

conf

DOI

10.1109/CLUSTER.2011.81

Filename

6061135