Regional Information Center for Science and Technology - Fast checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on multicore architecture

DocumentCode :

1872840

Title :

Fast checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on multicore architecture

Author :

Ouyang, Xiangyong ; Gopalakrishnan, Karthik ; Gangadharappa, Tejus ; Panda, Dhabaleswar K.

Author_Institution :

Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA

fYear :

2009

fDate :

16-19 Dec. 2009

Firstpage :

Lastpage :

108

Abstract :

Large-scale compute clusters continue to grow. However, as clusters and applications grow, the Mean Time Between Failures (MTBF) has dropped from days to hours. As a result, fault tolerance within the cluster has become imperative. MPI, the de facto standard for parallel programming, is widely used on such large clusters. Many MPI implementations rely on Checkpoint/Restart schemes built on the Berkeley Lab Checkpoint Restart (BLCR) library to achieve some level of fault tolerance. However, the performance of the Checkpoint/Restart mechanism does not scale well with increasing job size. As a result, the deployment of Checkpoint/Restart mechanisms for large-scale parallel applications is compromised. In our previous work, we proposed a technique to aggregate certain categories of checkpoint writes to reduce the checkpointing overhead. However, an application still experiences slow checkpoint writing because it is blocked waiting for its checkpoint file writes to complete. In this paper, we propose the Write Aggregation with Dynamic Buffer and Interleaving scheme to reduce the overhead related to checkpoint creation. By aggregating all checkpoint writes into a dynamic buffer pool and overlapping the application progress with the file writes, our algorithm significantly reduces checkpoint creation overhead. In experiments using 64 processor cores, our design demonstrates a 2.62x speedup in checkpoint creation time compared to the original BLCR design. Our scheme also reduces the impact of checkpointing on the application execution time from 20% to 6% when 3 checkpoints are taken during an application run.
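
The general idea the abstract describes, coalescing many small writes into larger buffers and letting a background thread drain them while the caller keeps running, can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not use BLCR: the class name `AggregatingWriter`, its parameters, and the `flushes` counter are hypothetical, and a fixed-size bounded queue stands in for the paper's dynamic buffer pool.

```python
import io
import queue
import threading


class AggregatingWriter:
    """Illustrative only: coalesce small writes into fixed-size chunks and
    flush them from a background thread so the caller is not blocked on I/O."""

    def __init__(self, sink, chunk_size=4096, pool_chunks=8):
        self._sink = sink                  # any object with .write(bytes)
        self._chunk_size = chunk_size
        self._current = bytearray()        # chunk currently being filled
        # Assumption: a bounded queue plays the role of the buffer pool.
        # When it is full, put() blocks, which applies back-pressure.
        self._full = queue.Queue(maxsize=pool_chunks)
        self.flushes = 0                   # number of writes issued to the sink
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def write(self, data):
        """Append data; hand off every completed chunk to the writer thread."""
        self._current += data
        while len(self._current) >= self._chunk_size:
            chunk = bytes(self._current[:self._chunk_size])
            del self._current[:self._chunk_size]
            self._full.put(chunk)

    def close(self):
        """Flush the partial chunk, stop the writer thread, and wait for it."""
        if self._current:
            self._full.put(bytes(self._current))
            self._current = bytearray()
        self._full.put(None)               # sentinel: no more chunks
        self._worker.join()

    def _drain(self):
        """Background loop: write queued chunks to the sink in FIFO order."""
        while True:
            chunk = self._full.get()
            if chunk is None:
                return
            self._sink.write(chunk)
            self.flushes += 1


if __name__ == "__main__":
    out = io.BytesIO()
    writer = AggregatingWriter(out, chunk_size=4096)
    for _ in range(1000):
        writer.write(b"x" * 100)           # 1000 small writes from the caller
    writer.close()
    print(len(out.getvalue()), "bytes written in", writer.flushes, "flushes")
```

In this sketch the caller's `write` only copies into memory, so the number of writes that reach the sink depends on the chunk size rather than on how many small writes the caller issued. A real checkpointing layer would also need to handle multiple processes per node, buffer sizing, and write failures, none of which are modeled here.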

Keywords :

buffer storage; checkpointing; fault tolerant computing; parallel architectures; parallel programming; pattern clustering; BLCR; Berkeley lab checkpoint restart; MPI; MTBF; de-facto standard; dynamic buffer; fast checkpointing; fault tolerance; large scale compute clusters; mean time between failures; multicore architecture; parallel programming; write aggregation; Aggregates; Checkpointing; Computer architecture; Fault tolerance; Interleaved codes; Large-scale systems; Libraries; Multicore processing; Parallel programming; Writing;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

High Performance Computing (HiPC), 2009 International Conference on

Conference_Location :

Kochi

Print_ISBN :

978-1-4244-4922-4

Electronic_ISBN :

978-1-4244-4921-7

Type :

conf

DOI :

10.1109/HIPC.2009.5433218

Filename :

5433218

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1872840