DocumentCode :
3650208
Title :
The design and implementation of a multi-level content-addressable checkpoint file system
Author :
Abhishek Kulkarni;Adam Manzanares;Latchesar Ionkov;Michael Lang;Andrew Lumsdaine
Author_Institution :
Indiana University
fYear :
2012
Firstpage :
1
Lastpage :
10
Abstract :
Long-running HPC applications guard against node failures by writing checkpoints to parallel file systems. Writing these checkpoints with petascale-class machines has proven difficult, and the increased concurrency demands of exascale computing will exacerbate this problem. To meet checkpointing demands and sustain application-perceived throughput at exascale, multi-tiered hierarchical storage architectures involving solid-state burst buffers are being considered. In this paper, we describe the design and implementation of cento, a multi-level, content-addressable checkpoint file system for large-scale HPC systems. cento achieves in-flight checkpoint data reduction across all compute nodes through compression and elimination of duplicate blocks over a series of checkpoints. Through a detailed analysis of checkpoint dumps, we assess the benefits of data reduction for scientific applications that are representative of production workloads. We observe up to 40% data reduction within a limited sample of representative workloads. Finally, experiments on existing systems show a decrease in checkpoint commit latencies of 5 to 20%, reducing the load on the parallel file system.
Publisher :
IEEE
Conference_Title :
High Performance Computing (HiPC), 2012 19th International Conference on
Print_ISBN :
978-1-4673-2372-7
Type :
conf
DOI :
10.1109/HiPC.2012.6507514
Filename :
6507514