DocumentCode
3650208
Title
The design and implementation of a multi-level content-addressable checkpoint file system
Author
Abhishek Kulkarni;Adam Manzanares;Latchesar Ionkov;Michael Lang;Andrew Lumsdaine
Author_Institution
Indiana University
fYear
2012
Firstpage
1
Lastpage
10
Abstract
Long-running HPC applications guard against node failures by writing checkpoints to parallel file systems. Writing these checkpoints on petascale-class machines has proven difficult, and the increased concurrency demands of exascale computing will exacerbate the problem. To meet checkpointing demands and sustain application-perceived throughput at exascale, multi-tiered hierarchical storage architectures involving solid-state burst buffers are being considered. In this paper, we describe the design and implementation of cento, a multi-level, content-addressable checkpoint file system for large-scale HPC systems. cento achieves in-flight checkpoint data reduction across all compute nodes through compression and elimination of duplicate blocks over a series of checkpoints. Through a detailed analysis of checkpoint dumps, we assess the benefits of data reduction for scientific applications that are representative of production workloads. We observe up to 40% data reduction within a limited sample of representative workloads. Finally, experiments on existing systems show a 5 to 20% decrease in checkpoint commit latencies, which reduces the load on the parallel file system.
Publisher
ieee
Conference_Titel
High Performance Computing (HiPC), 2012 19th International Conference on
Print_ISBN
978-1-4673-2372-7
Type
conf
DOI
10.1109/HiPC.2012.6507514
Filename
6507514