DocumentCode :
3428306
Title :
MCREngine: A scalable checkpointing system using data-aware aggregation and compression
Author :
Islam, Tanzima Zerin ; Mohror, Kathryn ; Bagchi, Saurabh ; Moody, Adam ; de Supinski, Bronis R. ; Eigenmann, Rudi
Author_Institution :
Sch. of Electr. & Comput. Eng., Purdue Univ., West Lafayette, IN, USA
fYear :
2012
fDate :
10-16 Nov. 2012
Firstpage :
1
Lastpage :
11
Abstract :
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, MCREngine. MCREngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely-used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that MCREngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
Keywords :
checkpointing; data compression; parallel processing; MCREngine; PFS; checkpoint frequency; data semantics; data-aware aggregation; high performance computing systems; large-scale application checkpoints; parallel file system; scalable checkpointing system; Arrays; Checkpointing; Computer numerical control; Libraries; Message systems; Reactive power; Transceivers
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for
Conference_Location :
Salt Lake City, UT
ISSN :
2167-4329
Print_ISBN :
978-1-4673-0805-2
Type :
conf
DOI :
10.1109/SC.2012.77
Filename :
6468462