Title :
MCREngine: A scalable checkpointing system using data-aware aggregation and compression
Author :
Islam, Tanzima Zerin ; Mohror, Kathryn ; Bagchi, Saurabh ; Moody, Adam ; de Supinski, Bronis R. ; Eigenmann, Rudi
Author_Institution :
Sch. of Electr. & Comput. Eng., Purdue Univ., West Lafayette, IN, USA
Abstract :
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of a failure. We alleviate this problem through a scalable checkpoint-restart system, MCRENGINE. MCRENGINE aggregates checkpoints from multiple application processes using knowledge of the data semantics available through widely used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves the compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that MCRENGINE reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
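The contrast the abstract draws between simple concatenation and data-aware aggregation can be illustrated with a minimal sketch. It is not MCRENGINE's implementation: grouping by variable name, the variable names (`density`, `pressure`, `velocity`), the helper functions, the synthetic data, and the use of zlib are all assumptions made for illustration. The synthetic checkpoints are deliberately constructed so that each variable holds the same bytes in every process, which real application data will not do.

```python
import random
import zlib


def make_checkpoints(num_procs=4, var_size=16 * 1024, seed=0):
    """Build synthetic per-process checkpoints (hypothetical data).

    Each variable is incompressible on its own but identical across
    processes, so any gain must come from cross-process redundancy.
    """
    rng = random.Random(seed)
    base = {
        name: bytes(rng.getrandbits(8) for _ in range(var_size))
        for name in ("density", "pressure", "velocity")
    }
    return [dict(base) for _ in range(num_procs)]


def concat_by_process(checkpoints):
    """Simple concatenation: one whole checkpoint after another."""
    return b"".join(ckpt[name] for ckpt in checkpoints for name in sorted(ckpt))


def concat_by_variable(checkpoints):
    """Data-aware layout (assumed): place same-named variables side by side."""
    names = sorted(checkpoints[0])
    return b"".join(ckpt[name] for name in names for ckpt in checkpoints)


checkpoints = make_checkpoints()
naive = zlib.compress(concat_by_process(checkpoints))
aware = zlib.compress(concat_by_variable(checkpoints))
print("simple concatenation:", len(naive), "bytes")
print("grouped by variable: ", len(aware), "bytes")
```

In this sketch the grouped layout compresses better because zlib only finds repeats within a 32 KB window: grouping puts copies of the same variable 16 KB apart, whereas simple concatenation separates them by a full 48 KB checkpoint.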
Keywords :
checkpointing; data compression; parallel processing; MCREngine; PFS; checkpoint frequency; data semantics; data-aware aggregation; high performance computing systems; large-scale application checkpoints; parallel file system; scalable checkpointing system; Arrays; Checkpointing; Computer numerical control; Libraries; Message systems; Reactive power; Transceivers
Conference_Title :
High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for
Conference_Location :
Salt Lake City, UT
Print_ISBN :
978-1-4673-0805-2