Title :
Flexible Error Recovery Using Versions in Global View Resilience
Author :
Nan Dun;Hajime Fujita;Aiman Fang;Yan Liu;Andrew A. Chien;Pavan Balaji;Kamil Iskra;Wesley Bland;Andrew Siegel
Abstract :
We present the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We briefly describe GVR's interfaces for distributed arrays, versioning, and cross-layer error recovery. We illustrate how GVR can be used for rollback recovery and for a wide range of additional error recovery techniques, including forward recovery from latent errors or silent data corruption. Application results demonstrate that GVR's interfaces and implementation are portable, flexible (supporting a variety of recovery models), and efficient, and that they create a gentle-slope path to tolerating growing error rates in future systems.
Keywords :
"Resilience","Arrays","Error analysis","Forward error correction","Neutrons","Runtime","Monte Carlo methods"
Conference_Title :
Cluster Computing (CLUSTER), 2015 IEEE International Conference on
DOI :
10.1109/CLUSTER.2015.88