Abstract :
The combination of decreasing device reliability due to deep submicron scaling, increasing integration, and the size of future exascale high-performance computers and cloud datacenters poses significant challenges for system resilience. Furthermore, because power and cost are of critical importance, resilience must be provided efficiently and economically. Although providing resilience will require a range of approaches at all levels of the system stack, the final responsibility rests at the system level. In addition to highlighting these challenges, this talk reviews and introduces promising system-level techniques such as configurable isolation, duplication caching, multicore DIMMs, CoVeRT, and 3D checkpointing.
Keywords :
computer centres; scaling circuits; semiconductor device reliability; 3D checkpointing; CoVeRT; cloud datacenters; configurable isolation; deep submicron scaling; device reliability; duplication caching; exascale high-performance computers; multicore DIMM; system level techniques; system resilience; CMOS technology; Cloud computing; Computer architecture; Fault tolerant systems; Microprocessors; Power system reliability; Resilience; Technological innovation; Timing; Very large scale integration; checkpointing; duplication; exascale systems; isolation