Title :
Applications resilience on clouds
Author :
Nguyên, Toàn ; Désidéri, Jean-Antoine ; Trifan, Laurentiu
Author_Institution :
Project OPALE, INRIA, St. Ismier, France
Abstract :
Cloud computing infrastructures support system and network fault tolerance. They transparently repair and prevent communication and software errors. They also allow duplication and migration of jobs and data to guard against hardware failures. However, only limited work has been done so far on application resilience, i.e., the ability to resume normal execution after errors and abnormal executions in distributed environments and clouds. This paper addresses open issues and solutions for application error detection and management. It also gives an overview of a testbed used to design, deploy, execute, monitor, restart and resume distributed applications on cloud infrastructures in case of failures.
Keywords :
cloud computing; software fault tolerance; abnormal executions; application errors detection; application errors management; applications resilience; cloud computing infrastructures; communication errors; distributed environments; hardware failures; network fault tolerance; software errors; Checkpointing; Fault tolerance; Fault tolerant systems; Hardware; Resilience; Software; Transient analysis; Cloud Computing; High-Performance Computing; Resilience; Scientific Applications; Workflows;
Conference_Title :
High Performance Computing and Simulation (HPCS), 2012 International Conference on
Conference_Location :
Madrid
Print_ISBN :
978-1-4673-2359-8
DOI :
10.1109/HPCSim.2012.6266891