DocumentCode :
2976987
Title :
Computational resiliency for distributed applications
Author :
McGill, Kathleen; Taylor, Stephen
Author_Institution :
Thayer Sch. of Eng., Dartmouth Coll., Hanover, NH, USA
Year :
2011
Date :
7-10 Nov. 2011
Firstpage :
1472
Lastpage :
1479
Abstract :
In recent years, computer network attacks have decreased the overall reliability of computer systems and undermined confidence in mission-critical software. These robustness issues are magnified in distributed applications, which provide multiple points of failure and attack. The notion of resiliency is concerned with constructing applications that are able to operate through a wide variety of failures, errors, and malicious attacks. A number of approaches have been proposed in the literature based on fault tolerance achieved through replication of resources. In general, these approaches provide graceful degradation of performance to the point of failure but do not guarantee progress in the presence of multiple cascading and recurrent failures. Our approach is to dynamically replicate message-passing processes, detect inconsistencies in their behavior, and restore the level of fault tolerance as a computation proceeds. This paper describes a novel operating system technology for resilient message-passing applications that is automated, scalable, and transparent. The technology provides mechanisms for process replication, process migration, and adaptive failure detection. To quantify the performance overhead of the technology, we benchmark a distributed application exemplar to represent a broader class of applications.
Keywords :
computer network security; fault tolerant computing; message passing; operating systems (computers); safety-critical software; cascading failures; computational resiliency; computer network attacks; computer system reliability; distributed applications; fault tolerance; malicious attacks; mission-critical software; operating system technology; recurrent failures; resilient message-passing applications; Delay; Fault tolerance; Fault tolerant systems; Kernel; Libraries; Protocols; Sockets; distributed systems; failure detection; mission-assurance; process migration; process replication; resiliency;
Language :
English
Publisher :
IEEE
Conference_Title :
MILITARY COMMUNICATIONS CONFERENCE, 2011 - MILCOM 2011
Conference_Location :
Baltimore, MD
ISSN :
2155-7578
Print_ISBN :
978-1-4673-0079-7
Type :
conf
DOI :
10.1109/MILCOM.2011.6127514
Filename :
6127514