Title :
Crosscheck: Hardening Replicated Multithreaded Services
Author :
Martens, Alke ; Borchert, Christoph ; Geissler, Tobias Oliver ; Lohmann, Daniel ; Spinczyk, Olaf ; Kapitza, R.
Abstract :
State-machine replication has received widespread attention for the provisioning of highly available services in data centers. However, current production systems focus on tolerating crash faults only, and prominent service outages caused by state corruptions have indicated that this is a risky strategy. In the future, state corruptions due to transient faults (such as bit flips) will become even more likely because of ongoing hardware trends toward shrinking structure sizes and reduced operating voltages. In this paper we present Crosscheck, an approach to tolerate arbitrary state corruption (ASC) in the context of fault-tolerant replication of multithreaded services. Crosscheck is able to detect silent data corruptions ahead of execution, and by crosschecking state changes with co-executing replicas, even ASCs can be detected. Finally, fault tolerance is achieved by a fine-grained recovery using fault-free replicas. Our implementation is transparent to the application by utilizing fine-grained software-hardening mechanisms using aspect-oriented programming. To validate Crosscheck, we present a replicated multithreaded key-value store that is resilient to state corruptions.
Keywords :
aspect-oriented programming; computer centres; fault tolerant computing; multi-threading; system recovery; ASC tolerance; CROSSCHECK; arbitrary state corruption tolerance; bit flips; crash fault; data centers; fault-free replica; fault-tolerant replication; fine-grained recovery; fine-grained software-hardening mechanism; hardware trend; highly available services; operating voltage reduction; replicated multithreaded key-value; replicated multithreaded service hardening; service outage; silent data corruption detection; state corruption resilience; state-machine replication; structure size shrinking; transient fault; Computer crashes; Hardware; Message systems; Object oriented modeling; Production systems; Redundancy; Slabs; AspectC++; Determinism; Multithreading; Replication; Software Error Hardening
Conference_Title :
Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on
Conference_Location :
Atlanta, GA
DOI :
10.1109/DSN.2014.98