DocumentCode :
3585104
Title :
Efficient, Failure Resilient Transactions for Parallel and Distributed Computing
Author :
Lofstead, Jay ; Dayal, Jai ; Jimenez, Ivo ; Maltzahn, Carlos
Author_Institution :
Sandia Nat. Labs., Albuquerque, NM, USA
fYear :
2014
Firstpage :
17
Lastpage :
24
Abstract :
Scientific simulations are moving away from using centralized persistent storage for intermediate data between workflow steps towards an all-online model. This shift is motivated by the relatively slow growth of IO bandwidth compared with increases in compute speed. The challenges this shift presents to Integrated Application Workflows stem from the loss of persistent storage semantics for node-to-node communication. One step towards addressing this semantics gap is using transactions to logically delineate a data set from hundreds of thousands of processes to thousands of servers as an atomic unit. Our previously demonstrated Doubly Distributed Transactions (D2T) protocol showed a high-performance solution, but did not explore how to detect and recover from faults; the focus was instead on demonstrating high performance in the typical case. The research presented here addresses fault detection and recovery based on an enhanced protocol design. The total overhead for a full transaction with multiple operations at 65,536 processes is 0.055 seconds on average. The fault detection and recovery mechanisms perform similarly to the success case, adding only the appropriate timeouts for the system. This paper explores the challenges in designing a recoverable protocol for doubly distributed transactions, particularly for parallel computing environments.
Keywords :
fault diagnosis; parallel processing; storage management; centralized persistent storage; distributed computing; doubly distributed transaction protocol; failure resilient transaction; fault detection; fault recovery; integrated application workflow; node-to-node communication; parallel computing; Computational modeling; Data models; Memory; Protocols; Semantics; Servers; Standards;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Intensive Scalable Computing Systems (DISCS), 2014 International Workshop on
Type :
conf
DOI :
10.1109/DISCS.2014.13
Filename :
7079022