• DocumentCode
    3585104
  • Title
    Efficient, Failure Resilient Transactions for Parallel and Distributed Computing
  • Author
    Lofstead, Jay ; Dayal, Jai ; Jimenez, Ivo ; Maltzahn, Carlos
  • Author_Institution
    Sandia Nat. Labs., Albuquerque, NM, USA
  • fYear
    2014
  • Firstpage
    17
  • Lastpage
    24
  • Abstract
    Scientific simulations are moving away from using centralized persistent storage for intermediate data between workflow steps and towards an all-online model. This shift is motivated by the slow growth of I/O bandwidth relative to compute speed. The challenges this shift presents to Integrated Application Workflows stem from the loss of persistent storage semantics for node-to-node communication. One step towards closing this semantics gap is to use transactions to logically delineate a data set moving from 100,000s of processes to 1000s of servers as an atomic unit. Our previously demonstrated Doubly Distributed Transactions (D2T) protocol offered a high-performance solution, but that work focused on typical-case performance and did not explore how to detect and recover from faults. The research presented here addresses fault detection and recovery based on the enhanced protocol design. The total overhead for a full transaction with multiple operations at 65,536 processes is on average 0.055 seconds. The fault detection and recovery mechanisms perform comparably to the success case, adding only the appropriate timeouts for the system. This paper explores the challenges in designing a recoverable protocol for doubly distributed transactions, particularly for parallel computing environments.
  • Keywords
    fault diagnosis; parallel processing; storage management; centralized persistent storage; distributed computing; doubly distributed transaction protocol; failure resilient transaction; fault detection; fault recovery; integrated application workflow; node-to-node communication; parallel computing; Computational modeling; Data models; Memory; Protocols; Semantics; Servers; Standards;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Intensive Scalable Computing Systems (DISCS), 2014 International Workshop on
  • Type
    conf
  • DOI
    10.1109/DISCS.2014.13
  • Filename
    7079022