• DocumentCode
    2441088
  • Title
    Scalable failure recovery for high-performance data aggregation
  • Author
    Arnold, Dorian C.; Miller, Barton P.
  • Author_Institution
    Dept. of Comput. Sci., Univ. of New Mexico, Albuquerque, NM, USA
  • fYear
    2010
  • fDate
    19-23 April 2010
  • Firstpage
    1
  • Lastpage
    11
  • Abstract
    Many high-performance tools, applications and infrastructures, such as Paradyn, STAT, TAU, Ganglia, SuperMon, Astrolabe, Borealis, and MRNet, use data aggregation to synthesize large data sets and reduce data volumes while retaining relevant information content. Hierarchical or tree-based overlay networks (TBONs) are often used to execute data aggregation operations in a scalable, piecewise fashion. In this paper, we present state compensation, a scalable failure recovery model for high-bandwidth, low-latency TBON computations. By leveraging inherently redundant state information found in many TBON computations, state compensation avoids explicit state replication (for example, process checkpoints and message logging) and incurs no overhead in the absence of failures. Further, when failures do occur, state compensation uses a weak data consistency model and localized protocols that allow processes to recover from failures independently and responsively. Based on a formal specification of our data aggregation model, we have validated state compensation and identified its assumptions and limitations: state compensation requires that data aggregation operations be associative, commutative and idempotent. In this paper, we describe the fundamental state compensation concepts and a prototype implementation integrated into the MRNet TBON infrastructure. Our experiments with this framework suggest that for TBONs supporting up to millions of application processes, state compensation can yield millisecond recovery latencies and inconsequential application perturbation.
  • Keywords
    data handling; trees (mathematics); formal specification; high-performance data aggregation; localized protocols; recovery latencies; scalable failure recovery; scalable failure recovery model; tree-based overlay networks; Computational modeling; Computer networks; Data analysis; Delay; Distributed computing; Large-scale systems; Network synthesis; Protocols; Prototypes; Synchronization; large scale computing; robust data aggregation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on
  • Conference_Location
    Atlanta, GA
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4244-6442-5
  • Type
    conf
  • DOI
    10.1109/IPDPS.2010.5470432
  • Filename
    5470432
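
The abstract states that state compensation requires aggregation operations to be associative, commutative and idempotent. The following is a minimal sketch of why those properties matter, not the authors' protocol or the MRNet API: the tree layout, the node names, and the "orphans re-attach to the root and may resend data" recovery step are all assumptions made for illustration. It compares set union (which has all three properties) with integer sum (which is not idempotent) on a small tree before and after a simulated failure of one internal node.

```python
from functools import reduce


def tree_reduce(op, tree, node):
    """Aggregate bottom-up. Internal nodes map to a list of children;
    leaves map directly to their local value."""
    entry = tree[node]
    if not isinstance(entry, list):
        return entry  # leaf: return its local value
    return reduce(op, [tree_reduce(op, tree, child) for child in entry])


def build_trees(leaf_values):
    """Return (healthy, recovered) trees over the same four leaves.

    Hypothetical recovery scenario: internal node "a" fails, its orphaned
    children l1 and l2 re-attach to the root in a different order, and
    l1's data is delivered twice (redundant state reaching the new parent).
    """
    healthy = {"root": ["a", "b"], "a": ["l1", "l2"], "b": ["l3", "l4"]}
    recovered = {"root": ["b", "l2", "l1", "l1"], "b": ["l3", "l4"]}
    healthy.update(leaf_values)
    recovered.update(leaf_values)
    return healthy, recovered


def union(x, y):
    # Associative, commutative and idempotent.
    return x | y


def add(x, y):
    # Associative and commutative, but NOT idempotent.
    return x + y


set_leaves = {f"l{i}": frozenset({i}) for i in range(1, 5)}
int_leaves = {f"l{i}": i for i in range(1, 5)}

set_healthy, set_recovered = build_trees(set_leaves)
int_healthy, int_recovered = build_trees(int_leaves)

union_before = tree_reduce(union, set_healthy, "root")
union_after = tree_reduce(union, set_recovered, "root")
sum_before = tree_reduce(add, int_healthy, "root")
sum_after = tree_reduce(add, int_recovered, "root")

print(union_before == union_after)  # union is unaffected by regrouping and duplicates
print(sum_before, sum_after)        # sum double-counts the duplicated leaf
```

In this sketch, union gives the same result on both trees because regrouping, reordering and duplicated inputs do not change it, whereas sum counts the resent leaf twice. That is the intuition behind the stated requirement; the paper's formal specification should be consulted for the actual model.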