• DocumentCode
    2126275
  • Title

    Fault-tolerance in filter-labeled-stream applications

  • Author

    Coutinho, Bruno ; Guedes, Dorgival ; Meira, Wagner, Jr. ; Ferreira, Renato A.

  • Author_Institution
    Univ. Fed. de Minas Gerais, Belo Horizonte
  • fYear
    2007
  • fDate
    24-27 Oct. 2007
  • Firstpage
    229
  • Lastpage
    236
  • Abstract
    Fault tolerance is a desirable feature in distributed high-performance systems, since applications tend to run for long periods of time and faults become more likely as the number of nodes in the system increases. However, most distributed environments lack any fault-tolerant features, since these tend to be hard to implement and use, and often hurt performance dramatically. In this paper, we discuss how we successfully added fault tolerance to the Anthill distributed programming environment by using an application-level checkpoint/rollback solution. The programming model offers an abstraction where the programmer can easily identify points during the execution where the communication pattern is well defined, forming a consistent cut where checkpoints may be saved without requiring extra communication, avoiding any domino effect during recovery from faults. We present the new abstractions for fault tolerance, describe how the solution was implemented, and present performance results that show the efficiency of the solution with both regular and irregular applications.
  • Keywords
    checkpointing; distributed programming; fault tolerant computing; programming environments; Anthill distributed programming environment; application-level checkpoint solution; application-level rollback solution; distributed high-performance systems; fault tolerance abstractions; filter labeled stream applications; Application software; Availability; Computer architecture; Computer science; Fault diagnosis; Fault tolerance; Fault tolerant systems; High performance computing; Programming environments; Programming profession;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Architecture and High Performance Computing, 2007. SBAC-PAD 2007. 19th International Symposium on
  • Conference_Location
    Rio Grande do Sul
  • ISSN
    1550-6533
  • Print_ISBN
    978-0-7695-3014-7
  • Type

    conf

  • DOI
    10.1109/SBAC-PAD.2007.31
  • Filename
    4384062