• DocumentCode
    2552801
  • Title
    A scalable and efficient self-organizing failure detector for grid applications

  • Author
    Horita, Yuuki ; Taura, Kenjiro ; Chikayama, Takashi

  • Author_Institution
    Tokyo Univ., Japan
  • fYear
    2005
  • fDate
    13-14 Nov. 2005
  • Abstract
    Failure detection and group membership management are basic building blocks for self-repairing systems in distributed environments, which need to be scalable, reliable, and efficient in practice. As available resources grow in size and become more widely distributed, it becomes more essential that they can be used easily, with little manual configuration, in grid environments where connectivity between different networks may be limited by firewalls and NATs. In this paper, we present a scalable failure detection protocol that self-organizes in grid environments. Our failure detectors autonomously create dispersed monitoring relationships among participating processes with almost no manual configuration, so that each process is monitored by a small number of other processes, and they quickly disseminate notifications along the monitoring relationships when failures are detected. Through simulations and real experiments, we show that our failure detector has practical scalability, high reliability, and good efficiency. The overhead with 313 processes was at most 2 percent even when the heartbeat interval was set to 0.1 second, and correspondingly smaller with longer intervals.
  • Keywords
    computer networks; failure analysis; fault diagnosis; fault tolerant computing; grid computing; groupware; transport protocols; distributed environments; failure detection protocol; grid applications; grid environments; group membership management; notification dissemination; process relationship monitoring; self-organizing failure detector; self-repairing systems; Computer crashes; Condition monitoring; Detectors; Environmental management; Fault detection; Heart beat; Libraries; Network address translation; Protocols; Scalability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Grid Computing, 2005. The 6th IEEE/ACM International Workshop on
  • Print_ISBN
    0-7803-9492-5
  • Type
    conf
  • DOI
    10.1109/GRID.2005.1542743
  • Filename
    1542743