• DocumentCode
    2368595
  • Title

    Probability model for faults in large-scale multicomputer systems

  • Author

    Wang, Gaocai ; Chen, Jianer ; Wang, Guojun ; Chen, Songqiao

  • Author_Institution
    Coll. of Inf. Sci. & Eng., Central South Univ., Changsha, Hunan, China
  • fYear
    2003
  • fDate
    16-19 Nov. 2003
  • Firstpage
    452
  • Lastpage
    457
  • Abstract
    Reliability and availability in the presence of faults are critical in the design of large multicomputer systems, yet both are very difficult to predict. In this paper, we study the reliability and availability of large multicomputer systems under a more realistic model in which each network node has an independent failure probability. We focus on large mesh-connected multicomputer systems and use the connectivity probability of the network as the metric. In a previous work (J. Chen and T. Wang, Proc. 14th Int. Conf. Parallel and Distr. Comp. and Sys., pp. 606-611, 2002), we proved that if the node failure probability is fixed, then the connectivity probability of mesh networks can be arbitrarily small when the network size is sufficiently large. Thus, it is practically important for multicomputer system manufacturers to determine the upper bound on the node failure probability when the required network connectivity probability and the network size are given. We develop a novel technique to formally derive lower bounds on the connectivity probability of mesh networks. Our study shows that mesh networks of practical size can tolerate a large number of faulty nodes and thus are reliable enough for multicomputer systems. For example, we formally prove that as long as the node failure probability is bounded by 0.09% (according to current VLSI technology, building network nodes with a failure probability under 0.09% is achievable), mesh networks of up to a million nodes remain connected with a probability larger than 99%. The results on mesh network reliability and availability are obtained by formal and thorough mathematical proofs.
  • Keywords
    computer network reliability; failure analysis; fault tolerant computing; multiprocessor interconnection networks; probability; VLSI technology; connectivity probability; fault probability model; faulty nodes; large-scale multicomputer systems; mathematical proofs; mesh network availability; mesh network reliability; mesh-connected multicomputer systems; multicomputer systems model; network connectivity probability; network node independent failure probability; network size; Availability; Computer fault tolerance; Computer network reliability; Design engineering; Educational institutions; Failure analysis; Fault tolerance; Information science; Large-scale systems; Manufacturing; Mesh networks; Multiprocessor interconnection; Network topology; Probability; Reliability engineering;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Test Symposium, 2003. ATS 2003. 12th Asian
  • ISSN
    1081-7735
  • Print_ISBN
    0-7695-1951-2
  • Type

    conf

  • DOI
    10.1109/ATS.2003.1250855
  • Filename
    1250855
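
The fault model summarized in the abstract (each node fails independently with probability p, and the metric is the probability that the network stays connected) can be explored empirically with a small Monte Carlo sketch. This is not the paper's method, which derives lower bounds by formal proof. The sketch assumes a two-dimensional rows x cols mesh with 4-neighbor links, and it assumes "connected" means that all non-faulty nodes form a single connected component. The function names (expected_faulty_nodes, is_connected, estimate_connectivity) are hypothetical and chosen only for illustration.

```python
import random
from collections import deque


def expected_faulty_nodes(num_nodes, p):
    """Expected number of faulty nodes when each node fails independently with probability p."""
    return num_nodes * p


def is_connected(rows, cols, faulty):
    """Return True if all non-faulty nodes of a rows x cols mesh form one component.

    Assumption: 2D mesh with 4-neighbor links; `faulty` is a set of (row, col) pairs.
    """
    total_good = rows * cols - len(faulty)
    if total_good <= 1:
        return True  # zero or one surviving node is trivially connected
    # Pick any non-faulty node as the starting point of a breadth-first search.
    start = next(
        (r, c) for r in range(rows) for c in range(cols) if (r, c) not in faulty
    )
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in faulty and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return len(seen) == total_good


def estimate_connectivity(rows, cols, p, trials, seed=0):
    """Monte Carlo estimate of the connectivity probability (empirical, not a proven bound)."""
    rng = random.Random(seed)
    connected = 0
    for _ in range(trials):
        # Each node fails independently with probability p.
        faulty = {
            (r, c) for r in range(rows) for c in range(cols) if rng.random() < p
        }
        if is_connected(rows, cols, faulty):
            connected += 1
    return connected / trials
```

For instance, expected_faulty_nodes(10**6, 0.0009) gives roughly 900, which illustrates what "a large number of faulty nodes" means at the abstract's 0.09% bound for a million-node mesh. A call such as estimate_connectivity(100, 100, 0.0009, 200) runs the simulation on a smaller mesh; a million-node run is possible with this code but slow in pure Python, and any estimate it produces is only an empirical check, not a substitute for the formal lower bounds.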