Title :
Distributed multicomputer system availability based on measurements: A case study
Author_Institution :
Digital Equipment Corp., Maynard, MA, USA
Abstract :
The author presents an experimental approach to evaluating the availability of distributed multicomputer systems. A distributed system was measured in an operational environment. To understand system failure behavior, all host computer restarts and their causes were collected. Because there was no centralized automatic logging mechanism, data were collected from each individual computer. The proposed method identifies multiple-failure events from the ERRLOG data of 14 VAX hosts using the moving window technique and possibility reasoning. The proposed rules, although very simple and focused only on high-level reasoning, demonstrate a framework for using possibility reasoning in decision making. The study was conducted on a large-scale VAXcluster system. Results showed that about 55% of restarts were due to dependent failures, most of which were scheduled orderly shutdowns. System availability was then estimated from a performance aspect.
Keywords :
distributed processing; multiprocessing systems; performance evaluation; ERRLOG data; VAX hosts; decision making; distributed multicomputer system availability; possibility reasoning; system failure behavior; window technique; Availability; Computer aided software engineering; Control systems; Distributed computing; Hardware; Operating systems; System performance; Time measurement; Topology; Voice mail
Conference_Title :
Computers and Communications, 1991. Conference Proceedings., Tenth Annual International Phoenix Conference on
Conference_Location :
Scottsdale, AZ
Print_ISBN :
0-8186-2133-8
DOI :
10.1109/PCCC.1991.113795