• DocumentCode
    1886347
  • Title

    Compiler assisted fault detection for distributed-memory systems

  • Author

    Gong, Chun ; Melhem, Rani ; Gupta, Rajiv

  • Author_Institution
    Dept. of Comput. Sci., Pittsburgh Univ., PA, USA
  • fYear
    1994
  • fDate
    23-25 May 1994
  • Firstpage
    373
  • Lastpage
    380
  • Abstract
    Distributed-memory systems provide the most promising performance-to-cost ratio for multiprocessor computers due to their scalability. However, the issues of fault detection and fault tolerance are critical in such systems, since the probability of having faulty components increases with the number of processors. We propose a methodology for fault detection through compiler support. More specifically, we augment the single-program multiple-data (SPMD) execution model to duplicate selected data items in such a way that, during execution, whenever the value of a duplicated data item is computed, the owners of that item are tested. The proposed compiler-assisted fault detection technique does not require any specialized hardware and allows for a selective choice of redundancy at compile time.
  • Keywords
    computer debugging; distributed memory systems; fault tolerant computing; program compilers; reliability; software reliability; compile time; compiler assisted fault detection; data item duplication; distributed-memory systems; fault tolerance; multiprocessor computers; performance to cost ratio; probability; redundancy; scalability; single-program multiple-data execution model; specialized hardware; Computer science; Costs; Distributed computing; Fault detection; Fault tolerance; Fault tolerant systems; Hardware; Multiprocessing systems; Redundancy; Testing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Proceedings of the Scalable High-Performance Computing Conference, 1994
  • Conference_Location
    Knoxville, TN
  • Print_ISBN
    0-8186-5680-8
  • Type

    conf

  • DOI
    10.1109/SHPCC.1994.296667
  • Filename
    296667