• DocumentCode
    2139957
  • Title

    Building Single Fault Survivable Parallel Algorithms for Matrix Operations Using Redundant Parallel Computation

  • Author

    Du, Yunfei ; Wang, Panfeng ; Fu, Hongyi ; Jia, Jia ; Zhou, Haifang ; Yang, Xuejun

  • Author_Institution
    Nat. Univ. of Defense Technol., Changsha
  • fYear
    2007
  • fDate
    16-19 Oct. 2007
  • Firstpage
    285
  • Lastpage
    290
  • Abstract
    As the size of today's high performance computers continues to grow, node failures in these computers are becoming frequent events. Although checkpointing is the typical technique for tolerating such failures, it often introduces considerable overhead and has shown poor scalability on today's large-scale systems. In this paper we define a new term, fault tolerant parallel algorithm, which means that the algorithm gets the correct answer despite the failure of nodes. The fault tolerance approach, in which the data of failed processes is recovered by modifying applications to recompute on all surviving processes, is checkpoint-free. In particular, if no failure occurs, the fault tolerant parallel algorithms are the same as the original algorithms. We show the practicality of this technique by applying it to parallel dense matrix-matrix multiplication and Gaussian elimination to tolerate a single process failure. Experimental results demonstrate that a process failure can be tolerated with good scalability for the two fault tolerant parallel algorithms, and that the proposed fault tolerant parallel dense matrix-matrix multiplication is able to survive a process failure with very low performance overhead. The main drawback of this approach is that it is non-transparent and algorithm-dependent.
  • Keywords
    Gaussian processes; fault tolerant computing; matrix algebra; parallel algorithms; Gaussian elimination; fault survivable parallel algorithms; fault tolerance approach; matrix operations; parallel dense matrix-matrix multiplication; redundant parallel computation; Concurrent computing; Distributed computing; Distributed processing; Fault tolerance; Hardware; High performance computing; Information technology; Laboratories; Parallel algorithms; Scalability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer and Information Technology, 2007. CIT 2007. 7th IEEE International Conference on
  • Conference_Location
    Aizu-Wakamatsu, Fukushima
  • Print_ISBN
    978-0-7695-2983-7
  • Type

    conf

  • DOI
    10.1109/CIT.2007.27
  • Filename
    4385095