DocumentCode :
688316
Title :
A Failure Recovery Solution for Transplanting High-Performance Data-Intensive Algorithms from the Cluster to the Cloud
Author :
Da-Qi Ren ; Zane Wei
Author_Institution :
US R&D Center, Huawei Technol., Santa Clara, CA, USA
fYear :
2013
fDate :
13-15 Nov. 2013
Firstpage :
1463
Lastpage :
1468
Abstract :
The computing cloud manages huge numbers of virtualized resources to provide uniquely beneficial computing paradigms for scientific research. A modern cloud can behave in a virtual context - much like a local homogeneous computer cluster - to deliver High Performance Computing (HPC) platforms that provide public users with access, cut purchase costs, and eliminate the maintenance burden of sophisticated hardware. For decades most distributed scientific computing software has been designed to run on clusters. Research on how to transplant cluster-based programs and performance-tuning mechanisms onto the cloud platform has gathered momentum in recent years. This paper introduces a fault-tolerant approach that assures the reliability of virtual clusters on clouds where high-performance and data-intensive computing paradigms are deployed. We have solved the failure recovery issue for TCP connections containing MPI error handlers by exploiting and modeling the constraints of low-level distributed resources. The combined MPI and TCP environment can support software development for multiple parallel programming models, including asynchronous distributed computing based on MPI for scientific HPC and synchronous distributed computing for big data, such as MapReduce and Pregel. This paper sets out detailed MPI/TCP fault-tolerant mechanisms, including primitives and functions. These elements enable the systematic and hierarchical development of a globally optimized HPC on the cloud platform.
Keywords :
cloud computing; parallel processing; system recovery; virtualisation; HPC platforms; MPI error; MPI/TCP fault-tolerant mechanisms; TCP connections; TCP environment; asynchronous distributed computing; cloud computing; cloud platform; computer cluster; data intensive computing; distributed scientific computing software; failure recovery solution; high performance computing; multiple parallel programming models; performance tuning mechanisms; reliability virtual clusters; software development; sophisticated hardware; transplanting high performance data-intensive algorithms; virtual context; virtualized resources; Cloud computing; Computational modeling; Fault tolerance; Fault tolerant systems; Hardware; Virtual machining; Cloud Computing; Computing; Data-Intensive; Fault Tolerance; High-Performance Computing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on
Conference_Location :
Zhangjiajie
Type :
conf
DOI :
10.1109/HPCC.and.EUC.2013.207
Filename :
6832089