Regional Information Center for Science and Technology - A Failure Recovery Solution for Transplanting High-Performance Data-Intensive Algorithms from the Cluster to the Cloud

DocumentCode :

688316

Title :

A Failure Recovery Solution for Transplanting High-Performance Data-Intensive Algorithms from the Cluster to the Cloud

Author :

Da-Qi Ren ; Zane Wei

Author_Institution :

US R&D Center, Huawei Technol., Santa Clara, CA, USA

fYear :

2013

fDate :

13-15 Nov. 2013

Firstpage :

1463

Lastpage :

1468

Abstract :

The computing-cloud manages huge numbers of virtualized resources to provide uniquely beneficial computing paradigms for scientific research. A modern cloud can behave in a virtual context - much like a local homogeneous computer cluster - to deliver High Performance Computing (HPC) platforms that provide public users with access, cut purchase costs, and eliminate the maintenance burden of sophisticated hardware. For decades most distributed scientific computing software has been designed to run on clusters. Research on how to transplant cluster-based programs and performance-tuning mechanisms onto the cloud platform has gathered momentum in recent years. This paper introduces a fault-tolerant approach that assures the reliability of virtual clusters on clouds where high-performance and data-intensive computing paradigms are deployed. We have solved the failure recovery issue for TCP connections containing MPI error handlers by exploiting and modeling the constraints of low-level distributed resources. The combined MPI and TCP environment can support software development for multiple parallel programming models, including asynchronous distributed computing based on MPI for scientific HPC and synchronous distributed computing for big data, such as MapReduce and Pregel. This paper sets out detailed MPI/TCP fault-tolerant mechanisms, including primitives and functions. These elements enable the systematic and hierarchical development of a globally optimized HPC on the cloud platform.

Keywords :

cloud computing; parallel processing; system recovery; virtualisation; HPC platforms; MPI error; MPI/TCP fault-tolerant mechanisms; TCP connections; TCP environment; asynchronous distributed computing; cloud computing; cloud platform; computer cluster; data intensive computing; distributed scientific computing software; failure recovery solution; high performance computing; multiple parallel programming models; performance tuning mechanisms; reliability virtual clusters; software development; sophisticated hardware; transplanting high performance data-intensive algorithms; virtual context; virtualized resources; Cloud computing; Computational modeling; Fault tolerance; Fault tolerant systems; Hardware; Virtual machining; Cloud Computing; Computing; Data-Intensive; Fault Tolerance; High-Performance Computing;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on

Conference_Location :

Zhangjiajie

Type :

conf

DOI :

10.1109/HPCC.and.EUC.2013.207

Filename :

6832089

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=688316