Regional Information Center for Science and Technology - Towards Fault-Tolerant Energy-Efficient High Performance Computing in the Cloud

DocumentCode :

1926105

Title :

Towards Fault-Tolerant Energy-Efficient High Performance Computing in the Cloud

Author :

Keville, Kurt L. ; Garg, Rohan ; Yates, David J. ; Arya, Kapil ; Cooperman, Gene

Author_Institution :

Mass. Inst. of Technol., Cambridge, MA, USA

fYear :

2012

fDate :

24-28 Sept. 2012

Firstpage :

622

Lastpage :

626

Abstract :

In cluster computing, power and cooling represent a significant cost compared to the hardware itself. This is of special concern in the cloud, which provides access to large numbers of computers. We examine the use of ARM-based clusters for low-power, high performance computing. This work examines two likely use-modes: (i) a standard dedicated cluster, and (ii) a cluster of pre-configured virtual machines in the cloud. A 40-node department-level cluster based on an ARM Cortex-A9 is compared against a similar cluster based on an Intel Core2 Duo, in contrast to a recent similar study on just a 4-node cluster. For the NAS benchmarks on 32-node clusters, ARM was found to have a power efficiency ranging from 1.3 to 6.2 times greater than that of Intel. This is despite Intel's approximately five times greater performance. The particular efficiency ratio depends primarily on the size of the working set relative to L2 cache. In addition to energy-efficient computing, this study also emphasizes fault tolerance: an important ingredient in high performance computing. It relies on two recent extensions to the DMTCP checkpoint-restart package. DMTCP was extended (i) to support ARM CPUs, and (ii) to support checkpointing of the Qemu virtual machine in user-mode. DMTCP is used both to checkpoint native distributed applications, and to checkpoint a network of virtual machines. This latter case demonstrates the ability to deploy pre-configured software in virtual machines hosted in the cloud, and further to migrate cluster computation between hosts in the cloud.

Keywords :

cache storage; checkpointing; cloud computing; energy conservation; fault tolerant computing; parallel architectures; parallel machines; power aware computing; virtual machines; workstation clusters; 32-node clusters; 40-node department-level cluster; ARM CPU; ARM Cortex-A9; ARM-based clusters; DMTCP checkpoint-restart package; L2 cache; Qemu virtual machine; checkpoint native distributed application; cloud computing; cluster computing; cooling; energy-efficient computing; fault tolerant energy-efficient high performance computing; low-power computing; power efficiency; preconfigured software deployment; preconfigured virtual machine cluster; Benchmark testing; Checkpointing; Computer architecture; Computers; High performance computing; Sockets; Virtual machining; ARM; checkpoint-restart; cluster computing; energy-efficient computing; high performance computing;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Cluster Computing (CLUSTER), 2012 IEEE International Conference on

Conference_Location :

Beijing

Print_ISBN :

978-1-4673-2422-9

Type :

conf

DOI :

10.1109/CLUSTER.2012.74

Filename :

6337837

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1926105