DocumentCode :
1926105
Title :
Towards Fault-Tolerant Energy-Efficient High Performance Computing in the Cloud
Author :
Keville, Kurt L. ; Garg, Rohan ; Yates, David J. ; Arya, Kapil ; Cooperman, Gene
Author_Institution :
Mass. Inst. of Technol., Cambridge, MA, USA
fYear :
2012
fDate :
24-28 Sept. 2012
Firstpage :
622
Lastpage :
626
Abstract :
In cluster computing, power and cooling represent a significant cost compared to the hardware itself. This is of special concern in the cloud, which provides access to large numbers of computers. We examine the use of ARM-based clusters for low-power, high performance computing. This work examines two likely use modes: (i) a standard dedicated cluster, and (ii) a cluster of pre-configured virtual machines in the cloud. A 40-node department-level cluster based on an ARM Cortex-A9 is compared against a similar cluster based on an Intel Core2 Duo, in contrast to a recent similar study on just a 4-node cluster. For the NAS benchmarks on 32-node clusters, ARM was found to have a power efficiency ranging from 1.3 to 6.2 times greater than that of Intel. This is despite Intel's approximately five times greater performance. The particular efficiency ratio depends primarily on the size of the working set relative to L2 cache. In addition to energy-efficient computing, this study also emphasizes fault tolerance, an important ingredient in high performance computing. It relies on two recent extensions to the DMTCP checkpoint-restart package. DMTCP was extended (i) to support ARM CPUs, and (ii) to support checkpointing of the Qemu virtual machine in user mode. DMTCP is used both to checkpoint native distributed applications and to checkpoint a network of virtual machines. This latter case demonstrates the ability to deploy pre-configured software in virtual machines hosted in the cloud, and further to migrate cluster computation between hosts in the cloud.
Keywords :
cache storage; checkpointing; cloud computing; energy conservation; fault tolerant computing; parallel architectures; parallel machines; power aware computing; virtual machines; workstation clusters; 32-node clusters; 40-node department-level cluster; ARM CPU; ARM Cortex-A9; ARM-based clusters; DMTCP checkpoint-restart package; L2 cache; Qemu virtual machine; checkpoint native distributed application; cloud computing; cluster computing; cooling; energy-efficient computing; fault tolerant energy-efficient high performance computing; low-power computing; power efficiency; preconfigured software deployment; preconfigured virtual machine cluster; Benchmark testing; Checkpointing; Computer architecture; Computers; High performance computing; Sockets; Virtual machining; ARM; checkpoint-restart; cluster computing; energy-efficient computing; high performance computing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster Computing (CLUSTER), 2012 IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4673-2422-9
Type :
conf
DOI :
10.1109/CLUSTER.2012.74
Filename :
6337837