Title :
A virtual machine based task scheduling approach to improving data locality for virtualized Hadoop
Author :
Ruiqi Sun ; Jie Yang ; Zhan Gao ; ZhiQiang He
Author_Institution :
Sch. of Comput. Sci. & Eng., Beihang Univ., Beijing, China
Abstract :
MapReduce has emerged as an important distributed programming paradigm for large-scale data analysis applications. As an open-source implementation of MapReduce, Hadoop is an attractive platform for many enterprises. A traditional Hadoop cluster deployed on a large number of physical machines has drawbacks, such as burdensome cluster management and fluctuating resource utilization. A virtualized Hadoop cluster not only simplifies cluster management but also facilitates cost-effective workload consolidation for better resource utilization. In a Hadoop system, data locality is a critical factor affecting the performance of MapReduce applications. However, existing task scheduling approaches to improving data locality in virtualized Hadoop are not effective because data is distributed at two levels: virtual machines and physical servers. In this paper, we deploy a virtualized Hadoop cluster in which computing nodes and storage nodes are placed in separate virtual machines to improve flexibility. We propose a novel task scheduling approach that aims to improve data locality for a virtualized Hadoop cluster by migrating the virtual machine acting as a computing node to the physical server running the virtual machine that acts as a storage node and holds a data replica needed by that computing node. We evaluated the efficiency of our approach on a virtualized Hadoop cluster with the aforementioned deployment, comprising 11 computing nodes and 12 storage nodes. Our experimental results show that our approach improves the performance of 86% of the typical MapReduce applications in our benchmark suite, to varying degrees.
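The migration idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the `Cluster` structure, the `pick_migration_target` function, and the "first replica wins" rule are all hypothetical choices made for illustration, and the abstract does not say how the paper chooses among several replicas or weighs migration cost.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    # All three mappings are hypothetical bookkeeping, not taken from the paper.
    compute_host: dict  # computing-node VM name -> physical server name
    storage_host: dict  # storage-node VM name -> physical server name
    replicas: dict      # data block id -> list of storage-node VMs holding a replica


def pick_migration_target(cluster, compute_vm, block_id):
    """Return the physical server to migrate compute_vm to.

    Returns None when the VM already shares a server with a replica,
    or when no replica of the block is known.
    """
    current = cluster.compute_host[compute_vm]
    # Physical servers that run a storage VM holding a replica of the block.
    candidates = [cluster.storage_host[vm] for vm in cluster.replicas.get(block_id, [])]
    if not candidates or current in candidates:
        return None
    # Assumption: simply take the first candidate; a real scheduler would
    # likely also consider server load and migration cost.
    return candidates[0]


cluster = Cluster(
    compute_host={"c1": "hostA", "c2": "hostC"},
    storage_host={"s1": "hostB", "s2": "hostC"},
    replicas={"blk1": ["s1", "s2"]},
)
print(pick_migration_target(cluster, "c1", "blk1"))
print(pick_migration_target(cluster, "c2", "blk1"))
```

In this toy setup, `c1` runs on a server with no replica of `blk1` and is directed to a server that has one, while `c2` already shares a server with storage VM `s2` and stays where it is.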
Keywords :
Big Data; data analysis; distributed databases; scheduling; virtual machines; Big Data analysis; MapReduce applications; data locality; distributed programming; storage node; task scheduling; virtual machine; virtualized Hadoop cluster; Benchmark testing; Computer architecture; Monitoring; Processor scheduling; Resource management; Servers; Virtualization; MapReduce; live migrate; scheduling algorithm; virtualized Hadoop
Conference_Title :
Computer and Information Science (ICIS), 2014 IEEE/ACIS 13th International Conference on
Conference_Location :
Taiyuan
DOI :
10.1109/ICIS.2014.6912150