Title :
SLDP: A Novel Data Placement Strategy for Large-Scale Heterogeneous Hadoop Cluster
Author :
Runqun Xiong ; Junzhou Luo ; Fang Dong
Author_Institution :
Sch. of Comput. Sci. & Eng., Southeast Univ., Nanjing, China
Abstract :
Hadoop, a popular open-source implementation of MapReduce, is widely used for large-scale data-intensive applications such as data mining, web indexing, and scientific computing. The current Hadoop implementation assumes that the nodes in a cluster are homogeneous, and the Hadoop Distributed File System (HDFS) distributes data to multiple nodes based on disk space availability. Such a data placement strategy is very efficient for homogeneous environments, where nodes are identical in terms of both computing power and disk capacity. Unfortunately, in practice, the homogeneity assumptions do not always hold. With the default data placement strategy of HDFS, Hadoop's scheduler leads to severe performance degradation and energy dissipation in heterogeneous environments. In this paper, we propose a novel snakelike data placement mechanism (SLDP) for large-scale heterogeneous Hadoop clusters. SLDP first uses a heterogeneity-aware algorithm to divide the nodes into several virtual storage tiers (VSTs), and then places data blocks across the nodes in each VST circuitously, according to the hotness of the data. Furthermore, SLDP uses hotness-proportional replication to reduce disk space consumption, and it also provides an effective power control function. Experimental results on two real data-intensive applications show that SLDP is energy-efficient and space-saving, and that it significantly improves MapReduce performance in a heterogeneous Hadoop cluster.
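The abstract does not give the details of the algorithm, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes that each node has a single numeric capability score, that nodes are split into equally sized tiers by that score, that the hottest blocks go to the most capable tier, and that "snakelike" means sweeping the nodes of a tier forward and then backward. The names `split_even`, `build_tiers`, `snake_order`, and `place_blocks` are hypothetical.

```python
from typing import Dict, List, Sequence


def split_even(items: Sequence[str], parts: int) -> List[List[str]]:
    """Split items into `parts` contiguous groups whose sizes differ by at most one."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    size, extra = divmod(len(items), parts)
    groups, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        groups.append(list(items[start:end]))
        start = end
    return groups


def build_tiers(capability: Dict[str, float], num_tiers: int) -> List[List[str]]:
    """Assumed tiering rule: rank nodes by capability (descending), then split evenly."""
    ranked = sorted(capability, key=lambda n: (-capability[n], n))
    return split_even(ranked, num_tiers)


def snake_order(nodes: Sequence[str], count: int) -> List[str]:
    """Return `count` node assignments, sweeping forward, then backward, and so on."""
    if count > 0 and not nodes:
        raise ValueError("cannot place blocks on an empty tier")
    order: List[str] = []
    forward = True
    while len(order) < count:
        order.extend(nodes if forward else reversed(nodes))
        forward = not forward
    return order[:count]


def place_blocks(hotness: Dict[str, float], tiers: List[List[str]]) -> Dict[str, str]:
    """Assumed policy: hottest blocks go to the first tier, snaking within each tier."""
    ranked = sorted(hotness, key=lambda b: (-hotness[b], b))
    placement: Dict[str, str] = {}
    for tier, blocks in zip(tiers, split_even(ranked, len(tiers))):
        for block, node in zip(blocks, snake_order(tier, len(blocks))):
            placement[block] = node
    return placement


# Hypothetical cluster: capability scores and block hotness values are made up.
tiers = build_tiers({"n1": 4.0, "n2": 4.0, "n3": 2.0, "n4": 1.0}, num_tiers=2)
placement = place_blocks({f"b{i}": 7.0 - i for i in range(1, 7)}, tiers)
print(tiers)
print(placement)
```

The back-and-forth sweep means that consecutive passes over a tier do not always start on the same node, which is one plausible reading of placing blocks "circuitously". Replication and power control are not modeled here.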
Keywords :
data handling; data mining; parallel processing; public domain software; scheduling; virtual storage; workstation clusters; HDFS; Hadoop distributed file system; Hadoop scheduler; MapReduce; SLDP; VST; Web indexing; cluster nodes; computing power; data block placement; data distribution; data hotness; data mining; data placement strategy; disk capacity; disk space availability; disk space consumption reduction; energy dissipation; energy efficiency; heterogeneity aware algorithm; homogeneous environment; hotness proportional replication; large scale data-intensive application; large-scale heterogeneous Hadoop cluster; open-source implementation; performance degradation; power control function; scientific computing; snakelike data placement mechanism; space saving; virtual storage tiers; Big data; Clustering algorithms; Distributed databases; Google; Peer-to-peer computing; Power demand; Servers;
Conference_Title :
Advanced Cloud and Big Data (CBD), 2014 Second International Conference on
Print_ISBN :
978-1-4799-8086-4
DOI :
10.1109/CBD.2014.57