DocumentCode :
2450621
Title :
Improving MapReduce performance through data placement in heterogeneous Hadoop clusters
Author :
Xie, Jiong ; Yin, Shu ; Ruan, Xiaojun ; Ding, Zhiyang ; Tian, Yun ; Majors, James ; Manzanares, Adam ; Qin, Xiao
Author_Institution :
Dept. of Comput. Sci. & Software Eng., Auburn Univ., Auburn, AL, USA
fYear :
2010
fDate :
19-23 April 2010
Firstpage :
1
Lastpage :
9
Abstract :
MapReduce has become an important distributed processing model for large-scale data-intensive applications such as data mining and web indexing. Hadoop, an open-source implementation of MapReduce, is widely used for short jobs requiring low response time. The current Hadoop implementation assumes that the computing nodes in a cluster are homogeneous. Data locality is not taken into account when launching speculative map tasks, because it is assumed that most maps are data-local. Unfortunately, neither the homogeneity assumption nor the data-locality assumption holds in virtualized data centers. We show that ignoring the data-locality issue in heterogeneous environments can noticeably reduce MapReduce performance. In this paper, we address the problem of how to place data across nodes so that each node has a balanced data-processing load. Given a data-intensive application running on a Hadoop MapReduce cluster, our data placement scheme adaptively balances the amount of data stored in each node to achieve improved data-processing performance. Experimental results on two real data-intensive applications show that our data placement strategy can always improve MapReduce performance by rebalancing data across nodes before running a data-intensive application in a heterogeneous Hadoop cluster.
Keywords :
data mining; distributed processing; indexing; information resources; pattern clustering; resource allocation; MapReduce; Web indexing; balanced data processing load; data locality; data mining; data placement strategy; distributed processing model; heterogeneous Hadoop cluster; large scale data intensive application; open source implementation; virtualized data centers; Computer science; Data mining; Data processing; Facebook; Indexing; Large-scale systems; Open source software; Peer to peer computing; Programming profession; Software engineering;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on
Conference_Location :
Atlanta, GA
Print_ISBN :
978-1-4244-6533-0
Type :
conf
DOI :
10.1109/IPDPSW.2010.5470880
Filename :
5470880