Regional Information Center for Science and Technology - Improving MapReduce performance through data placement in heterogeneous Hadoop clusters

DocumentCode :

2450621

Title :

Improving MapReduce performance through data placement in heterogeneous Hadoop clusters

Author :

Xie, Jiong ; Yin, Shu ; Ruan, Xiaojun ; Ding, Zhiyang ; Tian, Yun ; Majors, James ; Manzanares, Adam ; Qin, Xiao

Author_Institution :

Dept. of Comput. Sci. & Software Eng., Auburn Univ., Auburn, AL, USA

fYear :

2010

fDate :

19-23 April 2010

Firstpage :

Lastpage :

Abstract :

MapReduce has become an important distributed processing model for large-scale data-intensive applications like data mining and web indexing. Hadoop, an open-source implementation of MapReduce, is widely used for short jobs requiring low response time. The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Data locality has not been taken into account for launching speculative map tasks, because it is assumed that most maps are data-local. Unfortunately, neither the homogeneity assumption nor the data-locality assumption holds in virtualized data centers. We show that ignoring the data-locality issue in heterogeneous environments can noticeably reduce MapReduce performance. In this paper, we address the problem of how to place data across nodes so that each node has a balanced data-processing load. Given a data-intensive application running on a Hadoop MapReduce cluster, our data placement scheme adaptively balances the amount of data stored in each node to achieve improved data-processing performance. Experimental results on two real data-intensive applications show that our data placement strategy can always improve MapReduce performance by rebalancing data across nodes before performing a data-intensive application in a heterogeneous Hadoop cluster.

Keywords :

data mining; distributed processing; indexing; information resources; pattern clustering; resource allocation; MapReduce; Web indexing; balanced data processing load; data locality; data mining; data placement strategy; distributed processing model; heterogeneous Hadoop cluster; large scale data intensive application; open source implementation; virtualized data centers; Computer science; Data mining; Data processing; Facebook; Indexing; Large-scale systems; Open source software; Peer to peer computing; Programming profession; Software engineering;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on

Conference_Location :

Atlanta, GA

Print_ISBN :

978-1-4244-6533-0

Type :

conf

DOI :

10.1109/IPDPSW.2010.5470880

Filename :

5470880

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2450621