Title :
Minimizing Remote Accesses in MapReduce Clusters
Author :
Tandon, Prateek ; Cafarella, Michael J. ; Wenisch, Thomas F.
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of Michigan, Ann Arbor, MI, USA
Abstract :
MapReduce, in particular Hadoop, is a popular framework for the distributed processing of large datasets on clusters of relatively inexpensive servers. Although Hadoop clusters are highly scalable and ensure data availability in the face of server failures, their efficiency is poor. We study data placement as a potential source of inefficiency. Despite networking improvements that have narrowed the performance gap between map tasks that access local or remote data, we find that nodes servicing remote HDFS requests see significant slowdowns of collocated map tasks due to interference effects, whereas nodes making these requests do not experience proportionate slowdowns. To reduce remote accesses, and thus avoid their destructive performance interference, we investigate an intelligent data placement policy we call "partitioned data placement". We find that, in an unconstrained cluster where a job's map tasks may be scheduled dynamically on any node over time, Hadoop's default random data placement is effective in avoiding remote accesses. However, when task placement is restricted by long-running jobs or other reservations, partitioned data placement substantially reduces remote access rates (e.g., by as much as 86% over random placement for a job allocated only one-third of a cluster).
Keywords :
distributed databases; distributed programming; information retrieval; pattern classification; pattern clustering; public domain software; Hadoop cluster; Hadoop default random data placement; MapReduce cluster; data availability; distributed processing; intelligent data placement policy; interference effect; placement data partitioning; remote HDFS request; remote data access; server cluster; server failure; unconstrained cluster; Availability; Bandwidth; Delays; Interference; Resource management; Runtime; Servers; Hadoop; MapReduce; data placement; remote accesses
Conference_Title :
Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International
Conference_Location :
Cambridge, MA
Print_ISBN :
978-0-7695-4979-8
DOI :
10.1109/IPDPSW.2013.195