DocumentCode
1665508
Title
A Study of Data Locality in YARN
Author
Elshater, Yehia ; Martin, Patrick ; Rope, Dan ; McRoberts, Mike ; Statchuk, Craig
Author_Institution
Sch. of Comput., Queen's Univ., Kingston, ON, Canada
fYear
2015
Firstpage
174
Lastpage
181
Abstract
Placing computation as close as possible to the data is an important consideration in current data-intensive systems. This is known as the data locality problem. In this paper, we analyze the impact of data locality on YARN, the new version of Hadoop. We investigate the behavior of the YARN delay scheduler with respect to data locality for a variety of workloads and configurations. We address three problems related to data locality. First, we study the trade-off between data locality and job completion time. Second, we observe that considering data locality leads to an imbalance in resource allocation, which may leave the cluster under-utilized. Third, we address the redundant I/O operations that occur when different YARN containers request input data blocks on the same node. Additionally, we propose the YARN Locality Simulator (YLocSim), a tool that simulates the interactions between YARN components in a real cluster and reports data locality percentages in real time. We validate YLocSim against a real cluster setup and use it in our study.
Keywords
data handling; digital simulation; input-output programs; parallel processing; resource allocation; scheduling; Hadoop; I/O operation; YARN delay scheduler behavior; YARN locality simulator tool; YLocSim; data intensive system; data locality; Bandwidth; Benchmark testing; Containers; Delays; Resource management; Simulation; YARN
fLanguage
English
Publisher
IEEE
Conference_Titel
Big Data (BigData Congress), 2015 IEEE International Congress on
Conference_Location
New York, NY
Print_ISBN
978-1-4673-7277-0
Type
conf
DOI
10.1109/BigDataCongress.2015.33
Filename
7207217