DocumentCode :
154095
Title :
SMARTH: Enabling Multi-pipeline Data Transfer in HDFS
Author :
Hong Zhang ; Liqiang Wang ; Hai Huang
Author_Institution :
Dept. of Comput. Sci., Univ. of Wyoming, Laramie, WY, USA
fYear :
2014
fDate :
9-12 Sept. 2014
Firstpage :
30
Lastpage :
39
Abstract :
Hadoop is a popular open-source implementation of the MapReduce programming model to handle large data sets, and HDFS is one of Hadoop's most commonly used distributed file systems. Surprisingly, we found that HDFS is inefficient when handling upload of data files from client local file system, especially when the storage cluster is configured to use replicas. The root cause is HDFS's synchronous pipeline design. In this paper, we introduce an improved HDFS design called SMARTH. It utilizes asynchronous multi-pipeline data transfers instead of a single pipeline stop-and-wait mechanism. SMARTH records the actual transfer speed of data blocks and sends this information to the namenode along with periodic heartbeat messages. The namenode sorts datanodes according to their past performance and tracks this information continuously. When a client initiates an upload request, the namenode will send it a list of "high performance" datanodes that it thinks will yield the highest throughput for the client. By choosing higher performance datanodes relative to each client and by taking advantage of the multi-pipeline design, our experiments show that SMARTH significantly improves the performance of data write operations compared to HDFS. Specifically, SMARTH is able to improve the throughput of data transfer by 27-245% in a heterogeneous virtual cluster on Amazon EC2.
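The datanode selection step described in the abstract can be illustrated with a minimal sketch. This is not the SMARTH implementation: the class `NameNodeSketch`, its method names, the report format, and the exponential smoothing of past speeds are all assumptions made for illustration. The abstract states only that transfer speeds are reported with heartbeats and that the namenode sorts datanodes by past performance for each client.

```python
from collections import defaultdict


class NameNodeSketch:
    """Hypothetical namenode-side bookkeeping (not SMARTH's actual code)."""

    def __init__(self, alpha=0.5):
        # alpha: weight of the newest observation (smoothing is an assumption).
        self.alpha = alpha
        # client -> {datanode: smoothed transfer speed, e.g. in MB/s}
        self.speeds = defaultdict(dict)

    def on_heartbeat(self, datanode, reports):
        """Record speeds carried by a heartbeat.

        reports: iterable of (client, observed_speed) pairs.
        """
        for client, speed in reports:
            prev = self.speeds[client].get(datanode)
            if prev is None:
                self.speeds[client][datanode] = speed
            else:
                self.speeds[client][datanode] = (
                    self.alpha * speed + (1 - self.alpha) * prev
                )

    def choose_datanodes(self, client, replicas=3):
        """Return up to `replicas` datanodes, fastest first, for this client."""
        known = self.speeds.get(client, {})
        ranked = sorted(known, key=known.get, reverse=True)
        return ranked[:replicas]


nn = NameNodeSketch(alpha=0.5)
nn.on_heartbeat("dn1", [("c1", 100.0)])
nn.on_heartbeat("dn2", [("c1", 40.0)])
nn.on_heartbeat("dn3", [("c1", 80.0)])
nn.on_heartbeat("dn4", [("c1", 10.0)])
print(nn.choose_datanodes("c1"))
```

Keeping the ranking per client reflects the abstract's phrase "relative to each client": a datanode that is fast for one client may be slow for another in a heterogeneous cluster.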
Keywords :
distributed databases; pipeline processing; Amazon EC2; HDFS synchronous pipeline design; Hadoop; MapReduce programming model; SMARTH; asynchronous multipipeline data transfers; client local file system; data blocks; data files; datanodes; distributed file systems; heterogeneous virtual cluster; multipipeline design; namenode; open source implementation; periodic heartbeat messages; single pipeline stop-and-wait mechanism; storage cluster; Bandwidth; Clustering algorithms; Data communication; Fault tolerance; Fault tolerant systems; Pipelines; Production;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel Processing (ICPP), 2014 43rd International Conference on
Conference_Location :
Minneapolis, MN
ISSN :
0190-3918
Type :
conf
DOI :
10.1109/ICPP.2014.12
Filename :
6957212