Regional Information Center for Science and Technology - SMARTH: Enabling Multi-pipeline Data Transfer in HDFS

DocumentCode :

154095

Title :

SMARTH: Enabling Multi-pipeline Data Transfer in HDFS

Author :

Hong Zhang ; Liqiang Wang ; Hai Huang

Author_Institution :

Dept. of Comput. Sci., Univ. of Wyoming, Laramie, WY, USA

fYear :

2014

fDate :

9-12 Sept. 2014

Firstpage :

Lastpage :

Abstract :

Hadoop is a popular open-source implementation of the MapReduce programming model to handle large data sets, and HDFS is one of Hadoop's most commonly used distributed file systems. Surprisingly, we found that HDFS is inefficient when handling upload of data files from client local file system, especially when the storage cluster is configured to use replicas. The root cause is HDFS's synchronous pipeline design. In this paper, we introduce an improved HDFS design called SMARTH. It utilizes asynchronous multi-pipeline data transfers instead of a single pipeline stop-and-wait mechanism. SMARTH records the actual transfer speed of data blocks and sends this information to the namenode along with periodic heartbeat messages. The namenode sorts datanodes according to their past performance and tracks this information continuously. When a client initiates an upload request, the namenode will send it a list of "high performance" datanodes that it thinks will yield the highest throughput for the client. By choosing higher performance datanodes relative to each client and by taking advantage of the multi-pipeline design, our experiments show that SMARTH significantly improves the performance of data write operations compared to HDFS. Specifically, SMARTH is able to improve the throughput of data transfer by 27-245% in a heterogeneous virtual cluster on Amazon EC2.
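
The datanode selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the SMARTH implementation: the class `NameNodeSketch`, its methods, the use of a plain per-client average as the "past performance" score, and the sample speeds are all assumptions made for illustration. The abstract states only that transfer speeds are reported with heartbeats and that the namenode sorts datanodes by past performance relative to each client.

```python
from collections import defaultdict


class NameNodeSketch:
    """Hypothetical stand-in for the namenode's bookkeeping (not SMARTH code)."""

    def __init__(self):
        # client -> datanode -> observed transfer speeds in MB/s.
        self._speeds = defaultdict(lambda: defaultdict(list))

    def report_speed(self, client, datanode, mb_per_s):
        """Record one observed block-transfer speed, as if carried by a heartbeat."""
        self._speeds[client][datanode].append(mb_per_s)

    def pick_datanodes(self, client, replicas=3):
        """Return up to `replicas` datanodes, fastest average first for this client."""
        observed = self._speeds.get(client, {})
        # Assumption: score each datanode by the mean of its recorded speeds.
        average = {dn: sum(v) / len(v) for dn, v in observed.items()}
        # Sort by descending average; break ties by name for a stable order.
        ranked = sorted(average, key=lambda dn: (-average[dn], dn))
        return ranked[:replicas]


if __name__ == "__main__":
    namenode = NameNodeSketch()
    # Made-up sample measurements for one client.
    for datanode, speed in [("dn1", 40.0), ("dn2", 90.0), ("dn3", 65.0),
                            ("dn4", 10.0), ("dn1", 100.0)]:
        namenode.report_speed("client-a", datanode, speed)
    print(namenode.pick_datanodes("client-a", replicas=3))
```

Keeping the history per client reflects the abstract's phrase "relative to each client": a datanode that is fast for one client may sit behind a slow link for another.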

Keywords :

distributed databases; pipeline processing; Amazon EC2; HDFS synchronous pipeline design; Hadoop; MapReduce programming model; SMARTH; asynchronous multipipeline data transfers; client local file system; data blocks; data files; datanodes; distributed file systems; heterogeneous virtual cluster; multipipeline design; namenode; open source implementation; periodic heartbeat messages; single pipeline stop-and-wait mechanism; storage cluster; Bandwidth; Clustering algorithms; Data communication; Fault tolerance; Fault tolerant systems; Pipelines; Production;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Parallel Processing (ICPP), 2014 43rd International Conference on

Conference_Location :

Minneapolis, MN

ISSN :

0190-3918

Type :

conf

DOI :

10.1109/ICPP.2014.12

Filename :

6957212

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=154095