Title :
Harp: Collective Communication on Hadoop
Author :
Bingjing Zhang ; Yang Ruan ; Judy Qiu
Author_Institution :
Comput. Sci. Dept., Indiana Univ., Bloomington, IN, USA
Abstract :
Big data processing tools have evolved rapidly in recent years. MapReduce has proven very successful but is not optimized for many important analytics, especially those involving iteration. In this regard, Iterative MapReduce frameworks improve the performance of MapReduce job chains through caching. Further, Pregel, Giraph and GraphLab abstract data as a graph and process it in iterations. But all these tools are designed with a fixed data abstraction and have limited collective communication support to synchronize application data and algorithm control states among parallel processes. In this paper, we introduce a collective communication abstraction layer which provides efficient collective communication operations on several common data abstractions such as arrays, key-values and graphs, and define a Map Collective programming model which serves the diverse collective communication demands in different parallel algorithms. We implement a library called Harp to provide the features above and plug it into Hadoop so that applications abstracted in the Map Collective model can be easily developed on top of the MapReduce framework and conveniently integrated with other tools in the Apache Big Data Stack. With improved expressiveness in the abstraction and excellent performance in the implementation, we can simultaneously support various applications from HPC to Cloud systems with high performance.
Keywords :
data handling; parallel algorithms; programming; Giraph abstract data; GraphLab abstract data; HPC systems; Hadoop; Harp; MapCollective programming model; Pregel abstract data; Apache Big Data Stack; application data; arrays; big data processing tools; cloud systems; collective communication abstraction layer; data abstractions; fixed data abstraction; iterative MapReduce frameworks; job chains; key-values; parallel processes; Arrays; Big data; Computational modeling; Data models; Partitioning algorithms; Programming; Time complexity; Big Data Processing; Collective Communication
Conference_Title :
2015 IEEE International Conference on Cloud Engineering (IC2E)
Conference_Location :
Tempe, AZ
DOI :
10.1109/IC2E.2015.35