Title :
Novel Data-Distribution Technique for Hadoop in Heterogeneous Cloud Environments
Author :
Ubarhande, Vrushali ; Popescu, Alina-Madalina ; Gonzalez-Velez, Horacio
Author_Institution :
Cloud Competency Centre, Nat. Coll. of Ireland, Dublin, Ireland
Abstract :
The Hadoop framework has been developed to effectively process data-intensive MapReduce applications. Hadoop users specify the application computation logic in terms of a map and a reduce function; such programs are often termed MapReduce applications. The Hadoop distributed file system stores the MapReduce application data on the Hadoop cluster nodes called Data nodes, whereas the Name node is a control point for all Data nodes. While Hadoop's resilience is increased, its current data-distribution methodologies are not necessarily efficient for heterogeneous distributed environments such as public clouds. This work contends that existing data-distribution techniques are not necessarily suitable, since the performance of Hadoop typically degrades in heterogeneous environments whenever data distribution does not take the computing capability of the nodes into account. Data locality is a key factor, since it affects performance in the Map phase when tasks are scheduled. The task scheduling techniques in Hadoop should arguably consider data locality to enhance performance. Various task scheduling techniques have been analysed to understand their data-locality awareness while scheduling applications. Other system factors also play a major role in achieving high performance in Hadoop data processing. The main contribution of this work is a novel methodology for data placement for Hadoop Data nodes based on their computing ratio. Two standard MapReduce applications, Word Count and Grep, have been executed, and a significant performance improvement has been observed with our proposed data-distribution technique.
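The idea of placing data according to each node's computing ratio can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name allocate_blocks, the node names, and the ratio values are hypothetical, and the largest-remainder rounding is an assumption made here so that the per-node block counts add up to the total.

```python
from fractions import Fraction


def allocate_blocks(total_blocks, computing_ratios):
    """Split total_blocks across nodes in proportion to their computing ratios.

    computing_ratios maps a (hypothetical) node name to a positive number
    describing its relative processing capability.
    """
    if total_blocks < 0:
        raise ValueError("total_blocks must be non-negative")
    if not computing_ratios or any(r <= 0 for r in computing_ratios.values()):
        raise ValueError("every node needs a positive computing ratio")

    total_ratio = sum(Fraction(r) for r in computing_ratios.values())

    # Exact (possibly fractional) share of blocks for each node.
    exact = {
        node: Fraction(r) * total_blocks / total_ratio
        for node, r in computing_ratios.items()
    }

    # Start from the whole-block part of each share (shares are non-negative,
    # so int() truncation is the same as rounding down).
    allocation = {node: int(share) for node, share in exact.items()}

    # Hand the remaining blocks to the nodes with the largest fractional parts
    # (assumed rounding rule, so the counts sum to total_blocks).
    leftover = total_blocks - sum(allocation.values())
    by_remainder = sorted(
        exact, key=lambda node: exact[node] - allocation[node], reverse=True
    )
    for node in by_remainder[:leftover]:
        allocation[node] += 1

    return allocation


if __name__ == "__main__":
    # Hypothetical cluster: one node is three times as capable as the slowest.
    print(allocate_blocks(12, {"fast": 3, "medium": 2, "slow": 1}))
```

With these assumed ratios of 3:2:1, the 12 blocks are split so that the most capable node holds half of the data, which is the kind of capability-proportional placement the abstract argues for.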
Keywords :
cloud computing; data handling; distributed databases; network operating systems; parallel processing; scheduling; Hadoop cluster nodes; Hadoop data nodes; Hadoop data processing; Hadoop distributed file system; Map phase; application computation logic; control point; data nodes; data placement; data-distribution technique; data-intensive MapReduce applications; data-locality awareness; heterogeneous cloud environments; name node; public clouds; task scheduling techniques; Bandwidth; Cloud computing; Data processing; Distributed databases; Processor scheduling; Random access memory; Time factors; Cloud Computing; Data Locality; Data Placement; Distributed Computing; Hadoop; MapReduce;
Conference_Title :
Complex, Intelligent, and Software Intensive Systems (CISIS), 2015 Ninth International Conference on
Conference_Location :
Blumenau
Print_ISBN :
978-1-4799-8869-3
DOI :
10.1109/CISIS.2015.37