Author :
Elteir, Marwa ; Lin, Heshan ; Feng, Wu-chun
Abstract :
The MapReduce programming model simplifies large-scale data processing on commodity clusters by having users specify a map function that processes input key/value pairs to generate intermediate key/value pairs, and a reduce function that merges and converts intermediate key/value pairs into final results. Typical MapReduce implementations such as Hadoop enforce barrier synchronization between the map and reduce phases, i.e., the reduce phase does not start until all map tasks are finished. This synchronization requirement can cause inefficient utilization of computing resources and can adversely impact performance. We therefore present and evaluate two approaches to cope with the synchronization drawback of existing MapReduce implementations. The first approach, hierarchical reduction, starts a reduce task as soon as a predefined number of map tasks complete; it then aggregates the results of the different reduce tasks following a tree structure. The second approach, incremental reduction, starts a predefined number of reduce tasks from the beginning and has each reduce task incrementally reduce records collected from map tasks. Together with our performance modeling, we evaluate the different reduction approaches with two real applications on a 32-node cluster. The experimental results show that incremental reduction outperforms hierarchical reduction in general. Also, incremental reduction can speed up the original Hadoop implementation by up to 35.33% for the word count application and 57.98% for the grep application. In addition, incremental reduction outperforms the original Hadoop in an emulated cloud environment with heterogeneous compute nodes.
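The three reduction strategies named in the abstract can be contrasted with a minimal single-process sketch for word count. This is an illustration only, not the authors' implementation and not the Hadoop API: all function names (`map_word_count`, `merge`, `barrier_reduce`, `incremental_reduce`, `hierarchical_reduce`) and the `fan_in` parameter are hypothetical, and map outputs are modeled as a Python iterator rather than as tasks running on a cluster.

```python
from collections import Counter
from itertools import islice


def map_word_count(chunk):
    """Hypothetical map task: emit word counts for one input chunk."""
    return Counter(chunk.split())


def merge(counters):
    """Hypothetical reduce step: sum a group of partial word counts."""
    total = Counter()
    for counts in counters:
        total.update(counts)
    return total


def barrier_reduce(map_outputs):
    """Baseline: collect every map output first (the barrier), then reduce once."""
    finished = list(map_outputs)  # nothing is reduced until all maps are done
    return merge(finished)


def incremental_reduce(map_outputs):
    """Fold each map output into the running result as soon as it arrives."""
    total = Counter()
    for counts in map_outputs:  # consumed lazily, one map output at a time
        total.update(counts)
    return total


def hierarchical_reduce(map_outputs, fan_in=2):
    """Reduce every `fan_in` map outputs, then merge partial results as a tree."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    outputs = iter(map_outputs)
    partials = []
    while True:
        group = list(islice(outputs, fan_in))
        if not group:
            break
        # A leaf reduce runs once `fan_in` map outputs are available.
        partials.append(merge(group))
    # Upper tree levels: keep merging groups of partial results.
    while len(partials) > 1:
        partials = [merge(partials[i:i + fan_in])
                    for i in range(0, len(partials), fan_in)]
    return partials[0] if partials else Counter()


chunks = ["a b a", "b c", "a c c", "d"]
barrier_result = barrier_reduce(map_word_count(c) for c in chunks)
incremental_result = incremental_reduce(map_word_count(c) for c in chunks)
hierarchical_result = hierarchical_reduce(map_word_count(c) for c in chunks)
```

All three functions produce the same counts; they differ only in when reduction work can begin relative to the map outputs, which is the aspect the abstract evaluates. The sketch does not model timing, parallelism, or node heterogeneity.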
Keywords :
cloud computing; resource allocation; software performance evaluation; workstation clusters; Hadoop enforce barrier synchronization; MapReduce programming model; asynchronous data processing; commodity cluster; computing resource utilization; distributed computing; emulated cloud environment; grep application; hierarchical reduction; incremental reduction; large-scale data processing; performance modeling; tree structure; Asynchronous processing; Hadoop; MapReduce