Title :
Multi-file queries performance improvement through data placement in Hadoop
Author :
Yu Tang ; Abdulhay, E. ; Aihua Fan ; Sheng Su ; Gebreselassie, K.
Author_Institution :
Univ. of Electron. Sci. & Technol. of China, Chengdu, China
Abstract :
Hadoop is popular for processing data-intensive jobs because of its data locality feature. However, the performance gained from this feature is currently limited by Hadoop's default block placement policy, which implicitly assumes that instances of MapReduce jobs access data from a single file. In contrast, multi-file queries such as indexing or aggregation queries need to process related data from more than one file, stored on different DataNodes of a cluster. In this paper we propose a Correlation-based Block Placement (CBP) algorithm that enhances the performance of these queries by placing related blocks on the same set of DataNodes. Furthermore, we developed a customized InputFormat that enables InputSplits to contain records from different files. Simulation results demonstrated that the number of migrating data blocks for CBP was insignificant, whereas under the default policy the number of migrating data blocks increased significantly with the input dataset size. As a result, for every input dataset size tested, the performance of CBP exceeded that of the default policy.
Keywords :
distributed processing; query processing; CBP; Hadoop; aggregation query; block placement policy; correlation-based block placement algorithm; data locality feature; data placement; data-intensive job processing; indexing query; multifile queries performance improvement; multifile query; Block Placement; Correlation; Data locality; HDFS;
Conference_Title :
Computer Science and Network Technology (ICCSNT), 2012 2nd International Conference on
Conference_Location :
Changchun
Print_ISBN :
978-1-4673-2963-7
DOI :
10.1109/ICCSNT.2012.6526092