Title :
A data reusing strategy based on Hive
Author :
Heng Xie ; Mei Wang ; Jiajin Le
Author_Institution :
Sch. of Comput. Sci. & Technol., DongHua Univ., Shanghai, China
Abstract :
Large-scale data processing has become an important research issue. Reusing calculation results can greatly improve its efficiency. This paper proposes an efficient data reusing strategy based on the data warehouse tool Hive, which runs on the MapReduce framework. Since the jobs in a MapReduce workflow already store their intermediate calculation results in the DFS, the key issue is how to find the reusable information. We address this problem in two steps. First, we define a joint object to organize and store the features of intermediate calculation results. Then, based on joint objects, we provide an algorithm to match stored results and generate the reuse plan. We also provide a way to obtain the best reuse strategy when more than one calculation result can be used. We conduct experiments on the TPC-H and SSB benchmarks. The experimental results demonstrate that our strategy significantly improves the efficiency of large-scale data processing and has little effect on queries executed for the first time.
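The two steps described in the abstract (recording features of stored intermediate results, then matching them against a new query and picking the best candidate) can be illustrated with a minimal sketch. This is not the paper's algorithm: the record fields, the subset-based matching rule, the ranking heuristic, and all names (`StoredResult`, `Query`, `is_reusable`, `choose_best`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class StoredResult:
    """Hypothetical feature record for one intermediate result kept in the DFS."""
    path: str                    # where the result is stored (illustrative)
    tables: FrozenSet[str]       # base tables already joined
    join_conds: FrozenSet[str]   # join predicates already applied
    filters: FrozenSet[str]      # filter predicates already applied
    columns: FrozenSet[str]      # qualified output columns, e.g. "orders.o_custkey"
    size_bytes: int


@dataclass(frozen=True)
class Query:
    """Hypothetical summary of a new query, using the same feature vocabulary."""
    tables: FrozenSet[str]
    join_conds: FrozenSet[str]
    filters: FrozenSet[str]
    columns: FrozenSet[str]


def is_reusable(result: StoredResult, query: Query) -> bool:
    """Assumed matching rule: the stored result did no work the query does not ask for,
    and it still exposes every query column that comes from the tables it covers."""
    needed = {c for c in query.columns if c.split(".", 1)[0] in result.tables}
    return (
        result.tables <= query.tables
        and result.join_conds <= query.join_conds
        and result.filters <= query.filters
        and needed <= result.columns
    )


def choose_best(stored: Iterable[StoredResult], query: Query) -> Optional[StoredResult]:
    """Assumed ranking: cover the most tables, then the most predicates, then smallest size."""
    candidates = [r for r in stored if is_reusable(r, query)]
    return max(
        candidates,
        key=lambda r: (len(r.tables), len(r.join_conds) + len(r.filters), -r.size_bytes),
        default=None,
    )
```

Treating predicates as opaque strings keeps the sketch short; a real matcher would need to compare predicates semantically (for example, recognizing that one filter range contains another) rather than by exact text.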
Keywords :
data handling; data warehouses; Hive; MapReduce framework; MapReduce workflow; SSB benchmarks; TPC-H benchmarks; data reusing strategy; data warehouse tool; joint objects; large scale data process; reuse plan; Amplitude modulation; Benchmark testing; Computational modeling; Data models; Educational institutions; Finite element analysis; Joints; MapReduce; calculation results reuse; join-object
Conference_Title :
Data Science and Advanced Analytics (DSAA), 2014 International Conference on
DOI :
10.1109/DSAA.2014.7058098