Title :
Elephant, Do Not Forget Everything! Efficient Processing of Growing Datasets
Author :
Schad, Jörg ; Quiané-Ruiz, Jorge-Arnulfo ; Dittrich, Jens
Author_Institution :
Inf. Syst. Group, Saarland Univ., Saarbrücken, Germany
Date :
June 28, 2013 - July 3, 2013
Abstract :
MapReduce has become quite popular for analyzing very large datasets. Nevertheless, users typically have to rerun their MapReduce jobs over the entire dataset every time new records are appended to it. Some researchers have proposed reusing the intermediate data produced by previous MapReduce jobs. However, existing works still have to read the whole dataset in order to identify which parts of it changed. Furthermore, storing intermediate results is not suitable in some cases, because it can lead to very high storage overhead. In this paper, we propose Itchy, a MapReduce-based system that employs a set of different techniques to efficiently deal with growing datasets. Itchy uses an optimizer to automatically choose the right technique for processing a MapReduce job. The beauty of Itchy is that it does not have to read the whole dataset again to deal with new records. In more detail, Itchy keeps track of the provenance of intermediate results in order to selectively recompute intermediate results as required. However, if intermediate results are small or the computational cost of map functions is high, Itchy can automatically store the intermediate results themselves rather than the provenance information. Additionally, Itchy supports directly merging the outputs of several jobs in cases where MapReduce jobs allow for such processing. We evaluate Itchy using two different benchmarks and compare it with Hadoop and Incoop. The results show the superiority of Itchy over both baseline systems for processing incremental jobs. In terms of job runtime, Itchy is more than one order of magnitude faster than Hadoop (up to ~41 times faster) and Incoop (up to ~11 times faster).
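The abstract describes Itchy's core idea only at a high level. The following plain-Java sketch illustrates what provenance-based selective recomputation can look like for a word-count-style job on a growing dataset; it is not Itchy's actual implementation, and the class IncrementalWordCount and its methods are hypothetical names introduced here purely for illustration.

```java
// Minimal sketch (hypothetical, not Itchy's code): provenance-based
// incremental recomputation for a word-count-style MapReduce-like job.
import java.util.*;

public class IncrementalWordCount {
    // Provenance: intermediate key (word) -> offsets of input records that produced it.
    private final Map<String, Set<Integer>> provenance = new HashMap<>();
    // Materialized output: word -> current count.
    private final Map<String, Integer> output = new HashMap<>();
    private final List<String> dataset = new ArrayList<>();

    // Map phase over a single record: record which intermediate keys it contributes to.
    private void mapRecord(int offset, String record) {
        for (String word : record.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) continue;
            provenance.computeIfAbsent(word, k -> new HashSet<>()).add(offset);
        }
    }

    // Reduce phase for one intermediate key: recompute its aggregate by re-reading
    // only the records its provenance points to, instead of the whole dataset.
    private void reduceKey(String word) {
        int count = 0;
        for (int offset : provenance.getOrDefault(word, Set.of())) {
            for (String w : dataset.get(offset).toLowerCase().split("\\s+")) {
                if (w.equals(word)) count++;
            }
        }
        output.put(word, count);
    }

    // Append new records and recompute only the intermediate keys they touch.
    public void append(List<String> newRecords) {
        Set<String> affectedKeys = new HashSet<>();
        for (String record : newRecords) {
            int offset = dataset.size();
            dataset.add(record);
            mapRecord(offset, record);
            for (String word : record.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) affectedKeys.add(word);
            }
        }
        for (String key : affectedKeys) reduceKey(key);  // selective recomputation
    }

    public static void main(String[] args) {
        IncrementalWordCount job = new IncrementalWordCount();
        job.append(List.of("elephants never forget", "hadoop is an elephant"));
        job.append(List.of("growing datasets keep growing"));  // incremental run
        System.out.println(job.output);
    }
}
```

In this toy version, keeping the provenance map (word to record offsets) trades storage for recomputation cost, which mirrors the trade-off the abstract describes: when intermediate results are small or map functions are expensive, it can be cheaper to materialize the intermediate results instead of the provenance.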
Keywords :
data analysis; merging; parallel processing; Hadoop; Incoop; Itchy; MapReduce job; MapReduce-based system; computational cost; incremental job processing; job runtime; large scale data processing; map functions; optimizer; provenance information; very large datasets; Google; Indexes; Maintenance engineering; Runtime; Standards; Big Data; HDFS; Incremental; MapReduce;
Conference_Title :
2013 IEEE Sixth International Conference on Cloud Computing (CLOUD)
Conference_Location :
Santa Clara, CA
Print_ISBN :
978-0-7695-5028-2
DOI :
10.1109/CLOUD.2013.67