Title :
Elephant, Do Not Forget Everything! Efficient Processing of Growing Datasets
Author :
Schad, Jörg ; Quiané-Ruiz, Jorge-Arnulfo ; Dittrich, Jens
Author_Institution :
Inf. Syst. Group, Saarland Univ., Saarbrücken, Germany
Date :
June 28, 2013 - July 3, 2013
Abstract :
MapReduce has become quite popular for analyzing very large datasets. Nevertheless, users typically have to rerun their MapReduce jobs over the entire dataset every time new records are appended to it. Some researchers have proposed reusing the intermediate data produced by previous MapReduce jobs. However, existing works still have to read the whole dataset in order to identify which parts of it changed. Furthermore, storing intermediate results is not suitable in some cases, because it can lead to very high storage overhead. In this paper, we propose Itchy, a MapReduce-based system that employs a set of different techniques to efficiently deal with growing datasets. Itchy uses an optimizer to automatically choose the right technique for processing a MapReduce job. The beauty of Itchy is that it does not have to read the whole dataset again to deal with new records. In more detail, Itchy keeps track of the provenance of intermediate results in order to selectively recompute intermediate results as required. However, if intermediate results are small or the computational cost of map functions is high, Itchy can automatically store the intermediate results themselves rather than the provenance information. Additionally, Itchy supports directly merging the outputs of several jobs in cases where MapReduce jobs allow for such processing. We evaluate Itchy using two different benchmarks and compare it with Hadoop and Incoop. The results show the superiority of Itchy over both baseline systems for processing incremental jobs. In terms of job runtime, Itchy is more than one order of magnitude faster than Hadoop (up to ~41 times faster) and Incoop (up to ~11 times faster).
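The abstract describes Itchy's core idea only at a high level. The following plain-Java sketch illustrates what provenance-based selective recomputation can look like for a word-count-style job on a growing dataset; it is not Itchy's actual implementation, and the class IncrementalWordCount and its methods are hypothetical names introduced here purely for illustration.

```java
// Minimal sketch (hypothetical, not Itchy's code): provenance-based
// incremental recomputation for a word-count-style MapReduce-like job.
import java.util.*;

public class IncrementalWordCount {
    // Provenance: intermediate key (word) -> offsets of input records that produced it.
    private final Map<String, Set<Integer>> provenance = new HashMap<>();
    // Materialized output: word -> current count.
    private final Map<String, Integer> output = new HashMap<>();
    private final List<String> dataset = new ArrayList<>();

    // Map phase over a single record: record which intermediate keys it contributes to.
    private void mapRecord(int offset, String record) {
        for (String word : record.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) continue;
            provenance.computeIfAbsent(word, k -> new HashSet<>()).add(offset);
        }
    }

    // Reduce phase for one intermediate key: recompute its aggregate by re-reading
    // only the records its provenance points to, instead of the whole dataset.
    private void reduceKey(String word) {
        int count = 0;
        for (int offset : provenance.getOrDefault(word, Set.of())) {
            for (String w : dataset.get(offset).toLowerCase().split("\\s+")) {
                if (w.equals(word)) count++;
            }
        }
        output.put(word, count);
    }

    // Append new records and recompute only the intermediate keys they touch.
    public void append(List<String> newRecords) {
        Set<String> affectedKeys = new HashSet<>();
        for (String record : newRecords) {
            int offset = dataset.size();
            dataset.add(record);
            mapRecord(offset, record);
            for (String word : record.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) affectedKeys.add(word);
            }
        }
        for (String key : affectedKeys) reduceKey(key);  // selective recomputation
    }

    public static void main(String[] args) {
        IncrementalWordCount job = new IncrementalWordCount();
        job.append(List.of("elephants never forget", "hadoop is an elephant"));
        job.append(List.of("growing datasets keep growing"));  // incremental run
        System.out.println(job.output);
    }
}
```

In this toy version, keeping the provenance map (word to record offsets) trades storage for recomputation cost, which mirrors the trade-off the abstract describes: when intermediate results are small or map functions are expensive, it can be cheaper to materialize the intermediate results instead of the provenance.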
Keywords :
data analysis; merging; parallel processing; Hadoop; Incoop; Itchy; MapReduce job; MapReduce-based system; computational cost; incremental job processing; job runtime; large scale data processing; map functions; optimizer; provenance information; very large datasets; Google; Indexes; Maintenance engineering; Runtime; Standards; Big Data; HDFS; Incremental; MapReduce;
Conference_Title :
2013 IEEE Sixth International Conference on Cloud Computing (CLOUD)
Conference_Location :
Santa Clara, CA
Print_ISBN :
978-0-7695-5028-2
DOI :
10.1109/CLOUD.2013.67