DocumentCode
650606
Title
Elephant, Do Not Forget Everything! Efficient Processing of Growing Datasets
Author
Schad, Jorg ; Quiané-Ruiz, Jorge-Arnulfo ; Dittrich, J.
Author_Institution
Inf. Syst. Group, Saarland Univ., Saarbrücken, Germany
fYear
2013
fDate
June 28 2013-July 3 2013
Firstpage
252
Lastpage
259
Abstract
MapReduce has become quite popular for analysing very large datasets. Nevertheless, users typically have to run their MapReduce jobs over the whole dataset every time new records are appended to it. Some researchers have proposed reusing the intermediate data produced by previous MapReduce jobs. However, existing works still have to read the whole dataset in order to identify which parts of it changed. Furthermore, storing intermediate results is not suitable in some cases, because it can lead to a very high storage overhead. In this paper, we propose Itchy, a MapReduce-based system that employs a set of different techniques to deal efficiently with growing datasets. Itchy uses an optimizer to automatically choose the right technique to process a MapReduce job. The beauty of Itchy is that it does not have to read the whole dataset again to deal with new records. In more detail, Itchy keeps track of the provenance of intermediate results in order to selectively recompute them as required. However, if intermediate results are small or the computational cost of map functions is high, Itchy can automatically start storing intermediate results rather than the provenance information. Itchy also supports directly merging the outputs of several jobs in cases where the MapReduce jobs allow for this kind of processing. We evaluate Itchy using two different benchmarks and compare it with Hadoop and Incoop. The results show the superiority of Itchy over both baseline systems for processing incremental jobs. In terms of job runtime, Itchy is more than one order of magnitude faster than Hadoop (up to ~41 times faster) and Incoop (up to ~11 times faster).
Keywords
data analysis; merging; parallel processing; Hadoop; Incoop; Itchy; MapReduce job; MapReduce-based system; computational cost; incremental job processing; job runtime; large scale data processing; map functions; optimizer; provenance information; very large datasets; Google; Indexes; Maintenance engineering; Runtime; Standards; Big Data; HDFS; Incremental; MapReduce
fLanguage
English
Publisher
IEEE
Conference_Titel
Cloud Computing (CLOUD), 2013 IEEE Sixth International Conference on
Conference_Location
Santa Clara, CA
Print_ISBN
978-0-7695-5028-2
Type
conf
DOI
10.1109/CLOUD.2013.67
Filename
6676702