• DocumentCode
    650606
  • Title

    Elephant, Do Not Forget Everything! Efficient Processing of Growing Datasets

  • Author

    Schad, Jörg ; Quiané-Ruiz, Jorge-Arnulfo ; Dittrich, J.

  • Author_Institution
    Inf. Syst. Group, Saarland Univ., Saarbrücken, Germany
  • fYear
    2013
  • fDate
    June 28 2013-July 3 2013
  • Firstpage
    252
  • Lastpage
    259
  • Abstract
    MapReduce has become quite popular for analysing very large datasets. Nevertheless, users typically have to rerun their MapReduce jobs over the whole dataset every time new records are appended to it. Some researchers have proposed reusing the intermediate data produced by previous MapReduce jobs. However, existing works still have to read the whole dataset in order to identify which parts of it changed. Furthermore, storing intermediate results is not suitable in some cases, because it can lead to a very high storage overhead. In this paper, we propose Itchy, a MapReduce-based system that employs a set of different techniques to deal efficiently with growing datasets. Itchy uses an optimizer to automatically choose the right technique to process a MapReduce job. The beauty of Itchy is that it does not have to read the whole dataset again to deal with new records. In more detail, Itchy keeps track of the provenance of intermediate results in order to selectively recompute intermediate results as required. However, if intermediate results are small or the computational cost of map functions is high, Itchy can automatically start storing intermediate results rather than the provenance information. Additionally, Itchy supports the option of directly merging outputs from several jobs in cases where MapReduce jobs allow for this kind of processing. We evaluate Itchy using two different benchmarks and compare it with Hadoop and Incoop. The results show the superiority of Itchy over both baseline systems for processing incremental jobs. In terms of job runtime, Itchy is more than one order of magnitude faster than Hadoop (up to ~41 times faster) and Incoop (up to ~11 times faster).
  • Keywords
    data analysis; merging; parallel processing; Hadoop; Incoop; Itchy; MapReduce job; MapReduce-based system; computational cost; incremental job processing; job runtime; large scale data processing; map functions; merging; optimizer; provenance information; very large datasets; Google; Indexes; Maintenance engineering; Merging; Runtime; Standards; Big Data; HDFS; Hadoop; Incremental; MapReduce;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cloud Computing (CLOUD), 2013 IEEE Sixth International Conference on
  • Conference_Location
    Santa Clara, CA
  • Print_ISBN
    978-0-7695-5028-2
  • Type

    conf

  • DOI
    10.1109/CLOUD.2013.67
  • Filename
    6676702