• DocumentCode
    244104
  • Title
    Managing Tiny Tasks for Data-Parallel, Subsampling Workloads
  • Author
    Kambhampati, S. ; Kelley, Jaimie ; Stewart, Craig ; Stewart, William C. L. ; Ramnath, Rajiv
  • fYear
    2014
  • fDate
    11-14 March 2014
  • Firstpage
    225
  • Lastpage
    234
  • Abstract
    Subsampling workloads compute statistics from a set of observed samples using a random subset of sample data (i.e., a subsample). Data-parallel platforms group these samples into tasks, and each task subsamples its data in parallel. In this paper, we study subsampling workloads that benefit from tiny tasks, i.e., tasks comprising few samples. Tiny tasks reduce processor cache misses caused by random subsampling, which speeds up per-task running time. However, they can also cause significant scheduling overheads that negate the time reduction from reduced cache misses. For example, vanilla Hadoop takes longer to start tiny tasks than to run them. We compared the task scheduling overheads of vanilla Hadoop, lightweight Hadoop setups, and BashReduce. BashReduce, the best platform, outperformed the worst by 3.6X, but scheduling overhead was still 12% of a task's running time. We improved BashReduce's scheduler by allowing it to size tasks according to kneepoints on the miss rate curve. We tested these changes on high-throughput genotype data and on data obtained from Netflix. Our improved BashReduce outperformed vanilla Hadoop by almost 3X and completed short, interactive jobs almost as efficiently as long jobs. These results held at scale and across diverse, heterogeneous hardware.
  • Keywords
    cache storage; parallel processing; scheduling; statistics; BashReduce scheduler; Netflix; data-parallel platform; high-throughput genotype data; lightweight Hadoop setups; miss rate curve; processor cache misses reduction; subsampling workloads; task scheduling overheads; tiny task management; vanilla Hadoop; Benchmark testing; Bioinformatics; Delays; Genomics; Monitoring; Runtime; Software
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cloud Engineering (IC2E), 2014 IEEE International Conference on
  • Conference_Location
    Boston, MA
  • Type
    conf
  • DOI
    10.1109/IC2E.2014.94
  • Filename
    6903477