• DocumentCode
    3145788
  • Title
    A Hybrid Scheduling Algorithm for Data Intensive Workloads in a MapReduce Environment

  • Author
    Phuong Nguyen ; Simon, Thierry ; Halem, Milton ; Chapman, D. ; Le, Quang

  • Author_Institution
    Dept. Comput. Sci. & Electr. Eng., Univ. of Maryland Baltimore County, Baltimore, MD, USA
  • fYear
    2012
  • fDate
    5-8 Nov. 2012
  • Firstpage
    161
  • Lastpage
    167
  • Abstract
    The specific choice of workload task scheduler for Hadoop MapReduce applications can have a dramatic effect on job workload latency. The Hadoop Fair Scheduler (FairS) assigns resources to jobs such that all jobs get, on average, an equal share of resources over time. It thus addresses the problem with a FIFO scheduler, where short jobs must wait for long-running jobs to complete. We show that even under the FairS, jobs are still forced to wait significantly when the MapReduce system assigns equal shares of resources, owing to dependencies between the Map, Shuffle, Sort, and Reduce phases. We propose a Hybrid Scheduler (HybS) algorithm based on dynamic priority to reduce latency for variable-length concurrent jobs while maintaining data locality. The dynamic priorities accommodate multiple task lengths, job sizes, and job waiting times by applying a greedy fractional knapsack algorithm to job task processor assignment. The estimated runtimes of Map and Reduce tasks are provided to the HybS dynamic priorities from the historical Hadoop log files. In addition to dynamic priority, we implement a reordering of task processor assignments to account for data availability, automatically preserving the benefits of data locality in this environment. We evaluate our approach by developing a simulator and by running concurrent workloads consisting of the Word-count and Terasort benchmarks and a satellite scientific data processing workload. Our evaluation shows the HybS system improves the average response time for these workloads by approximately 2.1x over the Hadoop FairS, with a standard deviation of 1.4x.
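
    The greedy fractional-knapsack step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the priority formula, class fields, and function names are assumptions; the fractional aspect appears here as taking only part of a job's pending tasks when free slots run out.

    ```python
    # Illustrative sketch of a greedy fractional-knapsack-style assignment of
    # job tasks to a limited pool of task slots. All names and the priority
    # heuristic are hypothetical, not taken from the HybS paper.
    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        pending_tasks: int       # tasks still waiting for a slot
        est_task_runtime: float  # estimated per-task runtime (e.g. from logs)
        wait_time: float         # how long the job has been waiting

    def priority(job: Job) -> float:
        # Hypothetical priority: shorter tasks and longer-waiting jobs rank higher.
        return (1.0 + job.wait_time) / job.est_task_runtime

    def assign_slots(jobs: list[Job], free_slots: int) -> dict[str, int]:
        """Greedy knapsack step: fill slots in descending priority order,
        taking a partial ("fractional") allocation from the last job that fits."""
        allocation: dict[str, int] = {}
        for job in sorted(jobs, key=priority, reverse=True):
            if free_slots <= 0:
                break
            take = min(job.pending_tasks, free_slots)
            allocation[job.name] = take
            free_slots -= take
        return allocation

    jobs = [
        Job("wordcount", pending_tasks=4, est_task_runtime=10.0, wait_time=30.0),
        Job("terasort", pending_tasks=8, est_task_runtime=60.0, wait_time=5.0),
    ]
    print(assign_slots(jobs, free_slots=6))
    ```

    With these numbers, the short-task, long-waiting `wordcount` job receives all four of its pending slots first, and `terasort` takes the remaining two.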
  • Keywords
    data handling; greedy algorithms; knapsack problems; processor scheduling; FIFO scheduler; FairS; Hadoop MapReduce applications; Hadoop fair scheduler; Hadoop log files; HybS system; MapReduce environment; MapReduce system; Terasort benchmark; Word-count benchmark; data availability; data intensive workloads; data locality maintenance; equal resource sharing; greedy fractional knapsack algorithm; hybrid scheduler algorithm; hybrid scheduling algorithm; job sizes; job task processor assignment reordering; job waiting times; job workload latency; map phases; multiple task lengths; reduce phases; satellite scientific data processing workload; shuffle phases; sort phases; variable length concurrent job latency reduction; workload task schedulers; Benchmark testing; Dynamic scheduling; Heuristic algorithms; Runtime; Scheduling algorithms; Time factors; Hadoop; MapReduce; Scheduler; dynamic priority; scheduling; workflow;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Utility and Cloud Computing (UCC), 2012 IEEE Fifth International Conference on
  • Conference_Location
    Chicago, IL
  • Print_ISBN
    978-1-4673-4432-6
  • Type
    conf
  • DOI
    10.1109/UCC.2012.32
  • Filename
    6424941