• DocumentCode
    2457010
  • Title

    Load Balancing in MapReduce Based on Scalable Cardinality Estimates

  • Author

    Gufler, Benjamin ; Augsten, Nikolaus ; Reiser, Angelika ; Kemper, Alfons

  • Author_Institution
    Tech. Univ. Munchen, Garching bei Munchen, Germany
  • fYear
    2012
  • fDate
    1-5 April 2012
  • Firstpage
    522
  • Lastpage
    533
  • Abstract
    MapReduce has emerged as a popular tool for distributed and scalable processing of massive data sets and is being used increasingly in e-science applications. Unfortunately, the performance of MapReduce systems strongly depends on an even data distribution while scientific data sets are often highly skewed. The resulting load imbalance, which raises the processing time, is even amplified by high runtime complexity of the reducer tasks. An adaptive load balancing strategy is required for appropriate skew handling. In this paper, we address the problem of estimating the cost of the tasks that are distributed to the reducers based on a given cost model. An accurate cost estimation is the basis for adaptive load balancing algorithms and requires to gather statistics from the mappers. This is challenging: (a) Since the statistics from all mappers must be integrated, the mapper statistics must be small. (b) Although each mapper sees only a small fraction of the data, the integrated statistics must capture the global data distribution. (c) The mappers terminate after sending the statistics to the controller, and no second round is possible. Our solution to these challenges consists of two components. First, a monitoring component executed on every mapper captures the local data distribution and identifies its most relevant subset for cost estimation. Second, an integration component aggregates these subsets approximating the global data distribution.
  • Keywords
    cloud computing; computational complexity; distributed databases; natural sciences computing; resource allocation; statistical distributions; MapReduce; adaptive load balancing strategy; batch-style job processing; cloud computing environments; cost estimation; distributed databases; e-science applications; global data distribution; integration component; load imbalance; local data distribution; mapper statistics; massive data set distributed processing; massive data set scalable processing; processing time; runtime complexity; scalable cardinality estimates; scientific data sets; skew handling; Approximation methods; Clustering algorithms; Estimation; Histograms; Load management; Monitoring; Partitioning algorithms;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Engineering (ICDE), 2012 IEEE 28th International Conference on
  • Conference_Location
    Washington, DC
  • ISSN
    1063-6382
  • Print_ISBN
    978-1-4673-0042-1
  • Type

    conf

  • DOI
    10.1109/ICDE.2012.58
  • Filename
    6228111