• DocumentCode
    170756
  • Title

    Online load balancing for MapReduce with skewed data input

  • Author

    Yanfang Le ; Jiangchuan Liu ; Funda Ergun ; Dan Wang

  • Author_Institution
    Simon Fraser Univ., Burnaby, BC, Canada
  • fYear
    2014
  • fDate
    April 27 2014-May 2 2014
  • Firstpage
    2004
  • Lastpage
    2012
  • Abstract
    MapReduce has emerged as a powerful tool for distributed and scalable processing of voluminous data. In this paper, we examine, for the first time, the problem of accommodating data skew in MapReduce with online operations. Unlike earlier heuristics, which act in the very late reduce stage or only after seeing all the data, we address the skew from the beginning of data input, assume no a priori knowledge of the data distribution, and require no synchronized operations. We examine the input in a continuous fashion and adaptively assign tasks with a load-balanced strategy. We show that the optimal strategy is a constrained version of online minimum makespan and that, in the MapReduce context where pairs with identical keys must be scheduled to the same machine, there is an online algorithm with a provable competitive ratio of 2. We further suggest a sample-based enhancement that, probabilistically, achieves a competitive ratio of 3/2 with a bounded error.
  • Keywords
    distributed processing; resource allocation; MapReduce; bounded error; data distribution; load-balanced strategy; online load balancing; online minimum makespan; online operations; provable 2-competitive ratio; sample-based enhancement; skewed data input; voluminous data; Computational modeling; Computers; Conferences; Distributed databases; Educational institutions; Frequency estimation; Load management
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    INFOCOM, 2014 Proceedings IEEE
  • Conference_Location
    Toronto, ON
  • Type

    conf

  • DOI
    10.1109/INFOCOM.2014.6848141
  • Filename
    6848141