• DocumentCode
    253244
  • Title

    Data locality in MapReduce: A network perspective

  • Author

    Weina Wang ; Lei Ying

  • Author_Institution
    Sch. of Electr., Comput. & Energy Eng., Arizona State Univ., Tempe, AZ, USA
  • fYear
    2014
  • fDate
    Sept. 30 2014-Oct. 3 2014
  • Firstpage
    1110
  • Lastpage
    1117
  • Abstract
    In MapReduce, placing computation near its input data is considered desirable, since otherwise the data transmission introduces an additional delay to the task execution. This data locality problem has been studied in the literature. Most existing scheduling algorithms in MapReduce focus on improving performance through increasing locality. In this paper, we view the data locality problem from a network perspective. The key observation is that if we make appropriate use of the network to route the data chunk in advance to the machine where it will be processed, then processing a remote task is the same as processing a local task. In other words, instead of bringing computation close to data, we can also bring data close to computation to improve the system performance. However, to benefit from such a strategy, we must (i) balance the tasks assigned to local machines and those assigned to remote machines, and (ii) design the routing algorithm to avoid network congestion. Taking these challenges into consideration, we propose a scheduling/routing algorithm, named the Joint Scheduler, which utilizes both the computing resources and the communication network efficiently. To show that the Joint Scheduler has superior performance, we prove that the Joint Scheduler can support any load that can be supported by some other algorithm, i.e., it achieves the maximum capacity region. Simulation results demonstrate that with popularity skew, the Joint Scheduler improves the throughput significantly (more than 30% in our simulations) compared to the Hadoop Fair Scheduler with delay scheduling, which is the de facto industry standard. The delay performance is also evaluated through simulations, which show a significant delay reduction under the Joint Scheduler with moderate to heavy load.
  • Keywords
    data analysis; parallel processing; scheduling; telecommunication network routing; workstation clusters; Hadoop Fair Scheduler; Joint Scheduler; MapReduce computing cluster; data locality problem; data transmission; delay scheduling; popularity skew; scheduling-routing algorithm; Bandwidth; Communication networks; Joints; Processor scheduling; Routing; Scheduling; Switches
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Communication, Control, and Computing (Allerton), 2014 52nd Annual Allerton Conference on
  • Conference_Location
    Monticello, IL
  • Type

    conf

  • DOI
    10.1109/ALLERTON.2014.7028579
  • Filename
    7028579