• DocumentCode
    2730490
  • Title
    A data locality optimization algorithm for large-scale data processing in Hadoop
  • Author
    Zhao, Yanrong ; Wang, Weiping ; Meng, Dan ; Yang, Xiufeng ; Zhang, Shubin ; Li, Jun ; Guan, Gang
  • Author_Institution
    Inst. of Comput. Technol., Grad. Univ., Beijing, China
  • fYear
    2012
  • fDate
    1-4 July 2012
  • Abstract
    Data-intensive applications are increasingly designed to execute on large computing clusters. Our previous observations of Tencent production systems indicate that the join query is one of the most important queries in large-scale data processing. When a join query runs on the Hive system, its job is divided into a map phase and a reduce phase and requires transferring large amounts of intermediate results over the network, which is inefficient. In this paper, we propose an algorithm called CHMJ, whose general idea is to take advantage of data locality to accelerate computation. It comprises four parts: a data distribution strategy, a parallel HashMapJoin algorithm, CoLocation scheduling, and a delay scheduling strategy. CHMJ has been adopted in the Tencent data warehouse and plays an important role in Tencent's daily operations. Our experiments demonstrate the feasibility and efficiency of our solution.
  • Keywords
    data handling; data warehouses; parallel processing; portals; query processing; scheduling; CHMJ algorithm; Hadoop; Hive system; Internet service portal; Tencent daily operation; Tencent data warehouse; Tencent production system; colocation scheduling; computing cluster; data distribution strategy; data locality optimization algorithm; data-intensive application; delay scheduling strategy; join query; large-scale data processing; map phase; parallel hashmapjoin algorithm; reduce phase; Algorithm design and analysis; Clustering algorithms; Data processing; Delay; Partitioning algorithms; Query processing; Scheduling; MapReduce
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computers and Communications (ISCC), 2012 IEEE Symposium on
  • Conference_Location
    Cappadocia
  • ISSN
    1530-1346
  • Print_ISBN
    978-1-4673-2712-1
  • Electronic_ISBN
    1530-1346
  • Type
    conf
  • DOI
    10.1109/ISCC.2012.6249372
  • Filename
    6249372
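
The abstract's core idea, joining data where it already resides instead of shuffling intermediate results across the network, can be illustrated with a minimal sketch. This is not the paper's CHMJ implementation: the function names (`bucket_of`, `distribute`, `local_hash_join`, `colocated_join`), the CRC32 partitioning rule, and the bucket count are all hypothetical choices made for illustration, and the CoLocation and delay scheduling parts are not modeled. The sketch assumes only that both tables are distributed by the same hash of the join key, so that matching rows end up in the same bucket and each bucket can be joined locally with an in-memory hash table.

```python
import zlib
from collections import defaultdict


def bucket_of(key, num_buckets):
    """Map a join key to a bucket. Both tables must use the same rule
    (assumption: CRC32 of the key's string form, for determinism)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def distribute(rows, key_index, num_buckets):
    """Group rows by bucket, standing in for a data distribution step
    that places equal-key rows of both tables on the same node."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key_index], num_buckets)].append(row)
    return buckets


def local_hash_join(left_rows, right_rows, left_key=0, right_key=0):
    """Join two co-located buckets with an in-memory hash table."""
    table = defaultdict(list)
    for row in right_rows:  # build side
        table[row[right_key]].append(row)
    joined = []
    for row in left_rows:  # probe side
        for match in table.get(row[left_key], []):
            joined.append(row + match)
    return joined


def colocated_join(left, right, num_buckets=4):
    """Join bucket by bucket; no row ever crosses a bucket boundary."""
    left_buckets = distribute(left, 0, num_buckets)
    right_buckets = distribute(right, 0, num_buckets)
    result = []
    for b in range(num_buckets):  # each iteration could run on its own node
        result.extend(
            local_hash_join(left_buckets.get(b, []), right_buckets.get(b, []))
        )
    return result
```

Because each bucket pair is joined independently, the loop in `colocated_join` could in principle run in parallel, one task per bucket, with each task reading only local data. The order of the joined rows depends on the bucket assignment, so callers should not rely on it.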