• DocumentCode
    587601
  • Title
    Workload characterization on a production Hadoop cluster: A case study on Taobao
  • Author
    Zujie Ren; Xianghua Xu; Jian Wan; Weisong Shi; Min Zhou
  • Author_Institution
    Sch. of Comput. Sci. & Technol., Hangzhou Dianzi Univ., Hangzhou, China
  • fYear
    2012
  • fDate
    4-6 Nov. 2012
  • Firstpage
    3
  • Lastpage
    13
  • Abstract
    MapReduce is becoming the state-of-the-art computing paradigm for processing large-scale datasets on a large cluster with tens or thousands of nodes. It has been widely used in various fields such as e-commerce, Web search, social networks, and scientific computation. Understanding the characteristics of MapReduce workloads is the key to achieving better configuration decisions and improving the system throughput. However, workload characterization of MapReduce, especially in a large-scale production environment, has not been well studied yet. To gain insight into MapReduce workloads, we collected a two-week workload trace from a 2,000-node Hadoop cluster at Taobao, which is the biggest online e-commerce enterprise in Asia, ranked 14th in the world as reported by Alexa. The workload trace covered 912,157 jobs, logged from Dec. 4 to Dec. 20, 2011. We characterized the workload at the granularity of job and task, respectively, and concluded with a set of interesting observations. The results of workload characterization are representative and generally consistent with data platforms for e-commerce websites, which can help other researchers and engineers understand the performance and job characteristics of Hadoop in their production environments. In addition, we use these job analysis statistics to derive several implications for potential performance optimization solutions.
  • Keywords
    electronic commerce; optimisation; pattern clustering; production engineering computing; social networking (online); statistical analysis; Alexa; Asia; MapReduce workloads; Taobao; Web search; configuration decisions; data platforms; e-commerce Web sites; job analysis statistics; job granularity; large-scale dataset processing; large-scale production environment; online e-commerce enterprise; performance optimization solutions; production Hadoop cluster; social networks; state-of-the-art computing paradigm; workload characterization; Educational institutions; Google; Log-normal distribution; Production; Resource management; Synchronization; Hadoop; MapReduce
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Workload Characterization (IISWC), 2012 IEEE International Symposium on
  • Conference_Location
    La Jolla, CA
  • Print_ISBN
    978-1-4673-4531-6
  • Type
    conf
  • DOI
    10.1109/IISWC.2012.6402895
  • Filename
    6402895