• DocumentCode
    3143769
  • Title
    Towards Scalable One-Pass Analytics Using MapReduce
  • Author
    Mazur, Edward ; Li, Boduo ; Diao, Yanlei ; Shenoy, Prashant
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Massachusetts, Amherst, MA, USA
  • fYear
    2011
  • fDate
    16-20 May 2011
  • Firstpage
    1102
  • Lastpage
    1111
  • Abstract
    An integral part of many data-intensive applications is the need to collect and analyze enormous datasets efficiently. Concurrent with such application needs is the increasing adoption of MapReduce as a programming model for processing large datasets using a cluster of machines. Current MapReduce systems, however, require the dataset to be loaded into the cluster before running analytical queries, and thereby incur high delays before query processing can start. Furthermore, existing systems are geared towards batch processing. In this paper, we seek to answer a fundamental question: what architectural changes are necessary to bring the benefits of the MapReduce computation model to incremental, one-pass analytics, i.e., to support stream processing and online aggregation? To answer this question, we first conduct a detailed empirical performance study of current MapReduce implementations, including Hadoop and MapReduce Online, using a variety of workloads. By doing so, we identify several drawbacks of existing systems for one-pass analytics. Based on the insights from our study, we conclude by listing key design requirements, arguing for architectural changes to MapReduce systems that overcome their current limitations and fully embrace incremental one-pass analytics, and showing promising preliminary results.
  • Keywords
    data analysis; distributed processing; query processing; Hadoop; MapReduce; batch processing; data intensive applications; scalable one pass analytics; Benchmark testing; Computational modeling; Fault tolerance; Fault tolerant systems; Load modeling; Parallel processing; Sorting
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on
  • Conference_Location
    Shanghai
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-61284-425-1
  • Electronic_ISBN
    1530-2075
  • Type
    conf
  • DOI
    10.1109/IPDPS.2011.251
  • Filename
    6008898