• DocumentCode
    2487106
  • Title
    Keynote address: Divide and Recombine: An approach for analyzing large datasets
  • Author
    Hanrahan, P.
  • fYear
    2012
  • fDate
    14-15 Oct. 2012
  • Firstpage
    1
  • Lastpage
    1
  • Abstract
    Summary form only given. Analyzing large datasets is often difficult because systems and algorithms do not scale. Even routine processing tasks are difficult to run and may take a long time. Many common analytical algorithms cannot be applied to large datasets because they are superlinear in either time or space. In this talk I will describe our approach to analyzing large datasets, which we call Divide and Recombine (D&R). D&R is built using RHIPE, a system that runs parallel R map-reduce jobs using Hadoop. We use D&R to run virtual experiments over large datasets. In a virtual experiment, we sample the data using a technique from experimental design, then analyze the results of that experiment, and finally combine all the experiments into a single result. This is joint work with Bill Cleveland.
  • Keywords
    data analysis; parallel processing; D&R datasets; Hadoop; RHIPE; analytical algorithms; divide and recombine datasets; large dataset analysis; parallel R map-reduce jobs; routine processing tasks; virtual experiments
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Large Data Analysis and Visualization (LDAV), 2012 IEEE Symposium on
  • Conference_Location
    Seattle, WA
  • Print_ISBN
    978-1-4673-4732-7
  • Type
    conf
  • DOI
    10.1109/LDAV.2012.6378969
  • Filename
    6378969