Title :
Keynote address: Divide and Recombine: An approach for analyzing large datasets
Abstract :
Summary form only given. Analyzing large datasets is often difficult because systems and algorithms do not scale. Even routine processing tasks are difficult to run and may take a long time. Many common analytical algorithms cannot be applied to large datasets because they are superlinear in time or space. In this talk I will describe our approach to analyzing large datasets, which we call Divide and Recombine (D&R). D&R is built using RHIPE, a system that runs parallel R map-reduce jobs using Hadoop. We use D&R to run virtual experiments over large datasets. In a virtual experiment, we sample the data using a technique from experimental design, analyze the results of that experiment, and finally combine all the experiments into a single result. This is joint work with Bill Cleveland.
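The divide, analyze, and recombine steps described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' implementation and does not use RHIPE or Hadoop. The abstract does not say which sampling design or recombination rule is used, so the random partition, the per-subset least-squares fit, and the averaging of coefficients below are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import random
from statistics import mean


def divide(rows, n_subsets, seed=0):
    """Randomly partition rows into n_subsets roughly equal subsets.

    Assumption: a simple random partition stands in for the sampling
    design mentioned in the abstract.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def fit_line(subset):
    """Ordinary least-squares fit of y = intercept + slope * x on one subset."""
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in subset)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def recombine(fits):
    """Average the per-subset estimates into a single result.

    Assumption: plain averaging is only one possible recombination rule.
    """
    intercepts, slopes = zip(*fits)
    return mean(intercepts), mean(slopes)


def divide_and_recombine(rows, n_subsets=10, seed=0):
    subsets = divide(rows, n_subsets, seed)
    # Each fit depends only on its own subset, so in a map-reduce
    # setting these calls could run as independent parallel tasks.
    fits = [fit_line(s) for s in subsets]
    return recombine(fits)


# Synthetic data for the demonstration: y = 2 + 3x + Gaussian noise.
_rng = random.Random(42)
data = []
for _ in range(10_000):
    x = _rng.uniform(0.0, 10.0)
    data.append((x, 2.0 + 3.0 * x + _rng.gauss(0.0, 1.0)))

intercept, slope = divide_and_recombine(data, n_subsets=10)
print(f"intercept={intercept:.3f} slope={slope:.3f}")
```

Because each subset is analyzed on its own, the per-subset step never needs the full dataset in memory, which is the property that lets a superlinear method be applied to data too large to process in one piece.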
Keywords :
data analysis; parallel processing; D&R datasets; Hadoop; RHIPE; analytical algorithms; divide and recombine datasets; large dataset analysis; parallel R map-reduce jobs; routine processing tasks; virtual experiments;
Conference_Title :
Large Data Analysis and Visualization (LDAV), 2012 IEEE Symposium on
Conference_Location :
Seattle, WA
Print_ISBN :
978-1-4673-4732-7
DOI :
10.1109/LDAV.2012.6378969