DocumentCode :
249487
Title :
RABID: A Distributed Parallel R for Large Datasets
Author :
Hao Lin ; Shuo Yang ; Midkiff, Samuel P.
Author_Institution :
Electr. & Comput. Eng., Purdue Univ., West Lafayette, IN, USA
fYear :
2014
fDate :
June 27 2014-July 2 2014
Firstpage :
725
Lastpage :
732
Abstract :
Large-scale data mining and deep data analysis are increasingly important for both enterprise and scientific applications. Statistical languages provide rich functionality and ease of use for data analysis and modeling, and have a large user base. R is one of the most widely used of these languages, but it is limited to a single-threaded execution model and to problem sizes that fit in a single node. This paper describes a highly parallel R system called RABID (R Analytics for BIg Data) that maintains R compatibility, leverages the MapReduce-like distributed Spark, and achieves high performance and scaling across clusters. Our experimental evaluation shows that RABID performs up to 5x faster than Hadoop and 20x faster than RHIPE on two data mining applications.
Keywords :
data mining; distributed processing; statistical analysis; MapReduce-like distributed Spark; RABID; data analysis; data mining; distributed parallel R; enterprise applications; large datasets; scientific applications; single-threaded execution model; statistical languages; Data structures; Distributed databases; Fault tolerance; Fault tolerant systems; Programming; Servers; Sparks; Big Data analytics; Data mining; Distributed Computing; R
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Big Data (BigData Congress), 2014 IEEE International Congress on
Conference_Location :
Anchorage, AK
Print_ISBN :
978-1-4799-5056-0
Type :
conf
DOI :
10.1109/BigData.Congress.2014.107
Filename :
6906850