Regional Information Center for Science and Technology - Performance evaluation of enabling logistic regression for big data with R

Abstract:

R is a free, powerful, open source software package with extensive statistical computing and graphics capabilities. Due to its high-level expressiveness and multitude of domain-specific packages, R has become a popular tool for data analysis in many scientific fields. While a number of packages enable running R in parallel across multiple nodes using the Message Passing Interface (MPI), only a few extend R to the newer systems and computing paradigms for data-intensive computing, such as Hadoop and Spark. In this paper, we focus on three approaches, RHadoop, RHIPE, and SparkR, that can scale R to distributed computing systems for solving Big Data problems. We present an algorithm for enabling logistic regression over large data sets under the MapReduce programming model. We implemented the algorithm with each of the three R packages to exploit the benefits of Hadoop and Spark clusters. Our implementations significantly increase the scale of data that can be analyzed with R. We conducted a performance and scalability study on up to 1 TB of data, comparing those implementations with three other common solutions to the logistic regression problem. The results show that SparkR consistently outperformed the other approaches, and they also demonstrate the advantages and limitations of each package.
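The abstract does not spell out the algorithm itself. As a rough illustration only, the following sketch shows one common way to fit logistic regression under a MapReduce-style model: batch gradient descent, where each map task computes a partial gradient over its data partition and a reduce step sums the partials. This is an assumption about the general technique, not the authors' method. It is written in plain Python rather than R so that it is self-contained, and the names `map_partition`, `reduce_partials`, and `fit`, the learning rate, and the toy data are all hypothetical. None of them belong to the RHadoop, RHIPE, or SparkR APIs.

```python
import math
from functools import reduce


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def map_partition(rows, weights):
    """Map step (hypothetical helper): partial log-loss gradient for one partition."""
    grad = [0.0] * len(weights)
    for features, label in rows:
        pred = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for j, x in enumerate(features):
            grad[j] += (pred - label) * x
    return grad, len(rows)


def reduce_partials(a, b):
    """Reduce step (hypothetical helper): sum two (gradient, row count) pairs."""
    return [ga + gb for ga, gb in zip(a[0], b[0])], a[1] + b[1]


def fit(partitions, n_features, learning_rate=0.5, iterations=100):
    """Driver loop: one map/reduce round per gradient-descent iteration."""
    weights = [0.0] * n_features
    for _ in range(iterations):
        # On a real cluster the map calls would run in parallel on separate nodes.
        partials = [map_partition(p, weights) for p in partitions]
        grad, n = reduce(reduce_partials, partials)
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad)]
    return weights


def predict(weights, features):
    """Predicted probability of the positive class."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


# Toy data (assumed for illustration): the first feature is a constant
# intercept term; each inner list stands in for one data partition.
partitions = [
    [([1.0, -2.0], 0), ([1.0, -1.0], 0)],
    [([1.0, 1.0], 1), ([1.0, 2.0], 1)],
]
weights = fit(partitions, n_features=2)
```

Only the fixed-size gradient vector and a row count leave each partition, so the amount of data moved between the map and reduce steps depends on the number of features rather than the number of rows. The price is one full pass over the data per iteration, which is one reason in-memory systems such as Spark tend to suit iterative algorithms of this kind.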