Title :
Integrating Pig with Harp to Support Iterative Applications with Fast Cache and Customized Communication
Author :
Wu, Tak-Lon ; Koppula, Abhilash ; Qiu, Jian
Author_Institution :
Sch. of Inf. & Comput., Indiana Univ., Bloomington, IN, USA
Abstract :
Use of high-level scripting languages to solve big data problems has become a mainstream approach for sophisticated machine learning data analysis. Often data must be used in several steps of a computation to complete a full task. Composing default data transformation operators with the standard Hadoop MapReduce runtime is very convenient. However, the current strategy of using high-level languages to support iterative applications with Hadoop MapReduce relies on an external wrapper script written in another language such as Python or Groovy, which incurs a significant performance loss because mappers and reducers are restarted between jobs. In this paper, we reduce the extra job startup overhead by integrating Apache Pig with Harp, a high-performance Hadoop plug-in developed at Indiana University. The integration provides fast data caching and customized communication patterns across iterations for data analysis. The results show performance improvements by factors of 2 to 5.
Keywords :
authoring languages; iterative methods; learning (artificial intelligence); Apache Pig; Groovy; Hadoop MapReduce; Harp; Indiana University; Python; customized communication; data transformation operators; external wrapper script; fast cache; high-level scripting languages; integrating pig; iterative applications; machine learning data analysis; Computational modeling; Iterative methods; Java; Loading; Programming; Syntactics; Pig; Iterative Algorithms; Big Data; Language; MapReduce;
Conference_Title :
Data-Intensive Computing in the Clouds (DataCloud), 2014 5th International Workshop on
Conference_Location :
New Orleans, LA
DOI :
10.1109/DataCloud.2014.8