Title :
Perldoop: Efficient execution of Perl scripts on Hadoop clusters
Author :
Abuin, Jose M. ; Pichel, Juan C. ; Pena, Tomas F. ; Gamallo, Pablo ; García, Marcos
Author_Institution :
Centro de Investig. en Tecnoloxias da Informacion, Univ. de Santiago de Compostela, Santiago de Compostela, Spain
Abstract :
Hadoop is one of the most important implementations of the MapReduce programming model. It is written in Java, and most programs that run on Hadoop are also written in Java. Hadoop also provides a utility, known as Hadoop Streaming, for executing applications written in other languages. However, the ease of use provided by Hadoop Streaming comes at the cost of noticeably degraded performance. In this work, we introduce Perldoop, a new tool that automatically translates Hadoop-ready Perl scripts into their Java counterparts, which can be executed directly on Hadoop with significantly better performance. We have tested our tool on several Natural Language Processing (NLP) modules, which consist of hundreds of regular expressions, but Perldoop can be used with any Perl code that is ready to run with Hadoop Streaming. Performance results show that Java code generated by Perldoop executes up to 12x faster than the original Perl modules running under Hadoop Streaming. As a result, the new NLP modules can process the whole of Wikipedia in less than 2 hours on a Hadoop cluster with 64 nodes.
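To make the Hadoop Streaming contract mentioned in the abstract concrete, the following is a minimal sketch in Java of a streaming-style mapper: it reads input lines from standard input and writes tab-separated key/value records to standard output, which is how Hadoop Streaming exchanges data with an external program. The word-count task, the class name StreamingStyleMapper, the helper mapLine, and the regular expression are all assumptions chosen for illustration. This is not code produced by Perldoop and does not reflect how the tool performs its translation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; the name and the task are illustrative only.
final class StreamingStyleMapper {

    // Matches runs of Unicode letters, similar to /\p{L}+/g in a Perl script.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Turns one input line into "key<TAB>value" records, one per word found.
    static List<String> mapLine(String line) {
        List<String> records = new ArrayList<>();
        Matcher matcher = WORD.matcher(line);
        while (matcher.find()) {
            String key = matcher.group().toLowerCase(Locale.ROOT);
            records.add(key + "\t1");
        }
        return records;
    }

    // Streaming-style driver: lines in on stdin, records out on stdout.
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            for (String record : mapLine(line)) {
                System.out.println(record);
            }
        }
    }
}
```

A Perl script following the same stdin/stdout convention is the kind of "Hadoop-ready" input the abstract refers to. Keeping the per-line logic in a separate method, as in this sketch, makes it possible to check the record format without starting a Hadoop job.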
Keywords :
Internet; Java; data handling; natural language processing; parallel processing; Hadoop Streaming; Hadoop clusters; Hadoop-ready Perl scripts; Java codes; MapReduce programming model; NLP modules; Perl code; Perl modules; Perldoop; Wikipedia; Arrays; Pragmatics; Programming; Reactive power
Conference_Title :
2014 IEEE International Conference on Big Data (Big Data)
Conference_Location :
Washington, DC
DOI :
10.1109/BigData.2014.7004303