DocumentCode :
1916546
Title :
A Comparative Study of Data Processing Approaches for Text Processing Workflows
Author :
Chen, Ting ; Taura, Koichi
Author_Institution :
Univ. of Tokyo, Tokyo, Japan
fYear :
2012
fDate :
10-16 Nov. 2012
Firstpage :
1260
Lastpage :
1267
Abstract :
Workflows are widely used in data-intensive applications because they facilitate the composition of individual executables or scripts, giving domain experts an easy-to-use form of parallelization. With the considerable popularity of the MapReduce framework, some researchers have started to develop MapReduce-enabled workflows instead of general file-based ones. Meanwhile, parallel database systems, which have been commercially available for large-scale data processing for nearly two decades, have also received wide attention for supporting workflows. This paper studies three real-world text processing workflows and implements them on top of several large-scale data processing approaches: an open source MapReduce implementation (Hadoop), a workflow-oriented parallel database system (Parallel), and a hybrid of MapReduce and parallel DBMS (Hive). We discuss their strengths and weaknesses in terms of both programmability and performance for each workflow. Our experiences and experimental results reveal some interesting trade-offs: (1) High-level query languages (SQL of Parallel and HiveQL of Hive) are helpful for expressing data selection, aggregation and calculation by typical executables; (2) To reuse existing NLP tools, it is often important to be able to track the association between a document and the annotations attached by the tool, for which the expressiveness of SQL is particularly useful; (3) The systems perform similarly in the execution of overall workflows because running the executables takes most of the time, but some small differences reveal potential trade-offs that each system entails for workflows.
Keywords :
SQL; database management systems; natural language processing; parallel programming; text analysis; Hadoop; Hive DBMS; HiveQL; MapReduce framework; NLP tool; Structured Query Language; data aggregation; data calculation; data processing approach; data selection; data-intensive application; large-scale data processing; parallel database system; parallel system; parallelization; query language; text processing workflow; Data processing; MapReduce
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion
Conference_Location :
Salt Lake City, UT
Print_ISBN :
978-1-4673-6218-4
Type :
conf
DOI :
10.1109/SC.Companion.2012.152
Filename :
6495934