A Comparative Study of Data Processing Approaches for Text Processing Workflows

Author

Ting Chen ; Taura, Koichi

Author_Institution

Univ. of Tokyo, Tokyo, Japan

fYear

2012

fDate

10-16 Nov. 2012

Firstpage

1260

Lastpage

1267

Abstract

Workflows are widely used in data-intensive applications because they facilitate the composition of individual executables or scripts, giving domain experts easy-to-use parallelization. With the considerable popularity of the MapReduce framework, some researchers have started to develop MapReduce-enabled workflows instead of general file-based ones. Meanwhile, parallel database systems, which have been commercially available for nearly two decades for large-scale data processing, have also received wide attention for supporting workflows. This paper studies three real-world text processing workflows and develops them on top of several different large-scale data processing approaches: an open source MapReduce implementation (Hadoop), a workflow-oriented parallel database system (Parallel), and a hybrid of MapReduce and parallel DBMS (Hive). We discuss their strengths and weaknesses in terms of both programmability and performance for each workflow. Our experience and experimental results reveal some interesting trade-offs: (1) high-level query languages (the SQL of Parallel and the HiveQL of Hive) are helpful for expressing data selection, aggregation, and calculation by typical executables; (2) to reuse existing NLP tools, it is often important to be able to track the association between a document and the annotation attached by the tool, for which the expressiveness of SQL is particularly useful; (3) the systems perform similarly in executing the overall workflows, because running the executables takes most of the time, but some small differences reveal potential trade-offs that each system entails for workflows.

Keywords

SQL; database management systems; natural language processing; parallel programming; text analysis; Hadoop; Hive DBMS; HiveQL; MapReduce framework; NLP tool; Structured Query Language; data aggregation; data calculation; data processing approach; data selection; data-intensive application; large-scale data processing; parallel database system; parallel system; parallelization; query language; text processing workflow; Data processing; MapReduce; text-processing workflow

fLanguage

English

Publisher

IEEE

Conference_Titel

High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion

Conference_Location

Salt Lake City, UT

Print_ISBN

978-1-4673-6218-4

Type

conf

DOI

10.1109/SC.Companion.2012.152

Filename

6495934