PipeFlow Engine: Pipeline Scheduling with Distributed Workflow Made Simple

Author

Yin Li ; Chuang Lin

Author_Institution

Tsinghua Nat. Lab. for Inf. Sci. & Technol. (TNList), Tsinghua Univ., Beijing, China

fYear

2013

fDate

15-18 Dec. 2013

Firstpage

142

Lastpage

149

Abstract

Distributed computing systems are considered a fundamental architecture for extending resources such as computation speed, storage capacity, and network bandwidth, which are limited on a single processor. Emerging big data processing techniques such as Hadoop take advantage of distributed servers to accomplish scalable parallel computations. Large-scale processing jobs can run interdependently on different servers, or even on different clusters, and be combined into a workflow that produces meaningful outputs. In this paper, we analyze the common demands of big data processing and distributed big data workflow processing. Based on this analysis, we design the PipeFlow Engine, whose features are matched to each of these demands. It orchestrates all involved jobs and schedules them in a batched pipeline mode. We also present two online ranking algorithms built on PipeFlow, and share our experience and best practices in using it.

Keywords

Big Data; parallel processing; pipeline processing; processor scheduling; Hadoop; big data processing techniques; distributed big-data workflow processing; distributed computing system; distributed servers; distributed workflow; fundamental architecture; large-scale processing jobs; online ranking algorithms; parallel computations; pipeflow engine; pipeline scheduling; Data handling; Data storage systems; Engines; Information management; Measurement; Pipelines; Servers; PipeFlow; performance; pipeline; workflow

fLanguage

English

Publisher

IEEE

Conference_Titel

Parallel and Distributed Systems (ICPADS), 2013 International Conference on

Conference_Location

Seoul

ISSN

1521-9097

Type

conf

DOI

10.1109/ICPADS.2013.31

Filename

6808168