Title :
PipeFlow Engine: Pipeline Scheduling with Distributed Workflow Made Simple
Author :
Yin Li ; Chuang Lin
Author_Institution :
Tsinghua Nat. Lab. for Inf. Sci. & Technol. (TNList), Tsinghua Univ., Beijing, China
Abstract :
Distributed computing systems are considered a fundamental architecture for extending resources such as computation speed, storage capacity, and network bandwidth, which are limited on a single processor. Emerging big data processing techniques like Hadoop take advantage of distributed servers to accomplish scalable parallel computations. Large-scale processing jobs can run interdependently on different servers, or even on different clusters, and be combined into a workflow that produces meaningful outputs. In this paper, we analyze the common demands of big-data processing and distributed big-data workflow processing. Based on this analysis, we design the PipeFlow Engine, whose features are matched to each of these demands. It orchestrates all involved jobs and schedules them in a batched pipeline mode. We also present two online ranking algorithms that make use of PipeFlow, and share our experience and best practices in using it.
Keywords :
Big Data; parallel processing; pipeline processing; processor scheduling; Hadoop; big data processing techniques; distributed big-data workflow processing; distributed computing system; distributed servers; distributed workflow; fundamental architecture; large-scale processing jobs; online ranking algorithms; parallel computations; pipeflow engine; pipeline scheduling; Data handling; Data storage systems; Engines; Information management; Measurement; Pipelines; Servers; PipeFlow; performance; pipeline; workflow;
Conference_Title :
Parallel and Distributed Systems (ICPADS), 2013 International Conference on
Conference_Location :
Seoul
DOI :
10.1109/ICPADS.2013.31