Building a Massive Stream Computing Platform for Flexible Applications

Author

Tianjian Chen ; Zhengrui Man ; Hao Li ; Xin Sun ; Raymond K. Wong ; Zhiwei Yu

Author_Institution

Dept. of Basic Technol., Baidu Inc., Beijing, China

fYear

2014

fDate

June 27 2014-July 2 2014

Firstpage

414

Lastpage

421

Abstract

Driven by the rapid growth of large-scale real-time data mining applications for personalized ads and content recommendations, distributed stream processing systems are widely used in modern big-data architectures. Existing stream computing systems are designed mainly around scalability and availability. Other issues that directly affect actual cost and productivity, such as handling fluctuating workloads, the efficiency of stream topology alternation, and the overlapping of computing topologies, are not well studied. To address these issues in a live production environment, this paper proposes a new stream processing architecture based on a scalability-enhanced subscription model. We also present Vortex, a system implemented on this architecture. Vortex is a distributed stream computing system engineered to support flexible applications at Baidu. The new architecture enables Vortex to scale well under highly fluctuating workloads and to perform on-demand stream topology alternations with minimal overhead. Furthermore, the dynamic message routing mechanism of Vortex allows one processing node to serve different stream topologies, which maximizes computing resource utilization when topologies overlap. With these features, Vortex is a powerful platform for both real-time data processing and MapReduce job acceleration. Finally, we discuss several applications at Baidu to demonstrate how Vortex can be deployed for stream computing applications ranging from real-time analytics to efficient large-scale data mining.

Keywords

Big Data; Internet; data mining; parallel programming; resource allocation; software architecture; Baidu; Internet applications; MapReduce job acceleration; Vortex; availability issue; big-data architectures; computing resource utilization maximization; content recommendations; distributed stream processing systems; dynamic message routing mechanism; flexible applications; large scale real-time data mining applications; massive stream computing platform; on-demand stream topology alternations; personalized ads; real-time data processing; scalability enhanced subscription model; topology overlapping; Computational modeling; Computer architecture; Payloads; Routing; Scalability; Servers; Topology; Architecture; DRPC; Real World Applications; Stream Computing

fLanguage

English

Publisher

IEEE

Conference_Title

Big Data (BigData Congress), 2014 IEEE International Congress on

Conference_Location

Anchorage, AK

Print_ISBN

978-1-4799-5056-0

Type

conf

DOI

10.1109/BigData.Congress.2014.67

Filename

6906810