Beyond Simple Integration of RDBMS and MapReduce -- Paving the Way toward a Unified System for Big Data Analytics: Vision and Progress

Author

Xiongpai Qin ; Huiju Wang ; Furong Li ; Baoyao Zhou ; Yu Cao ; Cuiping Li ; Hong Chen ; Xuan Zhou ; Xiaoyong Du ; Shan Wang

Author_Institution

Key Lab. of Data Eng. & Knowledge Eng. (RUC), Beijing, China

fYear

2012

fDate

1-3 Nov. 2012

Firstpage

716

Lastpage

725

Abstract

MapReduce has spread rapidly through both academia and industry in recent years. It can serve as an ETL tool, but it can also do much more: the technique has been applied to SQL summation, OLAP, data mining, machine learning, information retrieval, multimedia data processing, scientific data processing, and other tasks. At its core, MapReduce is a general-purpose parallel computing framework for processing large datasets. A big data analytics ecosystem built around MapReduce is emerging alongside the traditional one built around RDBMS. The objectives of RDBMS and MapReduce, and of the ecosystems built around them, overlap substantially. In some sense they do the same thing, although MapReduce can also handle workloads such as graph processing, which RDBMS cannot handle well. RDBMS, for its part, delivers high performance on relational data processing, an area where MapReduce still needs to catch up. The authors envision the two techniques fusing into a unified system for big data analytics. Much of the groundwork for such a system has been laid, but some critical issues remain unresolved, and we try to identify some of them. We also present two of our own works together with experimental results. The first applies a hierarchical encoding to star schema data in Hadoop to achieve high-performance OLAP processing. The second takes advantage of the three copies that HDFS naturally keeps of each block, storing them in different data layouts to speed up queries in an OLAP workload; a cost model routes each user query to a suitable data layout.

Keywords

SQL; data analysis; data mining; ecology; information retrieval; learning (artificial intelligence); multimedia computing; parallel processing; relational databases; ETL tool; HDFS blocks; Hadoop; MapReduce; OLAP processing; OLAP workload; RDBMS; SQL summation; data analytics ecosystem; data layouts; data mining; ecosystems; general purpose parallel computing framework; graph processing; hierarchical encoding; information retrieval; large dataset processing; machine learning; multimedia data processing; relational data processing; science data processing; star schema data; user query; Data handling; Data mining; Data processing; Data storage systems; Databases; Information management; Layout; Big Data Analytics; MapReduce; OLAP; RDBMS; Unified System; Vision;

fLanguage

English

Publisher

IEEE

Conference_Title

Cloud and Green Computing (CGC), 2012 Second International Conference on

Conference_Location

Xiangtan

Print_ISBN

978-1-4673-3027-5

Type

conf

DOI

10.1109/CGC.2012.39

Filename

6382895