Integrating DBMSs as a Read-Only Execution Layer into Hadoop

Author

An, Mingyuan ; Wang, Yang ; Wang, Weiping ; Sun, Ninghui

Author_Institution

Key Lab. of Comput. Syst. & Archit., Grad. Univ. of Chinese Acad. of Sci., Beijing, China

fYear

2010

fDate

8-11 Dec. 2010

Firstpage

17

Lastpage

26

Abstract

To obtain the efficiency of a DBMS, HadoopDB combines Hadoop with DBMSs and claims superior performance over Hadoop. However, HadoopDB simply places MapReduce on top of unmodified single-machine DBMSs, an approach with several obvious weaknesses. In essence, HadoopDB is a parallel DBMS with fault tolerance, and it incurs unnecessary overhead from the DBMS legacy. Instead of augmenting a DBMS with Hadoop techniques, we propose a new system architecture that integrates modified DBMS engines into Hadoop as a read-only execution layer, in which the DBMS provides efficient read-only operators rather than managing the data. Besides the efficiency gained from the DBMS engine, the architecture has other advantages. The modified DBMS engine processes data directly from HDFS (Hadoop Distributed File System) files at the block level, so data replication is handled naturally by HDFS and block-level parallelism is easily achieved. A global index access mechanism is added in accordance with the MapReduce paradigm. Data loading speed is also preserved by writing data directly into HDFS with simplified logic. Experiments show that our system outperforms both the original Hadoop and a HadoopDB-style system.
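
The block-level, read-only execution idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the in-memory `BLOCKS` list stands in for HDFS blocks, and the names `scan_block` and `run_query` are hypothetical. Each block is scanned independently by a read-only operator (the map step), and the partial results are then combined (the reduce step).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for HDFS blocks: each block is a list of (key, value) rows.
# A real system would read these from replicated HDFS block files.
BLOCKS = [
    [(1, 10), (2, 20)],
    [(3, 30), (4, 40)],
    [(5, 50)],
]

def scan_block(block, predicate):
    """Read-only operator: filter one block and return a partial sum of values."""
    return sum(value for key, value in block if predicate(key))

def run_query(blocks, predicate, workers=2):
    """Map: scan each block independently. Reduce: add up the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda b: scan_block(b, predicate), blocks))
    return sum(partials)

# Sum the values of all rows whose key is odd.
total = run_query(BLOCKS, lambda key: key % 2 == 1)
print(total)
```

Because the operator never writes to a block, any replica of a block could serve the scan, which is the property that lets replication be left entirely to the storage layer in this sketch.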

Keywords

data handling; fault tolerant computing; information retrieval; parallel databases; software architecture; DBMS engine; HDFS; Hadoop distributed file system; HadoopDB; MapReduce; block-level parallelism; data processing; database management system; fault tolerance; index access; parallel DBMS; read-only execution layer; single-machined DBMS; system architecture; Engines; Fault tolerance; Fault tolerant systems; Indexes; Loading; Parallel processing; Hadoop; database; global index access; large-scale data processing

fLanguage

English

Publisher

IEEE

Conference_Titel

Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2010 International Conference on

Conference_Location

Wuhan

Print_ISBN

978-1-4244-9110-0

Electronic_ISBN

978-0-7695-4287-4

Type

conf

DOI

10.1109/PDCAT.2010.43

Filename

5704399