Efficient and Customizable Data Partitioning Framework for Distributed Big RDF Data Processing in the Cloud

Author

Kisung Lee ; Ling Liu ; Yuzhe Tang ; Qi Zhang ; Yang Zhou

Author_Institution

Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA, USA

fYear

2013

fDate

June 28 2013-July 3 2013

Firstpage

327

Lastpage

334

Abstract

Big data businesses can leverage and benefit from Clouds, which are highly optimized, shared, automated, and virtualized computing infrastructures. One of the important challenges in processing big data in the Clouds is how to partition the data effectively so that it can be processed efficiently in a distributed manner. In this paper we present a Scalable and yet customizable data PArtitioning framework, called SPA, for distributed processing of big RDF graph data. We choose big RDF datasets as the focus of our investigation for two reasons. First, the Linking Open Data cloud has put forward a good number of big RDF datasets with tens of billions of triples and hundreds of millions of links. Second, such huge RDF graphs can easily overwhelm any single server because of its limited memory and CPU capacity, and can exceed the processing capacity of many conventional data processing software systems. Our data partitioning framework has two unique features. First, we introduce a suite of vertex-centric data partitioning building blocks to allow efficient and yet customizable partitioning of large heterogeneous RDF graph data. By efficient, we mean that the SPA data partitions can support fast processing of big data of different sizes and complexity. By customizable, we mean that the SPA partitions are adaptive to different query types. Second, we propose a selection of scalable techniques to distribute the building block partitions across a cluster of compute nodes in a manner that minimizes inter-node communication cost by localizing most of the queries on distributed partitions. We evaluate our data partitioning framework and algorithms through extensive experiments using both benchmark and real datasets. Our experimental results show that the SPA data partitioning framework is not only efficient for partitioning and distributing big RDF datasets of diverse sizes and structures but also effective for processing big data queries of different types and complexity.

Keywords

cloud computing; data handling; distributed processing; graph theory; SPA data partitioning; big RDF graph data; building block partition; distributed big RDF data processing; internode communication cost; vertex-centric data partitioning; Benchmark testing; Data storage systems; Distributed databases; Information management; Query processing; Resource description framework

fLanguage

English

Publisher

IEEE

Conference_Titel

Cloud Computing (CLOUD), 2013 IEEE Sixth International Conference on

Conference_Location

Santa Clara, CA

Print_ISBN

978-0-7695-5028-2

Type

conf

DOI

10.1109/CLOUD.2013.63

Filename

6676711