DocumentCode :
650615
Title :
Efficient and Customizable Data Partitioning Framework for Distributed Big RDF Data Processing in the Cloud
Author :
Kisung Lee ; Ling Liu ; Yuzhe Tang ; Qi Zhang ; Yang Zhou
Author_Institution :
Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA, USA
fYear :
2013
fDate :
June 28 2013-July 3 2013
Firstpage :
327
Lastpage :
334
Abstract :
Big data business can leverage and benefit from the Clouds, the most optimized, shared, automated, and virtualized computing infrastructures. One of the important challenges in processing big data in the Clouds is how to effectively partition the big data to ensure efficient distributed processing of the data. In this paper we present a Scalable and yet customizable data PArtitioning framework, called SPA, for distributed processing of big RDF graph data. We choose big RDF datasets as our focus of the investigation for two reasons. First, the Linking Open Data cloud has put forwards a good number of big RDF datasets with tens of billions of triples and hundreds of millions of links. Second, such huge RDF graphs can easily overwhelm any single server due to the limited memory and CPU capacity and exceed the processing capacity of many conventional data processing software systems. Our data partitioning framework has two unique features. First, we introduce a suite of vertexcentric data partitioning building blocks to allow efficient and yet customizable partitioning of large heterogeneous RDF graph data. By efficient, we mean that the SPA data partitions can support fast processing of big data of different sizes and complexity. By customizable, we mean that the SPA partitions are adaptive to different query types. Second, we propose a selection of scalable techniques to distribute the building block partitions across a cluster of compute nodes in a manner that minimizes inter-node communication cost by localizing most of the queries on distributed partitions. We evaluate our data partitioning framework and algorithms through extensive experiments using both benchmark and real datasets. Our experimental results show that the SPA data partitioning framework is not only efficient for partitioning and distributing big RDF datasets of diverse sizes and structures but also effective for processing big data queries of different types and complexity.
Keywords :
cloud computing; data handling; distributed processing; graph theory; SPA data partitioning; big RDF graph data; building block partition; cloud computing; distributed big RDF data processing; internode communication cost; vertex-centric data partitioning; Benchmark testing; Data handling; Data storage systems; Distributed databases; Information management; Query processing; Resource description framework;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Cloud Computing (CLOUD), 2013 IEEE Sixth International Conference on
Conference_Location :
Santa Clara, CA
Print_ISBN :
978-0-7695-5028-2
Type :
conf
DOI :
10.1109/CLOUD.2013.63
Filename :
6676711
Link To Document :
بازگشت