Title :
Efficiently Handling Skew in Outer Joins on Distributed Systems
Author :
Long Cheng ; Kotoulas, Spyros ; Ward, Tomas E. ; Theodoropoulos, Georgios
Author_Institution :
Nat. Univ. of Ireland Maynooth, Maynooth, Ireland
Abstract :
Outer joins are ubiquitous in databases and big data systems. The question of how best to execute outer joins in large parallel systems is particularly challenging as real world datasets are characterized by data skew leading to performance issues. Although skew handling techniques have been extensively studied for inner joins, there is little published work solving the corresponding problem for parallel outer joins. Conventional approaches to this problem such as ones based on hash redistribution often lead to load balancing problems while duplication-based approaches incurs significant overhead in terms of network communication. In this paper, we propose a new algorithm, query with counters (QC), for directly handling skew in outer joins on distributed architectures. We present an efficient implementation of our approach based on the asynchronous partitioned global address space (APGAS) parallel programming model. We evaluate the performance of our approach on a cluster of 192 cores (16 nodes) and datasets of 1 billion tuples with different skew. Experimental results show that our method is scalable and, in cases of high skew, faster than the state-of-the-art.
Keywords :
data handling; parallel programming; resource allocation; APGAS; QC; asynchronous partitioned global address space; big data systems; data skew; databases; distributed architectures; distributed systems; duplication-based approaches; hash redistribution; load balancing problems; network communication; parallel outer joins; parallel programming model; parallel systems; performance evaluation; performance issues; query with counters; skew handling; Density estimation robust algorithm; Distributed databases; Histograms; Probes; Radiation detectors; Scalability; Silicon; X10; data skew; distributed join; outer join; parallel join;
Conference_Titel :
Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
Conference_Location :
Chicago, IL
DOI :
10.1109/CCGrid.2014.35