Title :
Improving load balancing for MapReduce-based entity matching
Author :
Gomes Mestre, Demetrio ; Pires, Carlos Eduardo Santos
Author_Institution :
Inf. Syst. & Databases Group (SINBAD), Fed. Univ. of Campina Grande (UFCG), Campina Grande, Brazil
Abstract :
The effectiveness and scalability of MapReduce-based implementations for data-intensive tasks depend on how data is assigned from map to reduce tasks. The robustness of this assignment strategy is crucial for handling skewed data and distributing the workload evenly among all reduce tasks. For the entity matching problem in the Big Data context, we propose BlockSlicer, a MapReduce-based approach that supports blocking techniques to reduce the entity matching search space. The approach uses a preprocessing MapReduce job to analyze the data distribution and improves load balancing by applying an efficient block slice strategy together with a well-known optimization algorithm to assign the generated match tasks. We evaluate the approach on a real cloud infrastructure against an existing approach that addresses the same problem. The results show that our approach significantly improves the performance of distributed entity matching tasks by reducing the amount of data generated in the map phase and shortening the overall execution time.
Keywords :
Big Data; cloud computing; optimisation; parallel programming; pattern matching; resource allocation; Big Data context; BlockSlicer; MapReduce-based entity matching; MapReduce-based implementation; balanced workload distribution; block slice strategy; blocking techniques; data assignment; data distribution; data-intensive tasks; distributed entity matching task performance; entity matching search space reduction; execution time; load balancing; map phase; optimization algorithm; preprocessing MapReduce job; real cloud infrastructure; skewed data handling; task reduction; Context; Data handling; Indexes; Load management; Optimization; Programming; Entity Matching; Improvement; Load Balancing; MapReduce;
Conference_Title :
Computers and Communications (ISCC), 2013 IEEE Symposium on
Conference_Location :
Split
DOI :
10.1109/ISCC.2013.6755016