• DocumentCode
    3455937
  • Title

    Improving load balancing for MapReduce-based entity matching

  • Author

    Gomes Mestre, Demetrio ; Pires, Carlos Eduardo Santos

  • Author_Institution
    Inf. Syst. & Databases Group (SINBAD), Fed. Univ. of Campina Grande (UFCG), Campina Grande, Brazil
  • fYear
    2013
  • fDate
    7-10 July 2013
  • Abstract
    The effectiveness and scalability of MapReduce-based implementations of data-intensive tasks depend on how data is assigned from map to reduce tasks. The robustness of this assignment strategy is crucial for handling skewed data and achieving a balanced workload distribution among all reduce tasks. For the entity matching problem in the Big Data context, we propose BlockSlicer, a MapReduce-based approach that supports blocking techniques to reduce the entity matching search space. The approach uses a preprocessing MapReduce job to analyze the data distribution and improves load balancing by applying an efficient block slice strategy as well as a well-known optimization algorithm to assign the generated match tasks. We evaluate the approach on a real cloud infrastructure against an existing approach that addresses the same problem. The results show that our approach significantly increases the performance of the distributed entity matching task by reducing the amount of data generated by the map phase and decreasing the overall execution time.
  • Keywords
    Big Data; cloud computing; optimisation; parallel programming; pattern matching; resource allocation; Big Data context; BlockSlicer; MapReduce-based entity matching; MapReduce-based implementation; balanced workload distribution; block slice strategy; blocking techniques; data assignment; data distribution; data-intensive tasks; distributed entity matching task performance; entity matching search space reduction; execution time; load balancing; map phase; optimization algorithm; preprocessing MapReduce job; real cloud infrastructure; skewed data handling; task reduction; Context; Data handling; Indexes; Load management; Optimization; Programming; Entity Matching; Improvement; Load Balancing; MapReduce;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computers and Communications (ISCC), 2013 IEEE Symposium on
  • Conference_Location
    Split
  • Type

    conf

  • DOI
    10.1109/ISCC.2013.6755016
  • Filename
    6755016