DocumentCode
3717164
Title
Parallel meta-blocking: Realizing scalable entity resolution over large, heterogeneous data
Author
Vasilis Efthymiou;George Papadakis;George Papastefanatos;Kostas Stefanidis;Themis Palpanas
Author_Institution
ICS-FORTH, Greece
fYear
2015
Firstpage
411
Lastpage
420
Abstract
Entity resolution constitutes a crucial task for many applications, but has an inherently quadratic complexity. Typically, it scales to large volumes of data through blocking: similar entities are clustered into blocks so that it suffices to perform comparisons only within each block. Meta-blocking further increases efficiency by cleaning the overlapping blocks from unnecessary comparisons. However, even Meta-blocking can be time-consuming: applying it to blocks with 7.4 million entities and 2.2·10^11 comparisons takes almost 8 days on a modern high-end server. In this paper, we parallelize Meta-blocking based on MapReduce. We propose a simple strategy that explicitly creates the core concept of Meta-blocking, the blocking graph. We then describe an advanced strategy that creates the blocking graph implicitly, reducing the overhead of data exchange. We also introduce a load balancing algorithm that distributes the computationally intensive workload evenly among the available compute nodes. Our experimental analysis verifies the superiority of our advanced strategy and demonstrates an almost linear speedup for all meta-blocking techniques with respect to the number of available nodes.
Keywords
"Erbium","Scalability","Servers","Load management","Big data","Complexity theory","Context"
Publisher
IEEE
Conference_Titel
Big Data (Big Data), 2015 IEEE International Conference on
Type
conf
DOI
10.1109/BigData.2015.7363782
Filename
7363782