Title :
On Efficiently Capturing Scientific Properties in Distributed Big Data without Moving the Data: A Case Study in Distributed Structural Biology Using MapReduce
Author :
Zhang, Boyu ; Estrada, Trilce ; Cicotti, Pietro ; Taufer, Michela
Author_Institution :
Univ. of Delaware, Newark, DE, USA
Abstract :
In this paper, we present two variations of a general analysis algorithm for large datasets residing in distributed memory systems. Both variations avoid the need to move data among nodes because they extract relevant data properties locally and concurrently, and transform the analysis problem (e.g., clustering or classification) into a search for property aggregates. We test the two variations on SDSC's Gordon supercomputer, using the MapReduce-MPI library and a structural biology dataset of 100 million protein-ligand records. We evaluate both variations for their sensitivity to data distribution and load imbalance. Our observations indicate that the first variation is sensitive to data content and distribution, while the second variation is not. Moreover, the second variation can self-heal load imbalance, and it outperforms the first in all fifteen cases considered.
Keywords :
Big Data; biology computing; data analysis; distributed databases; distributed memory systems; MapReduce-MPI library; SDSC supercomputer Gordon; data content; data distribution; data property; distributed big data; distributed memory system; distributed structural biology; general analysis algorithm; large datasets; load imbalance; property aggregates; protein-ligand record; scientific property; sensitivity; structural biology dataset; Aggregates; Data mining; Distributed databases; Geometry; Proteins; Supercomputers; Big Data; Classification; Clustering; MapReduce; Molecular Dynamics; protein-ligand docking;
Conference_Title :
Computational Science and Engineering (CSE), 2013 IEEE 16th International Conference on
Conference_Location :
Sydney, NSW
DOI :
10.1109/CSE.2013.28