Title :
Distributed Diversification of Large Datasets
Author :
Hasan, Mohammed ; Mueen, Abdullah ; Tsotras, Vassilis
Author_Institution :
Univ. of California, Riverside, Riverside, CA, USA
Abstract :
Diversification has recently been proposed as an approach that allows a user to better grasp a large result set without having to look through all relevant results. In this paper, we expand the use of diversification as an analytical tool to explore large datasets dispersed over many nodes. The diversification problem is, in general, NP-complete, and existing uniprocessor algorithms are not suitable for the distributed setting of our environment. Using the MapReduce framework, we consider two distinct approaches to solving the distributed diversification problem: one that focuses on optimizing disk I/O and one that optimizes for network I/O. Our approaches are iterative in nature, allowing the user to continue refining the diversification process if more time is available. Moreover, we prove that (i) this iteration process converges and (ii) it produces a 2-approximate diversified result set when compared to the optimal solution. We also develop a cost model to predict the run-time of both approaches based on the network and disk characteristics. We implemented our approaches on a cluster of 40 cores and showed that they are scalable and produce results of the same quality as the state-of-the-art uniprocessor algorithms.
Keywords :
computational complexity; data analysis; optimisation; parallel algorithms; 2-approximate diversified result set; MapReduce framework; NP-complete problem; analytical tool; core cluster; cost model; disk characteristics; distributed diversification problem; large datasets; network I/O optimization; network characteristics; parallel processing; run-time prediction; uniprocessor algorithms; Approximation algorithms; Clustering algorithms; Computational modeling; Cost benefit analysis; Data models; Iterative methods; Partitioning algorithms; Diversity; MapReduce; Parallel Processing;
Conference_Title :
Cloud Engineering (IC2E), 2014 IEEE International Conference on
Conference_Location :
Boston, MA
DOI :
10.1109/IC2E.2014.19