Title :
Deploying Large-Scale Datasets on-Demand in the Cloud: Treats and Tricks on Data Distribution
Author :
Vaquero, Luis M. ; Celorio, Antonio ; Cuadrado, Felix ; Cuevas, Ruben
Author_Institution :
Hewlett-Packard Labs. Security & Cloud Lab., Bristol, UK
Date :
April-June 1 2015
Abstract :
Public clouds have democratised access to analytics for virtually any institution in the world. Virtual machines (VMs) can be provisioned on demand to crunch data once it has been uploaded into them. While this task is trivial for a few tens of VMs, it becomes increasingly complex and time-consuming when the scale grows to hundreds or thousands of VMs crunching tens or hundreds of TB. Moreover, the elapsed time comes at a price: the cost of provisioning VMs in the cloud and keeping them waiting while the data loads. In this paper we present a big data provisioning service that incorporates hierarchical and peer-to-peer data distribution techniques to speed up data loading into the VMs used for data processing. The system dynamically mutates the sources of the data for the VMs to speed up data loading. We tested this solution with 1000 VMs and 100 TB of data, reducing loading time by at least 30 percent over current state-of-the-art techniques. This dynamic topology mechanism is tightly coupled with classic declarative machine configuration techniques (the system takes a single high-level declarative configuration file and configures both software and data loading). Together, these two techniques simplify the deployment of big data in the cloud for end users who may not be experts in infrastructure management.
Keywords :
Big Data; cloud computing; peer-to-peer computing; virtual machines; VM; big data provisioning service; classic declarative machine configuration techniques; data loading; data processing; dynamic topology mechanism; high-level declarative configuration file; infrastructure management; large-scale datasets on-demand; peer-to-peer data distribution techniques; public clouds; Distributed databases; Loading; Relays; Servers; BitTorrent; Large-scale data transfer; big data distribution; flash crowd; p2p everyday; p2p overlay; provisioning
Journal_Title :
Cloud Computing, IEEE Transactions on
DOI :
10.1109/TCC.2014.2360376