Author :
Ryden, Mathew ; Oh, Kwangsung ; Chandra, Aniruddha ; Weissman, J.
Author_Institution :
Comput. Sci. & Eng., Univ. of Minnesota, Minneapolis, MN, USA
Abstract :
Today, centralized data centers, or clouds, have become the de facto platform for data-intensive computing in the commercial and, increasingly, scientific domains. This is because clouds such as Amazon AWS and Microsoft Azure offer large amounts of monetized, co-located computation and storage that are well suited to typical processing tasks such as batch analytics. However, many Big Data applications rely on data that is geographically distributed and not co-located with the centralized computational resources provided by clouds. Examples of such applications include analysis of user data such as blogs, video feeds taken from geographically separated cameras, monitoring and analysis of server and content distribution network (CDN) logs, and scientific data collected from distributed instruments and sensors. Such applications pose a number of challenges for efficient data analytics on today's cloud platforms. First, in many applications, data is both large and widely distributed, and data upload may constitute a non-trivial portion of the execution time. Second, centralized cloud resources present a single point of failure, and network partitions between the data sources and the cloud can also lead to service disruptions. Third, the cost to transport, store, and process data may be beyond the budget of a small-scale application designer or end user. This paper presents Nebula: a dispersed edge cloud infrastructure that provides both computation and data storage to address the above challenges. The use of edge resources is attractive for several reasons. First, there is an increasing amount of computing and storage resources available on the edge, as evidenced by their use in several volunteer computing, file-sharing, and content delivery network (CDN) environments. This capacity is likely to increase further with the availability of powerful multi-core, multi-node desktop and home machines, coupled with increasingly widespread high-bandwidth Internet connectivity. Second, edge resources naturally provide locality to data and users, and hence can be exploited easily for in-situ processing. Finally, if cost is an issue, volunteer edge resources can be utilized at relatively low cost.
Keywords :
Big Data; cloud computing; distributed processing; Amazon AWS; Big Data applications; CDN logs; Microsoft Azure; Nebula infrastructure; batch analytics; centralized cloud resources; centralized clouds; centralized data-centers; computing resources; content distribution network; data-intensive computing; distributed edge cloud infrastructure; multicore multinode desktop; network partitions; storage resources; volunteer computing; Cloud computing; Conferences; Distributed databases; Monitoring; Peer-to-peer computing; Robustness;