Title :
Analysis and Optimization of Data Import with Hadoop
Author :
Xu, Weijia ; Luo, Wei ; Woodward, Nicholas
Author_Institution :
Texas Adv. Comput. Center, Univ. of Texas at Austin, Austin, TX, USA
Abstract :
Data-driven research has become an important part of scientific discovery in an increasing number of disciplines. In many cases, the sheer volume of data to be processed requires not only state-of-the-art computing resources but also carefully tuned, purpose-built software. These requirements are often associated with high operational costs and significant software development expertise. Because it is simple to use and effective at processing big data, Hadoop has become a popular software platform for large-scale data analysis. Using a Hadoop cluster in a remote shared infrastructure lets users avoid the costs of maintaining physical infrastructure. An inevitable step in using a dynamically constructed Hadoop cluster is the initial import of the data. This process is not trivial, particularly when the data set is large. In this paper, we evaluate the costs of importing large-scale data into a Hadoop cluster. We present a detailed analysis of the default data import implementation in Hadoop and conduct a practical evaluation. Our evaluation includes tests with different hardware configurations, such as different network protocols and disk configurations. We also propose an implementation that improves the performance of importing data into a Hadoop cluster, in which the data is accessed directly by the DataNodes during the import process.
Keywords :
data analysis; optimisation; software engineering; Hadoop cluster; data driven research; data import; large-scale data analysis; optimization; scientific discovery; software development; software platform; state-of-the-art computing resources; Computational modeling; Data models; File systems; Hardware; Pipelines; Sockets; Throughput; Hadoop; cloud computing; data transfer; disk I/O;
Conference_Title :
Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International
Conference_Location :
Shanghai
Print_ISBN :
978-1-4673-0974-5
DOI :
10.1109/IPDPSW.2012.129