DocumentCode :
2135729
Title :
Concentric Layout, a New Scientific Data Distribution Scheme in Hadoop File System
Author :
Cheng, Lu ; Shang, Pengju ; Sehrish, Saba ; Mackey, Grant ; Wang, Jun
Author_Institution :
Univ. of Central Florida, Orlando, FL, USA
fYear :
2010
fDate :
15-17 July 2010
Firstpage :
231
Lastpage :
239
Abstract :
The volume of data generated by scientific simulations, sensors, monitors, and optical telescopes is growing at a dramatic rate. To analyze the raw data quickly and space-efficiently, a data pre-processing step is needed to improve performance in the data analysis phase. Current research shows an increasing trend of adopting the MapReduce framework for large-scale data processing. However, the data access patterns commonly applied to scientific data sets are not directly supported by the current MapReduce framework. This gap between the requirements of analytics applications and the properties of the MapReduce framework motivates us to provide support for these data access patterns in MapReduce. In this work, we study the data access patterns in matrix files and propose a new concentric data layout to facilitate matrix data access and analysis in the MapReduce framework. The concentric data layout is a hierarchical data layout that preserves the dimensional properties of large data sets. In contrast to the continuous data layout adopted in the current Hadoop framework, the concentric data layout stores the data from the same sub-matrix in one chunk, and then stores chunks symmetrically at a higher level. This matches matrix-like computation well. The concentric data layout preprocesses the data beforehand and optimizes the subsequent run of the MapReduce application. Experiments show that the concentric data layout improves overall performance, reducing execution time by about 38% when reading a 64 GB file. It also mitigates the overhead of reading unused data and increases the useful data efficiency by 32% on average.
Keywords :
data analysis; distributed processing; file organisation; information retrieval; matrix algebra; Hadoop file system; MapReduce framework; concentric data layout solution; data access pattern; data analysis; data pre-process operation; large scientific data set; matrix data access; matrix files; optical telescope; scientific data distribution scheme; scientific simulation; Analytical models; Arrays; Computational modeling; Data models; Distributed databases; File systems; Layout;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Networking, Architecture and Storage (NAS), 2010 IEEE Fifth International Conference on
Conference_Location :
Macau
Print_ISBN :
978-1-4244-8133-0
Type :
conf
DOI :
10.1109/NAS.2010.59
Filename :
5575650