Title :
Data Replication in Data Intensive Scientific Applications with Performance Guarantee
Author :
Nukarapu, Dharma Teja ; Tang, Bin ; Wang, Liqiang ; Lu, Shiyong
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., Wichita State Univ., Wichita, KS, USA
Abstract :
Data replication has been widely adopted in data-intensive scientific applications to reduce data file transfer time and bandwidth consumption. However, the problem of data replication in Data Grids, an enabling technology for data-intensive applications, has been proven to be NP-hard and even non-approximable, making it difficult to solve. Meanwhile, most previous research in this field is either theoretical investigation without practical consideration or heuristics-based with little or no theoretical performance guarantee. In this paper, we propose a data replication algorithm that not only has a provable theoretical performance guarantee but can also be implemented in a distributed and practical manner. Specifically, we design a polynomial-time centralized replication algorithm whose reduction in total data file access delay is at least half of the reduction achieved by the optimal replication solution. Based on this centralized algorithm, we also design a distributed caching algorithm, which can be easily adopted in a distributed environment such as Data Grids. Extensive simulations are performed to validate the efficiency of our proposed algorithms. Using our own simulator, we show that our centralized replication algorithm performs comparably to the optimal algorithm and other intuitive heuristics under different network parameters. Using GridSim, a popular distributed Grid simulator, we demonstrate that the distributed caching technique significantly outperforms an existing popular file caching technique in Data Grids, and that it is more scalable and more adaptive to dynamic changes in file access patterns in Data Grids.
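The abstract does not spell out the centralized algorithm, so the following is only a generic illustration of the greedy style of replica placement that the keywords hint at, not the authors' method. It assumes unit-size files, a known pairwise delay matrix, and per-node replica slots; the names `total_delay`, `greedy_replication`, `dist`, `requests`, `sources`, and `capacity` are all hypothetical. At each step the sketch places the single replica that most reduces the total access delay and stops when no placement helps.

```python
from itertools import product


def total_delay(dist, requests, replicas):
    """Sum of frequency * delay to the nearest replica over all requests."""
    return sum(
        freq * min(dist[node][r] for r in replicas[f])
        for (node, f), freq in requests.items()
    )


def greedy_replication(dist, requests, sources, capacity):
    """Greedy placement sketch (assumed model, unit-size files).

    dist:     dist[u][v] is the access delay between nodes u and v
    requests: {(node, file): access frequency}
    sources:  {file: node holding the original copy}
    capacity: {node: number of replica slots available}
    """
    replicas = {f: {src} for f, src in sources.items()}
    free = dict(capacity)
    while True:
        base = total_delay(dist, requests, replicas)
        best_gain, best = 0, None
        for node, f in product(free, replicas):
            if free[node] == 0 or node in replicas[f]:
                continue
            # Tentatively place the replica and measure the delay reduction.
            replicas[f].add(node)
            gain = base - total_delay(dist, requests, replicas)
            replicas[f].remove(node)
            if gain > best_gain:
                best_gain, best = gain, (node, f)
        if best is None:  # no placement reduces delay, or storage is full
            return replicas
        node, f = best
        replicas[f].add(node)
        free[node] -= 1
```

Each round re-evaluates every feasible (node, file) pair, which keeps the sketch short at the cost of efficiency; it makes no claim about the approximation ratio stated in the abstract.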
Keywords :
cache storage; computational complexity; data analysis; data reduction; distributed processing; grid computing; GridSim; NP-hard; bandwidth consumption; data file access delay; data file transfer time reduction; data grids; data intensive scientific applications; data replication algorithm; distributed caching algorithm; distributed caching technique; distributed environment; distributed grid simulator; file access patterns; intuitive heuristics; optimal replication solution; polynomial time centralized replication algorithm; popular file caching technique; theoretical performance guarantee; Algorithm design and analysis; Bandwidth; Computational modeling; Data models; Distributed databases; Greedy algorithms; Heuristic algorithms; Data intensive applications; data replication; simulations
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
DOI :
10.1109/TPDS.2010.207