DocumentCode :
669934
Title :
Characterizing the efficiency of data deduplication for big data storage management
Author :
Ruijin Zhou ; Ming Liu ; Tao Li
Author_Institution :
Univ. of Florida, Gainesville, FL, USA
fYear :
2013
fDate :
22-24 Sept. 2013
Firstpage :
98
Lastpage :
108
Abstract :
The demand for data storage and processing is increasing rapidly in the big data era. Such a tremendous amount of data pushes the limits of storage capacity and of the storage network. A significant portion of the dataset in big data workloads is redundant. As a result, deduplication technology, which removes replicas, becomes an attractive solution to save disk space and traffic in a big data environment. However, the overhead of extra CPU computation (hash indexing) and IO latency introduced by deduplication should be considered. Therefore, the net effect of using deduplication for big data workloads needs to be examined. To this end, we characterize the redundancy of typical big data workloads to justify the need for deduplication. We analyze and characterize the performance and energy impact of deduplication under various big data environments. In our experiments, we identify three sources of redundancy in big data workloads: 1) deploying more nodes, 2) expanding the dataset, and 3) using replication mechanisms. We elaborate on the advantages and disadvantages of different deduplication layers, locations, and granularities. In addition, we uncover the relation between energy overhead and the degree of redundancy. Furthermore, we investigate the deduplication efficiency in an SSD environment for big data workloads.
Keywords :
Big Data; database indexing; power aware computing; storage management; CPU computation; IO latency; SSD environment; big data storage management; big data workloads; data deduplication; data processing; deduplication granularities; deduplication layers; deduplication locations; disk space; energy impact; energy overhead; hash indexing; replication mechanisms; storage capacity; storage network; Availability; Benchmark testing; Data handling; Data storage systems; Indexing; Information management; Redundancy; Big Data; Deduplication; Storage Management;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Workload Characterization (IISWC), 2013 IEEE International Symposium on
Conference_Location :
Portland, OR
Print_ISBN :
978-1-4799-0553-9
Type :
conf
DOI :
10.1109/IISWC.2013.6704674
Filename :
6704674