Inline Data Deduplication for SSD-Based Distributed Storage

Author

Binqi Zhang;Chen Wang;Bing Bing Zhou;Albert Y. Zomaya

Author_Institution

Sch. of Inf. Technol., Univ. of Sydney, Sydney, NSW, Australia

fYear

2015

Firstpage

593

Lastpage

600

Abstract

Data deduplication is used to overcome two issues with Solid State Drives (SSDs). One is the price per GB of storage space, and the other is the write limit, or disk endurance. By eliminating duplicate data, a deduplication system improves storage efficiency and protects SSDs from unnecessary writes. CAFTL is a known solution for deduplication on a single SSD. However, simply applying CAFTL to SSDs in a cluster does not work well. We propose a system architecture for inline deduplication based on the existing protocol of the Hadoop Distributed File System (HDFS), aiming to address the performance challenges of primary storage. Two routing algorithms are presented and evaluated using selected real-life data sets. Compared to prior work, one routing algorithm (MMHR) may improve the deduplication ratio by 8% at minimal cost, while the other (FFFR) can achieve a deduplication ratio about 30% higher at the cost of chunk-level fragmentation. A new research problem, the assignment of chunks to more than one node for deduplication, is also formulated to encourage further study in this area.
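
As a rough illustration of what inline deduplication and the deduplication ratio mean, the following minimal sketch stores each unique chunk once and reports logical bytes written divided by physical bytes stored. It is not the system proposed in the paper, and it does not implement CAFTL, MMHR, or FFFR. The fixed 4 KB chunk size, the SHA-1 fingerprint, the single in-memory index, and the names `DedupStore`, `write`, `read`, and `dedup_ratio` are all assumptions made for this example.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


class DedupStore:
    """Hypothetical single-node store that keeps each unique chunk once."""

    def __init__(self):
        self.chunks = {}         # fingerprint -> chunk bytes
        self.logical_bytes = 0   # bytes the client asked to write
        self.physical_bytes = 0  # bytes actually stored after deduplication

    def write(self, data: bytes) -> list:
        """Split data into chunks, store only unseen ones, return the recipe."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha1(chunk).hexdigest()
            self.logical_bytes += len(chunk)
            if fingerprint not in self.chunks:
                # Inline: the duplicate check happens before the chunk is stored,
                # so a repeated chunk never causes a second physical write.
                self.chunks[fingerprint] = chunk
                self.physical_bytes += len(chunk)
            recipe.append(fingerprint)
        return recipe

    def read(self, recipe: list) -> bytes:
        """Rebuild the original data from its list of fingerprints."""
        return b"".join(self.chunks[fp] for fp in recipe)

    def dedup_ratio(self) -> float:
        """Logical bytes written divided by physical bytes stored."""
        if self.physical_bytes == 0:
            return 1.0
        return self.logical_bytes / self.physical_bytes


store = DedupStore()
data = b"A" * (3 * CHUNK_SIZE) + b"B" * CHUNK_SIZE
recipe = store.write(data)
```

In a cluster, each node would hold its own index, so the choice of which node receives a chunk determines whether a duplicate is detected. That choice is the routing problem the abstract refers to.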

Keywords

"Routing","Indexes","Distributed databases","Clustering algorithms","Systems architecture","Cloud computing","Metadata"

Publisher

IEEE

Conference_Titel

Parallel and Distributed Systems (ICPADS), 2015 IEEE 21st International Conference on

Electronic_ISBN

1521-9097

Type

conf

DOI

10.1109/ICPADS.2015.80

Filename

7384343