• DocumentCode
    3537774
  • Title

    Droplet: A Distributed Solution of Data Deduplication

  • Author

    Zhang, Yang ; Wu, Yongwei ; Yang, Guangwen

  • Author_Institution
    Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
  • fYear
    2012
  • fDate
    20-23 Sept. 2012
  • Firstpage
    114
  • Lastpage
    121
  • Abstract
    Creating backup copies is the most commonly used technique for protecting against data loss. To increase reliability, performing backups routinely is a best practice. Such backup activity creates multiple redundant data streams, which are not economical to store directly on disk. Similarly, enterprise archival systems usually deal with redundant data that needs to be stored for later access. Deduplication is an essential technique in these situations: it avoids storing identical data segments and thus saves a significant portion of disk space. Recent studies have also shown that deduplication can effectively reduce the disk space used to store virtual machine (VM) disk images. We present Droplet, a distributed deduplication storage system designed for high throughput and scalability. Droplet stripes input data streams across multiple storage nodes, which limits the number of data segments stored on each node and ensures that the fingerprint index fits in memory. The in-memory fingerprint index avoids the disk bottleneck discussed in Data Domain and ChunkStash and provides excellent lookup performance. The buffering layer in Droplet provides good write performance for small data segments. Compression of data segments reduces disk usage one step further.
  • Keywords
    back-up procedures; buffer storage; client-server systems; data compression; disc storage; virtual machines; VM disk images; backup copy creation; buffering layer; data segment compression; disk bottleneck; disk space reduction; disk usage reduction; distributed data deduplication solution; distributed deduplication storage system; droplet system; enterprise archival systems; fingerprint index; in-memory fingerprint index; input data streams; lookup performance; multiple redundant data streams; scalability; small data segments; storage nodes; stored data segments; throughput; virtual machine disk images; write performance; Containers; Fingerprint recognition; Indexes; Random access memory; Servers; Throughput; Virtual machining; cluster; deduplication; storage system
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Grid Computing (GRID), 2012 ACM/IEEE 13th International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1550-5510
  • Print_ISBN
    978-1-4673-2901-9
  • Type

    conf

  • DOI
    10.1109/Grid.2012.21
  • Filename
    6319161
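
The deduplication scheme summarized in the abstract (fingerprint each data segment, route it to one of several storage nodes, look it up in that node's in-memory index, and compress only segments not seen before) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the record: fixed-size segmentation, SHA-1 fingerprints, modulo placement on the fingerprint, zlib compression, and the hypothetical names `StorageNode` and `Cluster`. The buffering layer mentioned in the abstract is not modelled.

```python
import hashlib
import zlib


class StorageNode:
    """Hypothetical storage node: an in-memory fingerprint index over compressed segments."""

    def __init__(self):
        # fingerprint -> compressed segment (the dict stands in for on-disk storage)
        self.index = {}

    def put(self, fingerprint, segment):
        """Store the segment if unseen. Returns True when new data was written."""
        if fingerprint in self.index:
            return False  # duplicate: nothing is stored again
        self.index[fingerprint] = zlib.compress(segment)
        return True

    def get(self, fingerprint):
        return zlib.decompress(self.index[fingerprint])


class Cluster:
    """Hypothetical cluster that spreads segments over nodes by fingerprint."""

    def __init__(self, num_nodes):
        self.nodes = [StorageNode() for _ in range(num_nodes)]

    def _node_for(self, fingerprint):
        # Assumed placement rule: first 32 bits of the fingerprint modulo node count.
        return self.nodes[int(fingerprint[:8], 16) % len(self.nodes)]

    def write(self, data, segment_size=4096):
        """Split data into fixed-size segments and deduplicate them.

        Returns the list of fingerprints needed to rebuild the data and
        the number of segments that were actually stored.
        """
        recipe = []
        new_segments = 0
        for offset in range(0, len(data), segment_size):
            segment = data[offset:offset + segment_size]
            fingerprint = hashlib.sha1(segment).hexdigest()
            if self._node_for(fingerprint).put(fingerprint, segment):
                new_segments += 1
            recipe.append(fingerprint)
        return recipe, new_segments

    def read(self, recipe):
        """Rebuild the original data from its list of fingerprints."""
        return b"".join(self._node_for(fp).get(fp) for fp in recipe)


cluster = Cluster(num_nodes=4)
stream = b"A" * 4096 * 3 + b"B" * 4096  # four segments, two distinct
recipe, stored = cluster.write(stream)
recipe_again, stored_again = cluster.write(stream)  # a repeated backup
restored = cluster.read(recipe)
```

Because identical segments always hash to the same fingerprint, they always land on the same node, so each node only needs an index for its own share of the segments. In this sketch the first write stores two segments and the repeated write stores none.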