  • DocumentCode
    3387
  • Title
    Similarity and Locality Based Indexing for High Performance Data Deduplication
  • Author
    Wen Xia ; Hong Jiang ; Dan Feng ; Yu Hua
  • Author_Institution
    Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, China
  • Volume
    64
  • Issue
    4
  • fYear
    2015
  • fDate
    April 2015
  • Firstpage
    1162
  • Lastpage
    1176
  • Abstract
    Data deduplication has gained increasing attention and popularity as a space-efficient approach in backup storage systems. One of the main challenges for centralized data deduplication is the scalability of fingerprint-index search. In this paper, we propose SiLo, a near-exact and scalable deduplication system that effectively and complementarily exploits similarity and locality of data streams to achieve high duplicate elimination, high throughput, and well-balanced load at extremely low RAM overhead. The main idea behind SiLo is to expose and exploit more similarity by grouping strongly correlated small files into a segment and segmenting large files, and to leverage the locality in the data stream by grouping contiguous segments into blocks to capture similar and duplicate data missed by the probabilistic similarity detection. SiLo also employs a locality-based stateless routing algorithm to parallelize and distribute data blocks to multiple backup nodes. By judiciously enhancing similarity through the exploitation of locality and vice versa, SiLo is able to significantly reduce RAM usage for index lookup, achieve near-exact duplicate-elimination efficiency, maintain a high deduplication throughput, and obtain load balance among backup nodes.
  • Keywords
    database indexing; meta data; resource allocation; RAM overhead; RAM usage reduction; SiLo; backup nodes; backup storage systems; centralized data deduplication; contiguous segment grouping; data block distribution; data block parallelization; data stream locality; data stream similarity; deduplication throughput; duplicate elimination; fingerprint-index search scalability; high-performance data deduplication; index-lookup; large-file segmentation; load balancing; locality based indexing; locality based stateless routing algorithm; locality leveraging; near-exact efficiency; near-exact-scalable deduplication system; probabilistic similarity detection; similarity based indexing; space-efficient approach; strongly-correlated small-file grouping; Indexing; Probabilistic logic; Random access memory; Scalability; Servers; Throughput; Data deduplication; index structure; performance evaluation; storage system;
  • fLanguage
    English
  • Journal_Title
    Computers, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9340
  • Type
    jour
  • DOI
    10.1109/TC.2014.2308181
  • Filename
    6747963