  • DocumentCode
    2821745
  • Title
    MAD2: A scalable high-throughput exact deduplication approach for network backup services
  • Author
    Wei, Jiansheng ; Jiang, Hong ; Zhou, Ke ; Feng, Dan
  • Author_Institution
    Sch. of Comput., Huazhong Univ. of Sci. & Technol., Wuhan, China
  • fYear
    2010
  • fDate
    3-7 May 2010
  • Firstpage
    1
  • Lastpage
    14
  • Abstract
    Deduplication is widely used in disk-based secondary storage systems to improve space efficiency. However, scalable high-throughput deduplication storage faces two challenges. The first is the duplicate-lookup disk bottleneck: the data index usually exceeds the available RAM, which limits deduplication throughput. The second is the storage node island effect, in which duplicate data spread across multiple storage nodes is difficult to eliminate. Existing approaches fail to eliminate duplicates completely while simultaneously addressing both challenges. This paper proposes MAD2, a scalable high-throughput exact deduplication approach for network backup services. MAD2 eliminates duplicate data at both the file level and the chunk level, employing four techniques to accelerate the deduplication process and distribute data evenly. First, MAD2 organizes fingerprints into a Hash Bucket Matrix (HBM), whose rows can be used to preserve the data locality in backups. Second, MAD2 uses a Bloom Filter Array (BFA) as a quick index to identify non-duplicate incoming data objects or indicate where to find a possible duplicate. Third, a Dual Cache is integrated in MAD2 to effectively capture and exploit data locality. Finally, MAD2 employs a DHT-based Load-Balance technique to evenly distribute data objects among multiple storage nodes in their backup sequences, further enhancing performance with a well-balanced load. We evaluate MAD2 on the backend storage of B-Cloud, a research-oriented distributed system that provides network backup services. Experimental results show that MAD2 significantly outperforms state-of-the-art approximate deduplication approaches in terms of deduplication efficiency, supporting a deduplication throughput of at least 100 MB/s for each storage component.
  • Keywords
    disc storage; distributed databases; random-access storage; resource allocation; DHT based load balance technique; MAD2; RAM; bloom filter array; disk based secondary storage system; dual cache; duplicate lookup disk bottleneck; hash bucket matrix; network backup service; research oriented distributed system; scalable high throughput exact deduplication; storage node island effect; Acceleration; Computer networks; Costs; Fingerprint recognition; Laboratories; Network servers; Peer to peer computing; Scalability; Space technology; Throughput
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on
  • Conference_Location
    Incline Village, NV
  • Print_ISBN
    978-1-4244-7152-2
  • Electronic_ISBN
    978-1-4244-7153-9
  • Type
    conf
  • DOI
    10.1109/MSST.2010.5496987
  • Filename
    5496987
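
The abstract describes the Bloom Filter Array (BFA) only as a quick index that either identifies an incoming data object as non-duplicate or indicates where a possible duplicate may be found. The record does not say how the BFA is built, so the following is a minimal sketch of that general idea and not the paper's implementation. The class names, the filter size, the number of probes, the SHA-1 based double hashing, and the choice of one filter per index group are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter. The m_bits and k defaults are illustrative assumptions."""

    def __init__(self, m_bits=1 << 16, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, fingerprint):
        # Double hashing: derive k bit positions from two 64-bit slices of a SHA-1 digest.
        digest = hashlib.sha1(fingerprint).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, fingerprint):
        for p in self._positions(fingerprint):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, fingerprint):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(fingerprint))


class BloomFilterArray:
    """Hypothetical array of filters, one per group of stored fingerprints."""

    def __init__(self, n_filters=4):
        self.filters = [BloomFilter() for _ in range(n_filters)]

    def add(self, index, fingerprint):
        self.filters[index].add(fingerprint)

    def candidates(self, fingerprint):
        # An empty list means no filter matched, so the object is certainly new.
        # Otherwise the indices say which groups to check for a possible duplicate.
        return [i for i, f in enumerate(self.filters)
                if f.might_contain(fingerprint)]


bfa = BloomFilterArray()
fp = hashlib.sha1(b"chunk payload").digest()
print(bfa.candidates(fp))  # [] because every filter is still empty
bfa.add(2, fp)
print(bfa.candidates(fp))  # [2] because only filter 2 has any bits set
```

A non-empty result is only a hint: Bloom filters can report false positives, so an exact deduplication scheme would still need to confirm the match against the stored fingerprints of the indicated group before treating the object as a duplicate.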