• DocumentCode
    170450
  • Title

    SAP: Similarity-aware partitioning for efficient cloud storage

  • Author

    Balasubramanian, Balamurugan ; Tian Lan ; Mung Chiang

  • Author_Institution
    Princeton Univ., Princeton, NJ, USA
  • fYear
    2014
  • fDate
    April 27 2014-May 2 2014
  • Firstpage
    592
  • Lastpage
    600
  • Abstract
    Given a set of files that show a certain degree of similarity, we consider a novel problem of deduplicating them (eliminating redundant chunks) across a set of distributed servers in a manner that is: (i) space-efficient: the total space needed to deduplicate and store the files is minimized, and (ii) access-efficient: each file can be accessed by communicating with a bounded number of servers, thereby minimizing network-access times in congested data center networks. A space-optimal solution in which we first deduplicate all the files and then distribute them across the servers (referred to as chunk-distribution) may require communication with many servers to access each file. On the other hand, an access-efficient solution in which we randomly partition the files across the servers and then store their unique chunks on each server may not exploit the similarities across files to reduce the space overhead. In this paper, we first show that finding an access-efficient, space-optimal solution is an NP-hard problem. Following this, we present the similarity-aware-partitioning (SAP) algorithms, which find access-efficient solutions in polynomial time and guarantee bounded space overhead for arbitrary files. Our experimental verification on files from Dropbox and CNN confirms that the SAP technique is much more space-efficient than random partitioning, while maintaining a compression ratio close to that of the chunk-distribution solution.
  • Keywords
    cloud computing; computational complexity; computer centres; data communication; file servers; storage management; CNN; Dropbox; NP-hard problem; SAP technique; arbitrary files; bounded space overhead; chunk distribution; chunk storage; cloud storage; data center network congestion; deduplicating problem; distributed server; efficient optimal solution access; file access; file distribution; file storage; network access time minimization; polynomial time complexity; random partition; similarity aware partitioning; space optimal solution; Computers; Conferences; Partitioning algorithms; Polynomials; Servers; Upper bound; Writing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    INFOCOM, 2014 Proceedings IEEE
  • Conference_Location
    Toronto, ON
  • Type

    conf

  • DOI
    10.1109/INFOCOM.2014.6847984
  • Filename
    6847984
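
The space/access trade-off described in the abstract can be illustrated with a small sketch. This is not the SAP algorithm from the paper; the file names, chunk IDs, helper functions, and both server assignments below are hypothetical and chosen only to show how the placement of similar files changes the total number of stored chunks when each server deduplicates only its own files.

```python
from itertools import chain

# Hypothetical files, each modelled as a set of chunk IDs.
# "a" and "b" are similar to each other, as are "c" and "d".
files = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 5},
    "c": {6, 7, 8, 9},
    "d": {6, 7, 8, 10},
}


def space_global_dedup(files):
    """Chunks stored when all files are deduplicated together
    (the space-optimal baseline; chunks may then be spread over many servers)."""
    return len(set(chain.from_iterable(files.values())))


def space_partitioned(files, assignment, num_servers):
    """Chunks stored when every file lives on exactly one server and
    each server deduplicates only the files assigned to it."""
    per_server = [set() for _ in range(num_servers)]
    for name, chunks in files.items():
        per_server[assignment[name]] |= chunks
    return sum(len(chunks) for chunks in per_server)


# Assignment that ignores similarity (as a random partition might).
oblivious = {"a": 0, "c": 0, "b": 1, "d": 1}
# Assignment that co-locates similar files (hand-picked, for illustration).
similarity_aware = {"a": 0, "b": 0, "c": 1, "d": 1}

global_space = space_global_dedup(files)
oblivious_space = space_partitioned(files, oblivious, 2)
aware_space = space_partitioned(files, similarity_aware, 2)

print(global_space, oblivious_space, aware_space)
```

In both partitioned cases each file is read from a single server, so access cost is the same; only the storage cost differs. In this toy instance, co-locating similar files happens to match the globally deduplicated size, which will not hold in general.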