• DocumentCode
    3226347
  • Title
    Sequence of Hashes Compression in Data De-duplication
  • Author
    Balachandran, Subashini ; Constantinescu, Cornel
  • Author_Institution
    IBM Almaden Res. Center, San Jose
  • fYear
    2008
  • fDate
    25-27 March 2008
  • Firstpage
    505
  • Lastpage
    505
  • Abstract
    Data de-duplication is a simple compression method, popular in storage archival and backup, that partitions large data objects (files) into smaller parts (called chunks) and, for the purpose of communication or storage, replaces each chunk with its ID, generally a cryptographic hash such as SHA-1 of the chunk data [A. Muthitacharoen et al., 2001], [D.R. Bobbarjung et al., 2006]. The compression ratio achieved by de-duplication can be improved by (1) increasing the likelihood of matching new chunks against the dictionary (archived) chunks and/or (2) compressing the list of hashes (indexes of 20 bytes each). Using smaller chunk sizes increases the chance of a match, but generates many more hashes. The chunk repository is a hash table in which each entry stores the SHA-1 value of the chunk and the chunk data. In addition, with each newly created entry we store a chronological pointer linking it to the next new entry. When the hashes produced by the chunker follow the chronological pointers, we encode them as a sequence of hashes by specifying the first hash in the sequence and the length of the sequence; when the same hash is generated repeatedly, we encode it as a run of hashes by specifying its value and the number of repeated occurrences. The chronological pointers are useful because, when successive versions of a file or set of files are archived, large contiguous areas remain unchanged between versions, and the chronological pointers are predictors of this contiguity. If the contiguity is broken, there is a small loss in hash sequence compression.
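    The encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names ChunkStore, encode, and decode, the token layout (kind, hash, count), and the use of two dictionaries in place of a single hash-table entry are assumptions made for illustration. The sketch also assumes that the encoder and decoder see the same repository state, and it takes chunks as given rather than modelling the chunker.

```python
import hashlib


class ChunkStore:
    """Hypothetical chunk repository: SHA-1 -> chunk data, plus chronological pointers."""

    def __init__(self):
        self.data = {}        # hash -> chunk bytes
        self.next = {}        # hash -> hash of the next newly created entry
        self.last_new = None  # hash of the most recently created entry

    def add(self, chunk):
        """Store the chunk if it is new and return its SHA-1 digest (20 bytes)."""
        h = hashlib.sha1(chunk).digest()
        if h not in self.data:
            self.data[h] = chunk
            if self.last_new is not None:
                # Link the previous new entry to this one.
                self.next[self.last_new] = h
            self.last_new = h
        return h


def encode(hashes, next_ptr):
    """Collapse a hash list into ("seq" | "run" | "hash", first_hash, count) tokens."""
    tokens = []
    i = 0
    while i < len(hashes):
        h = hashes[i]
        # Length of a run: the same hash repeated.
        j = i + 1
        while j < len(hashes) and hashes[j] == h:
            j += 1
        # Length of a sequence: hashes that follow the chronological pointers.
        k, cur = i + 1, h
        while k < len(hashes) and next_ptr.get(cur) == hashes[k]:
            cur = hashes[k]
            k += 1
        if j - i > 1 and j - i >= k - i:
            tokens.append(("run", h, j - i))
            i = j
        elif k - i > 1:
            tokens.append(("seq", h, k - i))
            i = k
        else:
            tokens.append(("hash", h, 1))  # no contiguity: emit the hash itself
            i += 1
    return tokens


def decode(tokens, next_ptr):
    """Expand tokens back into the original hash list."""
    out = []
    for kind, h, count in tokens:
        if kind == "run":
            out.extend([h] * count)
        else:
            for _ in range(count):
                out.append(h)
                h = next_ptr.get(h)
    return out
```

    In this sketch, archiving a second version of a file that differs from the first by one inserted chunk breaks one long sequence into two shorter ones plus a single literal hash, which is the "small loss" the abstract mentions when contiguity is broken.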
  • Keywords
    cryptography; data compression; chronological pointer linking; chunk data; chunk repository; cryptographic hash; data de-duplication; hash sequence compression ratio; hash table; storage archival; Cryptography; Data compression; Dictionaries; Joining processes; Operating systems; Data De-duplication; cryptographic hashes compression;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Compression Conference, 2008. DCC 2008
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Print_ISBN
    978-0-7695-3121-2
  • Type
    conf
  • DOI
    10.1109/DCC.2008.80
  • Filename
    4483332