Sequence of Hashes Compression in Data De-duplication

Author

Balachandran, Subashini ; Constantinescu, Cornel

Author_Institution

IBM Almaden Research Center, San Jose

fYear

2008

fDate

25-27 March 2008

Firstpage

505

Lastpage

505

Abstract

Data de-duplication is a simple compression method, popular in storage archival and backup, that consists of partitioning large data objects (files) into smaller parts (named chunks) and replacing the chunks, for the purpose of communication or storage, by their IDs, generally a cryptographic hash such as SHA-1 of the chunk data [A. Muthitacharoen et al., 2001], [D.R. Bobbarjung et al., 2006]. The compression ratio achieved by de-duplication can be improved by (1) increasing the likelihood of matching new chunks against the dictionary (archived) chunks and/or (2) compressing the list of hashes (indexes, of 20 bytes each). Using smaller chunk sizes increases the chance of matching, but many more hashes are generated. The chunk repository is a hash table in which each entry stores the SHA-1 value of the chunk and the chunk data. In addition, with each newly created entry we store a chronological pointer linking it to the next new entry. When the hashes produced by the chunker follow the chronological pointers, we encode them as a sequence of hashes by specifying the first hash in the sequence and the length of the sequence; when the same hash is generated repeatedly, we encode it as a run of hashes by specifying its value and the number of repeated occurrences. The usefulness of the chronological pointers derives from the insight that, when archiving successive versions of a file or set of files, large contiguous areas remain unchanged between versions, and the chronological pointers are predictors of this contiguity. If the contiguity is broken, there is a small loss in the hash sequence compression.
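
The sequence and run encoding described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `ChunkRepository`, the functions `encode` and `decode`, the token layout (`"new"`, `"seq"`, `"run"`), and the choice to emit a literal `"new"` token for a chunk not yet in the repository are all assumptions made for illustration, and the chunks are supplied pre-split rather than produced by a real chunker.

```python
import hashlib


class ChunkRepository:
    """Hash table keyed by SHA-1 digest. Each entry holds the chunk data and
    a chronological pointer to the next newly created entry (hypothetical layout)."""

    def __init__(self):
        self.entries = {}     # digest -> [chunk_data, next_digest_or_None]
        self.last_new = None  # digest of the most recently created entry

    def add(self, digest, data):
        self.entries[digest] = [data, None]
        if self.last_new is not None:
            # Link the previous new entry to this one (set exactly once).
            self.entries[self.last_new][1] = digest
        self.last_new = digest

    def next_of(self, digest):
        entry = self.entries.get(digest)
        return entry[1] if entry else None


def encode(chunks, repo):
    """Turn a stream of chunks into tokens:
    ("new", h), ("seq", first_hash, length) or ("run", h, count)."""
    tokens = []
    tail = None  # hash of the previous chunk in the stream
    for data in chunks:
        h = hashlib.sha1(data).digest()
        last = tokens[-1] if tokens else None
        if h not in repo.entries:
            repo.add(h, data)                   # unseen chunk: store it
            tokens.append(("new", h))
        elif last and last[0] == "run" and last[1] == h:
            tokens[-1] = ("run", h, last[2] + 1)        # extend a run
        elif last == ("seq", h, 1):
            tokens[-1] = ("run", h, 2)                  # same hash twice: start a run
        elif last and last[0] == "seq" and repo.next_of(tail) == h:
            tokens[-1] = ("seq", last[1], last[2] + 1)  # follows the pointer
        else:
            tokens.append(("seq", h, 1))        # contiguity broken: new sequence
        tail = h
    return tokens


def decode(tokens, repo):
    """Expand tokens back into the full list of hashes."""
    hashes = []
    for tok in tokens:
        if tok[0] == "new":
            hashes.append(tok[1])
        elif tok[0] == "run":
            hashes.extend([tok[1]] * tok[2])
        else:  # "seq": walk the chronological pointers
            h = tok[1]
            for _ in range(tok[2]):
                hashes.append(h)
                h = repo.next_of(h)
    return hashes


# Two "versions" of a file, already split into chunks (illustrative data).
repo = ChunkRepository()
v1 = [b"a", b"b", b"c", b"d"]
v2 = [b"a", b"b", b"X", b"c", b"d", b"d", b"d"]
encode(v1, repo)            # first version: every chunk is new
tokens = encode(v2, repo)   # second version: mostly sequences and runs
```

In this toy example the seven hashes of the second version collapse into four tokens: a sequence of length 2, one new hash, another sequence of length 2, and a run of length 2. The inserted chunk `X` breaks the contiguity once, which costs one extra token; this is the kind of small loss the abstract mentions.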

Keywords

cryptography; data compression; chronological pointer linking; chunk data; chunk repository; cryptographic hash; data de-duplication; hash sequence compression ratio; hash table; storage archival; dictionaries; joining processes; operating systems; cryptographic hashes compression

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Compression Conference, 2008. DCC 2008

Conference_Location

Snowbird, UT

ISSN

1068-0314

Print_ISBN

978-0-7695-3121-2

Type

conf

DOI

10.1109/DCC.2008.80

Filename

4483332