Title :
Using Inter-file Similarity to Improve Intra-file Compression
Author :
Molfetas, Angelos ; Wirth, Andreas ; Zobel, Justin
Author_Institution :
Dept. of Comput. & Inf. Syst., Univ. of Melbourne, Melbourne, VIC, Australia
Date :
June 27 2014-July 2 2014
Abstract :
In storage systems with vast numbers of files, compression techniques should exploit inter-file similarity while still allowing near-atomic access to individual files. In differential compression, collections of files are compressed by identifying strings that they share, so some files are represented largely by references to strings in other files. In addition, a file in the collection can be (further) compressed by identifying common strings within the file itself. An LZ-style within-file compressor could resolve these references to other files, at the cost of decompression latency but with a possible gain in compression effectiveness. To quantify the compression gain, we experiment with a variety of file collections, from emails to source code, and test against multiple measures. If the LZ scheme honors the inter-file references, then there is only minimal improvement. If the LZ algorithm replaces inter-file references with intra-file references, then up to 3% compression improvement is witnessed for mildly similar files, and over 200% improvement for highly similar files.
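As a rough illustration of the idea described in the abstract, the following toy sketch factorizes a target file greedily into references to another file ("ref"), references to earlier text in the same file ("self"), and literal characters ("lit"). It is not the authors' algorithm: the function names, the min_len threshold, the brute-force search, and the tie-breaking rule that prefers intra-file matches are all assumptions made for this example.

```python
def longest_match(text, start, haystack, limit):
    """Longest prefix of text[start:] that occurs inside haystack[:limit].

    Returns (position, length); length is 0 when nothing matches.
    Brute-force search, intended only for small illustrative inputs.
    """
    best_pos, best_len = -1, 0
    max_len = len(text) - start
    for pos in range(limit):
        length = 0
        while (length < max_len and pos + length < limit
               and haystack[pos + length] == text[start + length]):
            length += 1
        if length > best_len:
            best_pos, best_len = pos, length
    return best_pos, best_len


def factorize(target, reference="", min_len=3):
    """Greedy LZ-style factorization of target (hypothetical scheme).

    Each factor is ("ref", pos, len) for a string in the other file,
    ("self", pos, len) for a string earlier in the same file, or
    ("lit", char). Ties go to the intra-file match (an assumption).
    """
    factors = []
    i = 0
    while i < len(target):
        ref_pos, ref_len = longest_match(target, i, reference, len(reference))
        self_pos, self_len = longest_match(target, i, target, i)
        if max(ref_len, self_len) < min_len:
            factors.append(("lit", target[i]))
            i += 1
        elif self_len >= ref_len:
            factors.append(("self", self_pos, self_len))
            i += self_len
        else:
            factors.append(("ref", ref_pos, ref_len))
            i += ref_len
    return factors


def decode(factors, reference=""):
    """Rebuild the target; "ref" factors need the other file to be available."""
    out = ""
    for factor in factors:
        if factor[0] == "lit":
            out += factor[1]
        elif factor[0] == "self":
            out += out[factor[1]:factor[1] + factor[2]]
        else:
            out += reference[factor[1]:factor[1] + factor[2]]
    return out
```

In this sketch, decoding a "self" factor needs only the file being decoded, whereas a "ref" factor requires access to the other file, which is where the decompression latency mentioned in the abstract would arise.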
Keywords :
data compression; source code (software); storage management; LZ algorithm; LZ-style within-file compressor; compression effectiveness; decompression latency; differential compression; e-mails; file collection compression; interfile similarity; intrafile compression; intrafile references; near-atomic access; shared common strings; source code; storage systems; Compression algorithms; Dictionaries; Electronic mail; Encoding; Encyclopedias; Indexes; Measurement; Differential compression; LZ factorization
Conference_Title :
Big Data (BigData Congress), 2014 IEEE International Congress on
Conference_Location :
Anchorage, AK
Print_ISBN :
978-1-4799-5056-0
DOI :
10.1109/BigData.Congress.2014.35