DocumentCode
2331292
Title
Efficient Similarity Estimation for Systems Exploiting Data Redundancy
Author
Tangwongsan, Kanat ; Pucha, Himabindu ; Andersen, David G. ; Kaminsky, Michael
Author_Institution
Carnegie Mellon Univ., Pittsburgh, PA, USA
fYear
2010
fDate
14-19 March 2010
Firstpage
1
Lastpage
9
Abstract
Many modern systems exploit data redundancy to improve efficiency. These systems split data into chunks, generate identifiers for each of them, and compare the identifiers among other data items to identify duplicate chunks. As a result, chunk size becomes a critical parameter for the efficiency of these systems: it trades potentially improved similarity detection (smaller chunks) with increased overhead to represent more chunks. Unfortunately, the similarity between files increases unpredictably with smaller chunk sizes, even for data of the same type. Existing systems often pick one chunk size that is "good enough" for many cases because they lack efficient techniques to determine the benefits at other chunk sizes. This paper addresses this deficiency via two contributions: (1) we present multi-resolution (MR) handprinting, an application-independent technique that efficiently estimates similarity between data items at different chunk sizes using a compact, multi-size representation of the data; (2) we then evaluate the application of MR handprints to workloads from peer-to-peer, file transfer, and storage systems, demonstrating that the chunk size selection enabled by MR handprints can lead to real improvements over using a fixed chunk size in these systems.
Keywords
data analysis; data structures; chunk size; data redundancy; file transfer; multiresolution handprinting; similarity detection; similarity estimation; storage systems; Communications Society; Motion pictures; Navigation; Peer to peer computing; Protocols; Redundancy; Springs; Streaming media; Web pages; Wide area networks
fLanguage
English
Publisher
IEEE
Conference_Titel
INFOCOM, 2010 Proceedings IEEE
Conference_Location
San Diego, CA
ISSN
0743-166X
Print_ISBN
978-1-4244-5836-3
Type
conf
DOI
10.1109/INFCOM.2010.5461965
Filename
5461965