DocumentCode
2839868
Title
Semantic Data De-duplication for archival storage systems
Author
Liu, Chuanyi ; Ju, Dapeng ; Gu, Yu ; Zhang, Youhui ; Wang, Dongsheng ; Du, David H C
Author_Institution
Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing
fYear
2008
fDate
4-6 Aug. 2008
Firstpage
1
Lastpage
9
Abstract
Archival storage systems hold large amounts of duplicate or redundant data, which consume significant extra equipment and power, lower resource utilization (such as network bandwidth and storage), and impose an extra management burden as the scale increases. Data de-duplication, whose goal is to minimize duplicate data at the inter-file level, has therefore received broad attention in both academia and industry in recent years. This paper proposes semantic data de-duplication (SDD), which uses semantic information in the I/O path of the archival files (such as file type, file format, application hints, and filesystem metadata) to direct the division of a file into semantic chunks (SCs). While the main goal of SDD is to maximally reduce inter-file level duplication, storing variable-sized SCs directly on disk would produce many fragments and a high percentage of random disk accesses, which is very inefficient. An efficient data storage scheme is therefore also designed and implemented: SCs are further packaged into fixed-size Objects, which are the actual storage units on the storage devices, so as to speed up I/O and ease data management. Preliminary experiments demonstrate that SDD further reduces storage space compared with current methods (by 20% to nearly 50%, depending on the dataset) and substantially improves write performance (by about 50%-70% on average).
Keywords
information retrieval systems; records management; storage management; archival storage systems; data storage; redundant data; resources utilization; semantic chunks; semantic data deduplication; semantic information; Application software; Bandwidth; Data engineering; Energy consumption; File servers; Fingerprint recognition; Head; Network servers; Packaging; Power engineering and energy;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Systems Architecture Conference, 2008. ACSAC 2008. 13th Asia-Pacific
Conference_Location
Hsinchu
Print_ISBN
978-1-4244-2682-9
Electronic_ISBN
978-1-4244-2683-6
Type
conf
DOI
10.1109/APCSAC.2008.4625441
Filename
4625441