• DocumentCode
    2839868
  • Title

    Semantic Data De-duplication for archival storage systems

  • Author

    Liu, Chuanyi ; Ju, Dapeng ; Gu, Yu ; Zhang, Youhui ; Wang, Dongsheng ; Du, David H C

  • Author_Institution
    Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing
  • fYear
    2008
  • fDate
    4-6 Aug. 2008
  • Firstpage
    1
  • Lastpage
    9
  • Abstract
    Archival storage systems hold a large amount of duplicate or redundant data, which consumes significant extra equipment and power, lowers resource utilization (such as network bandwidth and storage), and imposes an extra management burden as the scale increases. Data de-duplication, whose goal is to minimize duplicate data at the inter-file level, has therefore received broad attention in both academia and industry in recent years. This paper proposes semantic data de-duplication (SDD), which uses semantic information in the I/O path of archival files (such as file type, file format, application hints and filesystem metadata) to direct the division of a file into semantic chunks (SCs). While the main goal of SDD is to maximally reduce inter-file level duplication, directly storing variable-sized SCs on disk would produce many fragments and a high percentage of random disk accesses, which is very inefficient. An efficient data storage scheme is therefore also designed and implemented: SCs are further packaged into fixed-sized Objects, which are the actual storage units in the storage devices, so as to speed up I/O performance and ease data management. Preliminary experiments demonstrate that SDD further reduces storage space compared with current methods (by 20% to nearly 50%, depending on the dataset) and largely improves write performance (by about 50%-70% on average).
  • Keywords
    information retrieval systems; records management; storage management; archival storage systems; data storage; redundant data; resources utilization; semantic chunks; semantic data deduplication; semantic information; Application software; Bandwidth; Data engineering; Energy consumption; File servers; Fingerprint recognition; Head; Network servers; Packaging; Power engineering and energy;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Systems Architecture Conference, 2008. ACSAC 2008. 13th Asia-Pacific
  • Conference_Location
    Hsinchu
  • Print_ISBN
    978-1-4244-2682-9
  • Electronic_ISBN
    978-1-4244-2683-6
  • Type

    conf

  • DOI
    10.1109/APCSAC.2008.4625441
  • Filename
    4625441