• DocumentCode
    3459783
  • Title
    File Deduplication with Cloud Storage File System
  • Author
    Chan-I Ku; Guo-Heng Luo; Che-Pin Chang; Shyan-Ming Yuan

  • Author_Institution
    Degree Program of ECE & CS Colleges, Nat. Chiao Tung Univ., Hsinchu, Taiwan
  • fYear
    2013
  • fDate
    3-5 Dec. 2013
  • Firstpage
    280
  • Lastpage
    287
  • Abstract
    The Hadoop Distributed File System (HDFS) is widely used to store huge volumes of data, but it provides no mechanism for handling duplicate files. In this study, a middle-layer file system built on an HBase virtual architecture performs file deduplication in HDFS. Two architectures are proposed for different reliability requirements: RFD-HDFS (Reliable File Deduplicated HDFS), which permits no errors, and FD-HDFS (File Deduplicated HDFS), which tolerates a very small error rate. Beyond the savings in storage space, the marginal benefits of deduplication are also explored. Suppose a popular video is uploaded to HDFS by one million users: with Hadoop's default replication factor of three, three million physical copies would be stored, wasting an enormous amount of disk space; removing the duplicates in the cloud leaves only three stored copies, achieving 100% elimination of the redundant files. The experimental platform is a cloud-based documentation system, similar to EndNote's cloud synchronization, used to simulate the clustering effect of a massive database when researchers synchronize their data with cloud storage.
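    The single-instance storage idea at the core of RFD-HDFS/FD-HDFS can be sketched in a few lines of Java. This is an illustration only, not the authors' implementation: the in-memory HashMap stands in for the HBase index table, the class and method names are hypothetical, and the actual HDFS/HBase calls are elided.

        // Minimal sketch of hash-based single-instance storage (all names hypothetical).
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.security.MessageDigest;
        import java.util.HashMap;
        import java.util.Map;

        public class DedupSketch {
            // Content digest -> stored file id; in the paper's design this
            // index would live in an HBase table rather than in memory.
            private final Map<String, String> index = new HashMap<>();

            // Returns true if the content was already stored, i.e. a duplicate.
            public boolean upload(Path file) throws Exception {
                byte[] data = Files.readAllBytes(file);
                byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
                StringBuilder hex = new StringBuilder();
                for (byte b : digest) hex.append(String.format("%02x", b));
                String key = hex.toString();
                if (index.containsKey(key)) {
                    return true;  // duplicate: record a reference, write nothing new
                }
                index.put(key, file.getFileName().toString());
                // A real implementation would now write the blocks to HDFS once;
                // HDFS replication (default factor 3) keeps 3 physical copies no
                // matter how many users upload identical content.
                return false;
            }
        }

    Under this scheme the abstract's example works out directly: one million uploads of the same video produce one index entry and one stored file, so the replicated footprint is 3 copies instead of 3 million.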
  • Keywords
    cloud computing; data handling; storage management; EndNote Cloud; HBASE virtual architecture; Hadoop distributed file system; Hadoop replication; RFD-HDFS; cloud based documentation system; cloud storage file system; cluster effect; disk space; duplicate file removal; duplicate files handling mechanism; file deduplication; huge data storage problem; massive database; middle layer file system; reliable file deduplicated HDFS; repeat removal; requirement reliability; space complexity; Bandwidth; Cloud computing; Computer architecture; File systems; Google; Reliability; Writing; Cloud Computing; Data Deduplication; HDFS; Single instance storage
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    2013 IEEE 16th International Conference on Computational Science and Engineering (CSE)
  • Conference_Location
    Sydney, NSW
  • Type
    conf
  • DOI
    10.1109/CSE.2013.52
  • Filename
    6755230