  • DocumentCode
    416101
  • Title
    Improved file synchronization techniques for maintaining large replicated collections over slow networks
  • Author
    Suel, Torsten; Noel, Patrick; Trendafilov, Dimitre
  • Author_Institution
    CIS Dept., Polytech. Univ., Brooklyn, NY, USA
  • fYear
    2004
  • fDate
    30 March-2 April 2004
  • Firstpage
    153
  • Lastpage
    164
  • Abstract
    We study the problem of maintaining large replicated collections of files or documents in a distributed environment with limited bandwidth. This problem arises in a number of important applications, such as synchronization of data between accounts or devices, content distribution and Web caching networks, Web site mirroring, storage networks, and large-scale Web search and mining. At the core of the problem lies the following challenge, called the file synchronization problem: given two versions of a file on different machines, say an outdated and a current one, how can we update the outdated version with minimum communication cost, by exploiting the significant similarity between the versions? While a popular open source tool for this problem called rsync is used in hundreds of thousands of installations, there have been only very few attempts to improve upon this tool in practice. We propose a framework for remote file synchronization and describe several new techniques that result in significant bandwidth savings. Our focus is on applications where very large collections have to be maintained over slow connections. We show that a prototype implementation of our framework and techniques achieves significant improvements over rsync. As an example application, we focus on the efficient synchronization of very large Web page collections for the purpose of search, mining, and content distribution.
  • Keywords
    Internet; bandwidth allocation; cache storage; replicated databases; synchronisation; very large databases; Web caching networks; Web mining; Web page collection; Web search; Web site mirroring; content distribution; distributed environment; file synchronization technique; large replicated document collection; large replicated file collection; limited bandwidth; open source tool; slow networks; storage networks; Bandwidth; Computational Intelligence Society; Costs; Engineering profession; Large-scale systems; Prototypes; Search engines; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2004. Proceedings. 20th International Conference on
  • ISSN
    1063-6382
  • Print_ISBN
    0-7695-2065-0
  • Type
    conf
  • DOI
    10.1109/ICDE.2004.1319992
  • Filename
    1319992