• DocumentCode
    180720
  • Title
    The case for sampling on very large file systems
  • Author
    Goldberg, George ; Harnik, Danny ; Sotnikov, Dmitry
  • Author_Institution
    IBM Res., Haifa, Israel
  • fYear
    2014
  • fDate
    2-6 June 2014
  • Firstpage
    1
  • Lastpage
    11
  • Abstract
    Sampling has long been a prominent tool in statistics and analytics, first and foremost when very large amounts of data are involved. In the realm of very large file systems (and hierarchical data stores in general), however, sampling has mostly been ignored, and for several good reasons. Mainly, running sampling in such an environment introduces technical challenges that make the entire sampling process non-beneficial. In this work we demonstrate that there are cases for which sampling is very worthwhile in very large file systems. We address this topic in two aspects: (a) the technical side, where we design and implement solutions for efficient weighted sampling that is also distributed, one-pass and addresses multiple efficiency aspects; and (b) the usability aspect, in which we demonstrate several use-cases in which weighted sampling over large file systems is extremely beneficial. In particular, we show use-cases regarding estimation of compression ratios, testing and auditing, and offline collection of statistics on very large data stores.
  • Keywords
    data compression; file organisation; sampling methods; compression ratios; efficiency aspect; offline statistics collection; sampling process; usability aspect; very large file systems; weighted sampling; Accuracy; Algorithm design and analysis; Estimation; File systems; Radiation detectors; Speech; Testing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Mass Storage Systems and Technologies (MSST), 2014 30th Symposium on
  • Conference_Location
    Santa Clara, CA
  • Type
    conf
  • DOI
    10.1109/MSST.2014.6855542
  • Filename
    6855542