• DocumentCode
    2873799
  • Title
    Protecting against rare event failures in archival systems
  • Author
    Wildani, Avani ; Schwarz, Thomas J E ; Miller, Ethan L. ; Long, Darrell D E
  • Author_Institution
    Storage Syst. Res. Center, Univ. of California, Santa Cruz, CA, USA
  • fYear
    2009
  • fDate
    21-23 Sept. 2009
  • Firstpage
    1
  • Lastpage
    11
  • Abstract
    Digital archives are growing rapidly, necessitating stronger reliability measures than RAID to avoid data loss from device failure. Mirroring, a popular solution, is too expensive over time. We present a compromise solution that uses multi-level redundancy coding to reduce the probability of data loss from multiple simultaneous device failures. This approach handles small-scale failures of one or two devices efficiently while still allowing the system to survive rare-event, larger-scale failures of four or more devices. In our approach, each disk is split into a set of fixed-size disklets, which are used to construct reliability stripes. To protect against rare-event failures, reliability stripes are grouped into larger super-groups, each of which has a corresponding super-parity; super-parity is used to recover data only when disk failures overwhelm the redundancy in a single reliability stripe. Super-parity can be stored on a variety of devices, such as NV-RAM and always-on disks, to offset write bottlenecks while still keeping the number of active devices low. Our calculations of failure probabilities show that adding super-parity allows our system to absorb many more disk failures without data loss. Through discrete event simulation, we found that adding super-groups has a significant impact on mean time to data loss and that rebuilds are slow but not unmanageable. Finally, we showed that robustness against rare events can be achieved for a fraction of total system cost.
  • Keywords
    discrete event simulation; information retrieval systems; records management; NV-RAM; archival systems; digital archives; rare event failure protection; rare event failures; super-parity; Cooling; Costs; Data engineering; Insurance; Power system protection; Power system reliability; Redundancy; Reliability engineering; Robustness; Surge protection
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Modeling, Analysis & Simulation of Computer and Telecommunication Systems, 2009. MASCOTS '09. IEEE International Symposium on
  • Conference_Location
    London
  • ISSN
    1526-7539
  • Print_ISBN
    978-1-4244-4927-9
  • Electronic_ISBN
    1526-7539
  • Type
    conf
  • DOI
    10.1109/MASCOT.2009.5366825
  • Filename
    5366825