• DocumentCode
    2923088
  • Title
    Latent fault detection in large scale services
  • Author
    Gabel, Moshe ; Schuster, Assaf ; Gilad-Bachrach, Ran ; Bjørner, Nikolaj
  • Author_Institution
    Dept. of Comput. Sci., Technion - Israel Inst. of Technol., Haifa, Israel
  • fYear
    2012
  • fDate
    25-28 June 2012
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    Unexpected machine failures, with their resulting service outages and data loss, pose challenges to datacenter management. Existing failure detection techniques rely on domain knowledge, precious (often unavailable) training data, textual console logs, or intrusive service modifications. We hypothesize that many machine failures are not a result of abrupt changes but rather a result of a long period of degraded performance. This is confirmed in our experiments, in which over 20% of machine failures were preceded by such latent faults. We propose a proactive approach for failure prevention. We present a novel framework for statistical latent fault detection using only ordinary machine counters collected as standard practice. We demonstrate three detection methods within this framework. Derived tests are domain-independent and unsupervised, require neither background information nor tuning, and scale to very large services. We prove strong guarantees on the false positive rates of our tests.
  • Keywords
    Web services; learning (artificial intelligence); software fault tolerance; statistical analysis; data loss; datacenter management; distributed computing; failure detection techniques; failure prevention; large scale services; service outages; statistical latent fault detection; statistical learning; unexpected machine failures; Fault detection; Hardware; Monitoring; Radiation detectors; Support vector machines; Tuning; Vectors
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks (DSN), 2012 42nd Annual IEEE/IFIP International Conference on
  • Conference_Location
    Boston, MA
  • ISSN
    1530-0889
  • Print_ISBN
    978-1-4673-1624-8
  • Electronic_ISBN
    1530-0889
  • Type
    conf
  • DOI
    10.1109/DSN.2012.6263932
  • Filename
    6263932