• DocumentCode
    2845627
  • Title
    Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study
  • Author
    Gu, Jiexing ; Zheng, Ziming ; Lan, Zhiling ; White, John ; Hocks, Eva ; Park, Byung-Hoon
  • Author_Institution
    Illinois Inst. of Technol., Chicago, IL
  • fYear
    2008
  • fDate
    9-12 Sept. 2008
  • Firstpage
    157
  • Lastpage
    164
  • Abstract
    Despite great efforts on the design of ultra-reliable components, the increase of system size and complexity has outpaced the improvement of component reliability. As a result, fault management becomes crucial in high performance computing. The advance of fault management relies on effective failure prediction. Despite years of research on failure prediction, it remains an open problem, especially in large-scale systems. In this paper, we address the problem by presenting a dynamic meta-learning prediction engine. It extends our previous work by exploring dynamic training, testing and prediction. Here, the "dynamic" part is from two perspectives: one is to continuously increase the training set during the system operation; and the other is to dynamically modify the rules of failure patterns by tracing prediction accuracy at runtime. Our case study indicates that the proposed predictor is promising by being capable of capturing more than 70% of failures, with the false alarm rate less than 10%.
  • Keywords
    fault tolerant computing; large-scale systems; learning (artificial intelligence); component reliability; dynamic meta-learning prediction; dynamic training; failure prediction; fault management; high performance computing; Accuracy; Checkpointing; Data mining; Engines; Fault tolerance; Predictive models; Resilience; Runtime; Blue Gene/L; meta-learning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel Processing, 2008. ICPP '08. 37th International Conference on
  • Conference_Location
    Portland, OR
  • ISSN
    0190-3918
  • Print_ISBN
    978-0-7695-3374-2
  • Electronic_ISBN
    0190-3918
  • Type
    conf
  • DOI
    10.1109/ICPP.2008.17
  • Filename
    4625845