• DocumentCode
    3448288
  • Title

    Accurate fault prediction of BlueGene/P RAS logs via geometric reduction

  • Author

    Thompson, Joshua ; Dreisigmeyer, David W. ; Jones, Terry ; Kirby, Michael ; Ladd, Joshua

  • Author_Institution
    Dept. of Math., Colorado State Univ., Fort Collins, CO, USA
  • fYear
    2010
  • fDate
    June 28 2010-July 1 2010
  • Firstpage
    8
  • Lastpage
    14
  • Abstract
    This investigation presents two distinct and novel approaches for predicting system failures on Oak Ridge National Laboratory's Blue Gene/P supercomputer. Each technique uses raw numeric and textual subsets of large data logs of physical system information, such as fan speeds and CPU temperatures. These data are used to develop models of the system capable of sensing anomalies, or deviations from nominal behavior. Each algorithm predicted anomalies reported in the event log in advance of their occurrence, and one algorithm did so without false positives. Both algorithms also predicted an anomaly that did not appear in the event log. The system administrator later confirmed that this fault, missing from the log but predicted by both algorithms, had in fact occurred.
  • Keywords
    fault diagnosis; mainframes; system recovery; BlueGene/P; CPU temperatures; RAS logs; data logs; fan speeds; fault prediction; geometric reduction; numeric subsets; physical system information; supercomputer; system administrator; system failures prediction; textual subsets; Hardware; High performance computing; Information analysis; Laboratories; Machine learning; Mathematics; Prediction algorithms; Supercomputers; Switches; Telecommunication switching; MSET; NMF; fault prediction; high performance computing; resiliency;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks Workshops (DSN-W), 2010 International Conference on
  • Conference_Location
    Chicago, IL
  • Print_ISBN
    978-1-4244-7729-6
  • Electronic_ISBN
    978-1-4244-7728-9
  • Type

    conf

  • DOI
    10.1109/DSNW.2010.5542626
  • Filename
    5542626