• DocumentCode
    3515680
  • Title
    Effectiveness of machine checks for error diagnostics
  • Author
    Pandit, Nikhil ; Kalbarczyk, Zbigniew ; Iyer, Ravishankar K.
  • Author_Institution
    Center for Reliable & High-Performance Computing, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  • fYear
    2009
  • fDate
    June 29 2009-July 2 2009
  • Firstpage
    578
  • Lastpage
    583
  • Abstract
    Machine Check Architecture (MCA) is a processor-internal architecture subsystem that detects and logs correctable and uncorrectable errors in the data or control paths of each CPU core and the Northbridge. These errors include parity errors associated with caches and TLBs, ECC errors associated with caches and DRAM, and system bus errors. This paper reports on an experimental study that (i) monitors a computing cluster for machine checks and uses the data to identify patterns that can be employed for error diagnostics, and (ii) introduces faults into the machine to understand the resulting machine checks and correlate them with relevant performance metrics.
  • Keywords
    cache storage; data flow analysis; fault diagnosis; CPU core; DRAM; ECC error; Northbridge; TLB; bus error; computing cluster monitoring; control path; data path; machine check architecture; parity error detection; processor internal architecture subsystem; uncorrectable error diagnostics; Availability; Cloud computing; Computer architecture; Delay; Error correction; Error correction codes; Hardware; Kernel; Linux; Random access memory;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems & Networks, 2009. DSN '09. IEEE/IFIP International Conference on
  • Conference_Location
    Lisbon
  • Print_ISBN
    978-1-4244-4422-9
  • Electronic_ISBN
    978-1-4244-4421-2
  • Type
    conf
  • DOI
    10.1109/DSN.2009.5270290
  • Filename
    5270290