DocumentCode :
3376688
Title :
Improving system health monitoring with better error processing
Author :
Kain, Brian ; Ozarin, Nathaniel
Author_Institution :
Omnicon Group Inc., Hauppauge, NY, USA
fYear :
2011
fDate :
20-23 June 2011
Firstpage :
1
Lastpage :
8
Abstract :
To help identify unexpected software events and impending hardware failures, developers typically incorporate error-checking code in their software to detect and report them. Unfortunately, implementing checks with reporting capabilities that give the most useful results comes at a price. Such capabilities should report the exact nature of impending failures and additionally limit reporting to only the first occurrence of an error to prevent flooding the error log with the same message. They must report when an existing error or fault is replaced by another error of a different nature or value. They must recognize what makes occasional faults allowable and they must reset themselves upon recovery from a reported failure so the checking process can begin anew. They must also report recovery from previously reported failures that appear to have healed themselves. Since the price associated with providing all these features is limited by budget and schedule, system reliability and health monitoring often suffer. However, there are practical techniques that can simplify the effort associated with incorporating such error detection and reporting. When done properly, they can greatly improve system reliability and health monitoring by finding potentially hidden problems during development, and they can also greatly improve system maintainability by providing concise running descriptions of problems when things go wrong, particularly when minor errors might otherwise go unnoticed. In addition, preventative maintenance can be greatly aided by applying error detection techniques to performance monitoring in the absence of errors. Many of the techniques described in this paper take advantage of simple classes to do bookkeeping tasks such as updating and tracking statistical analysis of errors and error reporting. The paper highlights several of these classes and gives examples from actual applications.
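The reporting requirements listed in the abstract (report only the first occurrence, report a change in the error's value, tolerate occasional faults, reset on recovery, and report the recovery itself) can be illustrated with a small bookkeeping class. The sketch below is not taken from the paper: the class name `ErrorTracker`, its `tolerance` parameter, the message wording, and the sample readings are all hypothetical, chosen only to show one way such a class might behave.

```python
class ErrorTracker:
    """Hypothetical bookkeeping class; not the authors' implementation."""

    def __init__(self, name, tolerance=0, report=print):
        self.name = name
        self.tolerance = tolerance    # consecutive faults allowed before reporting (assumed policy)
        self.report = report          # callable that receives each log message
        self.total_faults = 0         # running count, usable for simple statistics
        self._consecutive = 0         # faults seen since the last good check
        self._active = False          # True once a fault has been reported
        self._reported_value = None   # value attached to the last reported fault

    def check(self, ok, value=None):
        """Record one check result and report only state changes."""
        if ok:
            if self._active:
                self.report(f"{self.name}: recovered")
            # Reset so the checking process can begin anew.
            self._active = False
            self._reported_value = None
            self._consecutive = 0
            return
        self.total_faults += 1
        self._consecutive += 1
        if self._consecutive <= self.tolerance:
            return  # occasional fault, still within the allowed limit
        if not self._active:
            # First reportable occurrence: log it once.
            self._active = True
            self._reported_value = value
            self.report(f"{self.name}: fault ({value})")
        elif value != self._reported_value:
            # Same check is still failing, but the nature of the error changed.
            self.report(f"{self.name}: fault changed ({self._reported_value} -> {value})")
            self._reported_value = value
        # Otherwise the same fault repeats: stay silent to avoid flooding the log.


# Usage with made-up readings from an imagined temperature sensor.
log = []
sensor = ErrorTracker("temp_sensor", tolerance=1, report=log.append)
readings = [
    (False, "timeout"),    # tolerated: a single isolated fault
    (True, None),          # good reading resets the consecutive count
    (False, "timeout"),    # tolerated again
    (False, "timeout"),    # second consecutive fault: reported
    (False, "timeout"),    # repeat of a reported fault: suppressed
    (False, "overrange"),  # different value: reported as a change
    (True, None),          # recovery: reported, tracker resets
]
for ok, value in readings:
    sensor.check(ok, value)
```

With these readings the log receives three messages (the first reported fault, the change of value, and the recovery), while `total_faults` still counts all five failed checks, so the suppressed repeats remain available for statistics.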
Keywords :
error analysis; software fault tolerance; statistical analysis; system recovery; allowable faults; error checking code; error detection; error processing; error reporting; hardware failures; statistical error analysis; system health monitoring; system reliability; unexpected software events; Hardware; Maintenance engineering; Monitoring; Reliability; Software; Temperature measurement; Temperature sensors; BIT; software health; software monitoring;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Prognostics and Health Management (PHM), 2011 IEEE Conference on
Conference_Location :
Montreal, QC
Print_ISBN :
978-1-4244-9828-4
Type :
conf
DOI :
10.1109/ICPHM.2011.6024322
Filename :
6024322