Author_Institution :
Microsoft Online Ads Platform, Microsoft Corp., Redmond, WA, USA
Abstract :
System logs are an important tool in studying the conditions (e.g., environment misconfigurations, resource status, erroneous user input) that cause failures. However, production system logs are complex, verbose, and lack structural stability over time. These traits make them hard to use, and make solutions that rely on them susceptible to high maintenance costs. Additionally, logs record failures after they occur: by the time logs are investigated, users have already experienced the failures' consequences. To detect the environment conditions that are correlated with failures without dealing with the complexities of processing production logs, and to prevent failure-causing conditions from occurring before the system goes live, this research suggests a three-step methodology: (i) using synthetic transactions, i.e., simplified workloads that emulate user behavior, in pre-production environments, (ii) recording the results of executing these transactions in logs that are compact, simple to analyze, stable over time, and specifically tailored to the fault metrics of interest, and (iii) mining these specialized logs to understand the conditions that correlate with failures. This allows system administrators to configure the system to prevent these conditions from happening. We evaluate the effectiveness of this approach by replicating the behavior of a service used in production at Microsoft, and testing the ability to predict failures using a synthetic workload on a production trace of 650 million events. The synthetic prediction system is able to predict 91% of real production failures using 50-fold fewer transactions and logs that are 10,000-fold more compact than their production counterparts.
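As a rough illustration of step (iii), the sketch below ranks environment conditions by how often synthetic transactions fail under them. It is not the authors' mining method, which the abstract does not describe: the log layout, the field names (`config`, `disk`, `failed`), the sample records, and the per-condition failure-rate ranking are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical compact log: one record per synthetic transaction.
# Field names and values are illustrative assumptions, not taken from the paper.
SYNTHETIC_LOG = [
    {"config": "v1", "disk": "ok",   "failed": False},
    {"config": "v1", "disk": "full", "failed": True},
    {"config": "v2", "disk": "ok",   "failed": False},
    {"config": "v2", "disk": "full", "failed": True},
    {"config": "v1", "disk": "ok",   "failed": False},
    {"config": "v2", "disk": "ok",   "failed": True},
]


def failure_rates(log, outcome_key="failed"):
    """Return {(field, value): (failure_rate, support)} for each condition."""
    counts = defaultdict(lambda: [0, 0])  # (field, value) -> [failures, total]
    for record in log:
        for field, value in record.items():
            if field == outcome_key:
                continue  # the outcome itself is not an environment condition
            tally = counts[(field, value)]
            tally[0] += int(record[outcome_key])
            tally[1] += 1
    return {cond: (fails / total, total) for cond, (fails, total) in counts.items()}


def rank_conditions(log, outcome_key="failed"):
    """Sort conditions by failure rate, highest first."""
    rates = failure_rates(log, outcome_key)
    return sorted(rates.items(), key=lambda item: item[1][0], reverse=True)


if __name__ == "__main__":
    for (field, value), (rate, support) in rank_conditions(SYNTHETIC_LOG):
        print(f"{field}={value}: failure rate {rate:.2f} over {support} transactions")
```

Because each record carries only the fields relevant to the fault metric, a single pass over the log is enough here. A real analysis would also need to account for sample size and for interactions between conditions.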
Keywords :
data mining; fault diagnosis; transaction processing; failure avoidance; fault metric; fault prediction; maintenance cost; online service; structural stability; synthetic prediction system; synthetic transaction; system log; user behavior; Production; Reliability; Servers; Software; Time factors; Time measurement; Failure prediction; data analysis; synthetic transactions; system logs