Title :
Use of Expert system and Data Analysis Technologies in automation of error detection, diagnosis and recovery for ATLAS Trigger-DAQ Control framework
Author :
Kazarov, A. ; Radu, A.C. ; Magnoni, Luca ; Miotto, G.L.
Author_Institution :
Petersburg Nucl. Phys. Inst., Kurchatov NPI, Gatchina, Russia
Abstract :
The Trigger and Data Acquisition (TDAQ) system of the ATLAS experiment at the LHC at CERN is a very complex distributed computing system, composed of O(10000) applications running on a farm of commodity CPUs. The system has been designed and developed by dozens of software engineers and physicists since the end of the 1990s, and it will be maintained in operational mode during the lifetime of the experiment. The TDAQ system is controlled by the Control framework, which includes a set of software components and tools used for system configuration, distributed process handling, synchronization of Run Control state transitions, etc. The huge flow of operational monitoring data produced is constantly monitored by operators and experts in order to detect problems or misbehavior. Given the scale of the system and the rates of data to be analyzed, automating the Control framework functionality in the areas of operational monitoring, system verification, error detection and recovery is a strong requirement. The paper describes the requirements, technology choices, high-level design and some implementation aspects of advanced Control tools based on knowledge-base technologies. The main aim of these tools is to store and reuse developers' expertise and operational knowledge in order to help TDAQ operators control the system with maximum efficiency during the lifetime of the experiment.
Keywords :
computerised monitoring; data acquisition; data analysis; distributed processing; error detection; expert systems; high energy physics instrumentation computing; multiprocessing systems; nuclear electronics; physical instrumentation control; position sensitive particle detectors; synchronisation; trigger circuits; ATLAS experiment; ATLAS trigger-DAQ control framework; CERN; CPU; LHC; TDAQ operators; TDAQ system; advanced control tools; complex distributed computing system; data analysis; distributed processes handling; error detection; error recovery; expert system; high-level design; knowledge-base technologies; operational monitoring data; run control state transitions; software components; Automation; Control systems; Engines; Expert systems; Monitoring; Software;
Conference_Title :
Real Time Conference (RT), 2012 18th IEEE-NPSS
Conference_Location :
Berkeley, CA
Print_ISBN :
978-1-4673-1082-6
DOI :
10.1109/RTC.2012.6418364