Title :
Monitoring High Performance Networks in Large-scale Clusters
Author_Institution :
Commissariat à l'Énergie Atomique, Paris
Abstract :
The number of large-scale clusters is rising. They are integrated into grids or become key components of larger structures. As more users and projects rely on HPC clusters, high availability and security become requirements for fast-growing adoption and use. In this paper, we focus on high performance networks, on which all HPC clusters are built. We demonstrate that classical instrumentation is inefficient in HPC environments: it either does not scale or causes a significant loss of performance. Based on this observation, we highlight cluster properties: nodes have assigned roles and are coupled at various levels. Moreover, we study the main characteristics of resource usage for each type of node and propose an instrumentation that can be effectively deployed. The result is a set of fine-grained mechanisms adapted to the system architecture and performance constraints. Relevant information is collected over time. Two properties are verified online and dynamically: coherency and containment. Each induces a type of verification, and both aim at reducing the recovery time from failure and the security risk of a whole cluster. We illustrate our methodology on the QsNet network (K. Magontis et al., 2001) and provide a way to increase the safety of high performance networks and clusters.
Keywords :
program verification; resource allocation; security of data; system recovery; telecommunication security; workstation clusters; HPC clusters; RFC clusters; availability requirements; coherency verification; containment verification; failure risk; high performance networks; large-scale clusters; model checking; performance constraints; recovery time reduction; resource usage; security requirements; security risk; system architecture; Computer industry; Condition monitoring; Grid computing; High performance computing; Instruments; Intelligent networks; Kernel; Large-scale systems; Message passing; Safety
Conference_Title :
Cluster Computing and the Grid, 2006. CCGRID 06. Sixth IEEE International Symposium on
Conference_Location :
Singapore
Print_ISBN :
0-7695-2585-7
DOI :
10.1109/CCGRID.2006.1630927