Speaker
Description
We present a short introduction to the ALICE Analysis Facility and WSCLAB projects in our datacenter. After introducing the ALICE Analysis Facility, we discuss the corresponding IT infrastructure design and capacity in detail.
Because hardware components are aging, monitoring is important for keeping the infrastructure healthy and prolonging cluster lifetime.
We selected a monitoring solution suited to these environments. We created server types (worker node, storage) and defined the corresponding entities in our monitoring system. Some monitoring checks are basic, while others are advanced or more complex, so that we know the most important details in near real time.
We created a separate VLAN for monitoring in order to minimize interference with production worker node traffic.
Power consumption and electricity bills are important factors nowadays. To see the details, we use a visualization solution that shows power usage statistics per rack.
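As an illustration of per-rack power statistics, the following minimal sketch sums per-node power readings into rack totals. The rack and node names, the reading format, and the function `rack_power_totals` are hypothetical and chosen only for this example. They do not describe the visualization solution actually deployed.

```python
from collections import defaultdict


def rack_power_totals(readings):
    """Sum per-node power readings (in watts) into per-rack totals.

    `readings` is an iterable of (rack, node, watts) tuples; this input
    format is an assumption made for the example.
    """
    totals = defaultdict(float)
    for rack, _node, watts in readings:
        totals[rack] += watts
    return dict(totals)


# Hypothetical sample readings: (rack, node, watts)
sample = [
    ("rack-01", "wn-001", 350.0),
    ("rack-01", "wn-002", 410.0),
    ("rack-02", "st-001", 275.5),
]

for rack, watts in sorted(rack_power_totals(sample).items()):
    print(f"{rack}: {watts:.1f} W")
```

Totals of this kind could then be stored as time series and plotted per rack.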
To scale up the monitoring system efficiently, we use automation tools for node preparation and installation.
Historical data is also very valuable, so we integrated a database solution into our monitoring workflow.
Our roadmap for future developments includes: continuous disk tests (S.M.A.R.T.); scheduled backups of monitoring data; proper alerting based on pre-defined warning and critical levels; iterative, time-based optimization of check scheduling; HTCondor service monitoring; monitoring of GPU RAM, GPU utilization, and temperature for GPULAB; and smart alerting for complex cases.
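Alerting based on pre-defined warning and critical levels can be sketched as a simple threshold classification. The function `classify`, the state names, and the example thresholds below are assumptions for illustration, not the configuration of our monitoring system.

```python
def classify(value, warning, critical):
    """Map a measured value to an alert state using two thresholds.

    Assumes that higher values are worse (e.g. temperature or utilization)
    and that warning <= critical.
    """
    if warning > critical:
        raise ValueError("warning threshold must not exceed critical threshold")
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


# Hypothetical GPU temperature thresholds in degrees Celsius
WARNING_C = 70
CRITICAL_C = 85

for temperature in (65, 72, 90):
    print(temperature, classify(temperature, WARNING_C, CRITICAL_C))
```

Smart alerting for complex cases would presumably combine several such states, but that logic is outside the scope of this sketch.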