Description
We present a short introduction to the ALICE Analysis Facility and WSCLAB (Wigner Scientific Computing Laboratory) projects in our datacenter, and show key operational and visibility details of how we monitor them. The hardware components are aging, so monitoring is an important way to keep the infrastructure healthy and to prolong the lifetime of the cluster.
We created server types (worker node, storage) and defined the corresponding entities in our monitoring system. The checks range from basic ones to advanced and more complex ones, so that we know the most important details in almost real time.
For power consumption, we use a visualization solution that shows power usage statistics per rack.
We used the Ansible automation tool to scale up the monitoring system.
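As a rough illustration of per-rack power statistics, the sketch below sums per-server power readings into one total per rack. The rack and host names, the reading format, and the function `power_per_rack` are hypothetical; the abstract does not describe how the readings are actually collected or visualized.

```python
from collections import defaultdict


def power_per_rack(readings):
    """Sum per-server power readings (in watts) into per-rack totals.

    `readings` is an iterable of (rack, host, watts) tuples; this layout
    is an assumption made for the example.
    """
    totals = defaultdict(float)
    for rack, _host, watts in readings:
        totals[rack] += watts
    return dict(totals)


# Hypothetical sample readings for two racks.
readings = [
    ("rack-01", "wn-001", 310.0),
    ("rack-01", "wn-002", 295.5),
    ("rack-02", "st-001", 420.0),
]
print(power_per_rack(readings))
```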
Historical data is also very valuable, so we integrated a database solution (InfluxDB) into our monitoring workflow.
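One way a check result could be prepared for InfluxDB is to format it in the InfluxDB line protocol (measurement, tags, fields, timestamp). The following is a minimal sketch under stated assumptions: the measurement name, tag names, and the helper `to_line_protocol` are hypothetical, only float fields are handled, and the abstract does not say how data is actually written to the database.

```python
TAG_SPECIALS = ", ="        # characters escaped in tag/field keys and tag values
MEASUREMENT_SPECIALS = ", "  # characters escaped in the measurement name


def _escape(text, specials):
    """Backslash-escape each special character in `text`."""
    for ch in specials:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as an InfluxDB line protocol string (float fields only)."""
    tag_part = "".join(
        "," + _escape(key, TAG_SPECIALS) + "=" + _escape(value, TAG_SPECIALS)
        for key, value in sorted(tags.items())
    )
    field_part = ",".join(
        _escape(key, TAG_SPECIALS) + "=" + str(float(value))
        for key, value in fields.items()
    )
    name = _escape(measurement, MEASUREMENT_SPECIALS)
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical disk temperature sample from one worker node.
line = to_line_protocol(
    "disk_temp",
    {"host": "wn-001", "rack": "rack-01"},
    {"celsius": 38.0},
    1700000000000000000,
)
print(line)
```

Sorting the tags by key keeps the output deterministic, which makes stored lines easier to compare and test.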
Current milestones and roadmap for monitoring: continuous disk tests (S.M.A.R.T.), smart alerting for complex cases, scheduled backup for monitoring data, proper alerting based on pre-defined warning and critical levels, iterative time-based optimization for running checks, HTCondor service monitoring.
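Alerting on pre-defined warning and critical levels can be sketched as a simple threshold classification. The example assumes the widely used plugin convention of 0 for OK, 1 for WARNING, and 2 for CRITICAL; the thresholds, the sample values, and the function `classify` are hypothetical and are not taken from the monitoring system described here.

```python
# Status codes following the common monitoring-plugin convention (an assumption).
OK, WARNING, CRITICAL = 0, 1, 2


def classify(value, warning, critical):
    """Map a measured value to a status, where higher values are worse."""
    if warning > critical:
        raise ValueError("warning level must not exceed critical level")
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


# Hypothetical disk temperatures in degrees Celsius, with levels 45 and 55.
for temperature in (35.0, 47.5, 60.0):
    print(temperature, classify(temperature, warning=45.0, critical=55.0))
```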