Centralised Monitoring and Logging
A centralised monitoring and logging solution was designed to eliminate the need for many separate dashboards and services spread across different datacentres. It aggregates data, centralises monitoring dashboards and alerting, and secures communication with monitoring systems following zero-trust principles. All SKAO datacentres used for testing, staging, integration or production have been fully integrated into this solution.
Developer-friendly Dashboards
To address the evolving needs of developers, several new Grafana dashboards have been introduced. These dashboards provide detailed insights into Kubernetes resource usage, CI/CD pipeline statuses, and namespace management. Developers can now easily monitor their deployments, resource utilization, and logs through these comprehensive dashboards.
These dashboards can be accessed by following the links printed in the GitLab CI job output, as in the example figure shown below. The links are prepopulated with the namespace and time range of the job so that the user can easily monitor the deployment.
Real-Time Dashboards
These dashboards are currently only available for CI/CD jobs running in STFC. They are based on Headlamp, a graphical user interface tailored to simplify the monitoring of Kubernetes deployments. It allows real-time monitoring of deployments, including pod status and custom resources such as TangoDBs, and inspection of deployment events, logs and metadata.
Tango DeviceServers: These are specific to the Tango Controls Applications
DatabaseDS: These are specific to the Tango Controls Applications
Logs
There are prepopulated and filtered log views, as shown in the figure above, for the Job, Test Pod, Namespace and other deployment-related logs. These are useful for debugging and for monitoring deployments. As Kibana URLs are hard to generate by hand and will fail if the underlying resources do not exist, please follow the URLs in the job output to access the logs.
Kubernetes Dashboards
The dashboards provide detailed insights into resource utilization across the deployments. Developers can monitor deployment health, CPU, memory, and storage usage, and identify any resource constraints. This helps in optimizing resource allocation and ensuring that applications run smoothly. These dashboards are also enriched with log integration to provide a comprehensive view of the deployments.
They can be found by following the URLs in the Gitlab CI Job output. Some of the dashboards are shown below. Note that you need to filter these for your specific namespace and time range.
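For orientation, resource panels like these are typically driven by PromQL queries over the standard cAdvisor and kube-state-metrics series. The sketch below uses the usual upstream metric names and the example namespace from later in this page; it is illustrative, not the exact queries behind the SKAO dashboards:

```promql
# CPU usage (cores) per pod in a namespace, averaged over 5 minutes
sum by (pod) (
  rate(container_cpu_usage_seconds_total{namespace="ci-skampi-st-605-mid"}[5m])
)

# Working-set memory per pod in the same namespace
sum by (pod) (
  container_memory_working_set_bytes{namespace="ci-skampi-st-605-mid"}
)
```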
CI/CD pipeline Dashboards
The dashboards provide detailed insights into CI/CD pipeline statuses. Developers can monitor pipeline health and job statuses, and identify any issues. This helps in optimising the pipelines and keeping them healthy.
They can be found by following the URLs below. Note that you need to filter these for your specific namespace and time range.
Namespace Management Dashboards
The dashboards provide detailed insights into namespace management. Developers can monitor namespace health and resource utilisation, and identify any issues. This helps in optimising namespaces and keeping them healthy. They also give an overview of namespace usage and of the resources allocated per GitLab project, team or user.
They can be found by following the URLs below.
Monitoring Solution
Prometheus and Thanos
The central monitoring solution is based on Prometheus, integrated with Thanos, providing high-availability and long-term storage capabilities while allowing for the data aggregation from multiple Prometheus targets.
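As a rough illustration of this pattern, each Prometheus instance typically carries external labels and runs a Thanos sidecar that uploads TSDB blocks to object storage, while a central Thanos Query aggregates all endpoints. The fragment below is a minimal, hypothetical sketch, not the actual SKAO configuration:

```yaml
# prometheus.yml (per datacentre) - external labels let Thanos
# identify and deduplicate series from each source
global:
  scrape_interval: 30s
  external_labels:
    datacentre: stfc    # hypothetical label values
    replica: A

# Thanos sidecar, run alongside each Prometheus (illustrative flags):
#   thanos sidecar \
#     --tsdb.path /prometheus \
#     --prometheus.url http://localhost:9090 \
#     --objstore.config-file bucket.yml   # long-term object storage
#
# Central aggregation point querying every sidecar (illustrative):
#   thanos query --endpoint <sidecar-1>:10901 --endpoint <sidecar-2>:10901
```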
Grafana
To monitor SKA infrastructure metrics from, for example, Kubernetes, GitLab Runners, the Elastic Stack or Ceph, Grafana dashboards should be used.
STFC Metrics URL: https://k8s.stfc.skao.int/grafana/ (until migration is complete)
Info
To log in, choose the “Sign in with Azure AD” option and use the <jira-username>@ad.skatelescope.org and <jira-password> combination. Once logged in, users can browse through the existing dashboards and monitor the desired metrics.
Users can also create their own dashboards and share them.
Prometheus Alerts
To check the Prometheus alerts generated for the core Kubernetes cluster and the infrastructure VMs, a user can choose between web access to the Prometheus Alertmanager UI and the Slack alert channels.
The URLs to access the Prometheus Alert Manager are:
STFC datacentre - http://monitoring.skao.stfc:9093/#/alerts
DP datacentre - http://monitoring.sdhp.skao:9093/#/alerts
It is important to note that these URLs are behind a VPN, so VPN access to the corresponding datacentre is required to access them.
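To illustrate the kind of alerts involved, a typical Prometheus alerting rule looks like the hypothetical example below; the actual SKAO rules, and the routing to the Slack channels listed next, live in the Prometheus and Alertmanager configuration:

```yaml
groups:
  - name: infrastructure          # hypothetical rule group
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 5m                   # only fire after 5 minutes of downtime
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} has been unreachable for 5 minutes"
```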
There are also two sets of Slack alerts channels, one that serves application alerts and another that serves developer related alerts. These are:
- STFC datacentre
Application alerts - #techops-alerts
Developer alerts - #techops-user-alerts
- DP datacentre
Application alerts - #dp-platform-alerts
Developer alerts - #dp-platform-user-alerts
Logging Solution
Filebeat and Elasticsearch
The central logging solution is based on Filebeat, collecting logs from the referred datacentres and shipping them to Elasticsearch.
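A minimal Filebeat sketch for this pattern is shown below, assuming the stock container input and `add_kubernetes_metadata` processor rather than the exact SKAO deployment; the Elasticsearch endpoint is a placeholder:

```yaml
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:    # enriches events with namespace, pod and node fields
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"

output.elasticsearch:
  hosts: ["https://elasticsearch.example:9200"]   # hypothetical endpoint
```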
Kibana
Info
To log in to Kibana, open the URL https://k8s.stfc.skao.int/kibana/app/logs/stream and choose the “Sign in with Azure AD” option. Use your <jira-username>@ad.skatelescope.org and <jira-password> combination to log in, then choose the option “Continue as Guest” to access Kibana.
Kibana allows log messages to be filtered on a series of fields. These fields can be added as columns to display information, using the Settings option; filtering by the values of those fields can be done directly in the Search box or by selecting the View details menu:
In the example above, in order to retrieve only the log messages relevant for the skampi development pipeline ci-skampi-st-605-mid, one should select the corresponding kubernetes.namespace field value. There are many other field options using Kubernetes information, for example kubernetes.node.name and kubernetes.pod.name, that can be used for efficient filtering.
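Programmatically, the same filtering can be expressed as an Elasticsearch bool query. The helper below is an illustrative sketch: the field names are the ones mentioned above, while the helper itself and the example pod name are assumptions for demonstration:

```python
def namespace_filter(namespace, extra_terms=None):
    """Build an Elasticsearch bool query filtering log documents by
    kubernetes.namespace, plus optional extra term filters
    (e.g. kubernetes.pod.name)."""
    filters = [{"term": {"kubernetes.namespace": namespace}}]
    for field, value in (extra_terms or {}).items():
        filters.append({"term": {field: value}})
    return {"query": {"bool": {"filter": filters}}}

# Filter the skampi pipeline namespace, narrowed to one (hypothetical) pod
query = namespace_filter(
    "ci-skampi-st-605-mid",
    extra_terms={"kubernetes.pod.name": "some-pod-0"},
)
```

The resulting dictionary can be passed as the request body of an Elasticsearch search against the logging indices.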
The SKA logging format allows simple key-value pairs (SKA Tags) to be included in log messages, which lets us refine the filtering further. Tags are parsed into a field named ska_tags, which can hold one or more device properties separated by commas. The ska_tags field is parsed further so that each key is appended to a ska_tags_field prefix that stores the corresponding value. For the example above, this means filtering the messages using the value of the ska_tags_field.tango-device field. Making the selection illustrated above means that only messages with the value ska_mid/tm_leaf_node/d0003 for the ska_tags_field.tango-device field would be displayed.
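To make the parsing step concrete, the sketch below shows how a comma-separated tag string might be expanded into ska_tags_field.* keys. It is an illustrative reimplementation of the behaviour described above, not the actual ingest pipeline:

```python
def parse_ska_tags(ska_tags):
    """Expand a comma-separated SKA Tags string such as
    'tango-device:ska_mid/tm_leaf_node/d0003' into a mapping of
    prefixed field names to values."""
    fields = {}
    for tag in ska_tags.split(","):
        if ":" not in tag:
            continue  # skip malformed tags without a key-value separator
        key, value = tag.split(":", 1)
        fields[f"ska_tags_field.{key.strip()}"] = value.strip()
    return fields

parsed = parse_ska_tags("tango-device:ska_mid/tm_leaf_node/d0003")
# parsed == {"ska_tags_field.tango-device": "ska_mid/tm_leaf_node/d0003"}
```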