Centralised Monitoring and Logging

A centralised monitoring and logging solution was designed to remove the need for separate monitoring dashboards and services in each datacentre. It aggregates data, centralises monitoring dashboards and alerting, and secures communication with monitoring systems following zero-trust principles. All SKAO datacentres used for testing, staging, integration or production have been fully integrated into this solution.

Developer-Friendly Dashboards

To address the evolving needs of developers, several new Grafana dashboards have been introduced. These dashboards provide detailed insights into Kubernetes resource usage, CI/CD pipeline statuses, and namespace management. Developers can now easily monitor their deployments, resource utilization, and logs through these comprehensive dashboards.

These dashboards can be accessed by following the links printed in the GitLab CI job output, as shown in the example figure below. The links are prepopulated with the namespace and time range of the job, so that the user can easily monitor the deployment.

Grafana Dashboard Links

Dashboard Links
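As a rough illustration of what "prepopulated with the namespace and time range" can mean for a Grafana link, the sketch below builds a dashboard URL from Grafana's standard `var-<name>`, `from` and `to` query parameters. The host, the dashboard path and the variable name `namespace` are assumptions made for the example. They are not the values used by the SKAO CI jobs, so the links in the job output remain the reference.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def grafana_link(base_url: str, dashboard_path: str, namespace: str,
                 start: datetime, end: datetime) -> str:
    """Build a Grafana dashboard URL scoped to one namespace and time window."""
    params = {
        # Grafana reads dashboard template variables from "var-<name>" parameters.
        # The variable name "namespace" is an assumption for this sketch.
        "var-namespace": namespace,
        # Absolute time ranges are passed as epoch milliseconds.
        "from": int(start.timestamp() * 1000),
        "to": int(end.timestamp() * 1000),
    }
    return f"{base_url.rstrip('/')}/{dashboard_path.lstrip('/')}?{urlencode(params)}"


link = grafana_link(
    "https://grafana.example.org",      # hypothetical host
    "d/abc123/kubernetes-namespace",    # hypothetical dashboard UID and slug
    "ci-skampi-st-605-mid",
    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc),
)
print(link)
```

Using timezone-aware datetimes keeps the conversion to epoch milliseconds independent of the machine's local time zone.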

Real-Time Dashboards

These dashboards are currently only available for CI/CD jobs running in STFC. They are based on Headlamp, a graphical user interface tailored to simplify the monitoring of Kubernetes deployments. It allows real-time monitoring of deployments, including pod status and custom resource definitions such as TangoDBs, and lets users investigate deployment events, logs and metadata.

Logs

Prepopulated and filtered log views, as shown in the figure above, are available for Job, Test Pod, Namespace and other deployment-related logs. They are useful for debugging and monitoring deployments. Kibana URLs are hard to generate by hand and fail if they point to something that does not exist, so please follow the URLs in the job output to access the logs.

Kubernetes Dashboards

The dashboards provide detailed insights into resource utilization across the deployments. Developers can monitor deployment health, CPU, memory, and storage usage, and identify any resource constraints. This helps in optimizing resource allocation and ensuring that applications run smoothly. These dashboards are also enriched with log integration to provide a comprehensive view of the deployments.

They can be found by following the URLs in the GitLab CI job output. Some of the dashboards are shown below. Note that you need to filter these for your specific namespace and time range.

CI/CD Pipeline Dashboards

These dashboards provide detailed insights into CI/CD pipeline status. Developers can monitor pipeline health and job statuses and identify any issues, which helps in optimizing the pipelines and keeping them healthy.

They can be found by following the URLs below. Note that you need to filter these for your specific namespace and time range.

Namespace Management Dashboards

These dashboards provide detailed insights into namespace management. Developers can monitor namespace health and resource utilization and identify any issues, which helps in optimizing namespaces and keeping them healthy. The dashboards also give an overview of namespace usage and of the resources allocated to each namespace per GitLab project, team or user.

They can be found by following the URLs below.

Monitoring Solution

Prometheus and Thanos

The central monitoring solution is based on Prometheus integrated with Thanos, which provides high availability and long-term storage while allowing data to be aggregated from multiple Prometheus targets.
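Thanos Query exposes the same HTTP query API as Prometheus, so aggregated data can be read with an ordinary instant query against `/api/v1/query`. The sketch below builds such a request URL and parses a response in the standard Prometheus JSON format. The host name, the `datacentre` label, the `site-b` value and the sample response body are assumptions made for illustration. The labels actually attached to SKAO metrics may differ.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL of a Prometheus-compatible instant query."""
    query = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def values_by_label(body: str, label: str) -> dict[str, float]:
    """Map one label of an instant-vector response to its sample value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        # Each sample is {"metric": {...labels...}, "value": [timestamp, "value"]}.
        sample["metric"].get(label, ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


# Hypothetical endpoint and label name: count healthy scrape targets per datacentre.
url = instant_query_url("https://thanos.example.org", "sum by (datacentre) (up)")

# Hand-written sample response in the Prometheus API format (not real data).
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"datacentre": "stfc"}, "value": [1704110400, "42"]},
            {"metric": {"datacentre": "site-b"}, "value": [1704110400, "7"]},
        ],
    },
})
targets_up = values_by_label(sample_body, "datacentre")
print(url)
print(targets_up)
```

The parsing step is kept separate from the HTTP call so that it can be exercised without network or VPN access.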

Grafana

Grafana dashboards should be used to monitor SKA infrastructure metrics from, for example, Kubernetes, GitLab Runners, Elastic Stack or Ceph.

Info

To log in, choose the “Sign in with Azure AD” option and use the <jira-username>@ad.skatelescope.org and <jira-password> combination. Once logged in, users can browse through the existing dashboards and monitor the desired metrics.

STFC Dashboards Browsing page

Users can also create their own dashboards and share them.

Dashboard sharing example

New Dashboard Sharing example

Prometheus Alerts

Prometheus alerts are generated for the core Kubernetes cluster and the infrastructure VMs. To check them, a user can choose between web access to the Prometheus Alert Manager UI and the Slack alert channels.

The URLs to access the Prometheus Alert Manager are:

STFC Alert Manager homepage

It is important to note that these URLs are behind a VPN, so VPN access to the corresponding datacentre is required to access them.

There are also two sets of Slack alerts channels, one that serves application alerts and another that serves developer related alerts. These are:

Logging Solution

Filebeat and Elasticsearch

The central logging solution is based on Filebeat, which collects logs from the datacentres mentioned above and ships them to Elasticsearch.

Kibana

Info

To log in to Kibana, open the URL https://k8s.stfc.skao.int/kibana/app/logs/stream, choose the “Sign in with Azure AD” option and log in with your <jira-username>@ad.skatelescope.org and <jira-password> combination. After logging in, choose the option “Continue as Guest” to access Kibana.

Kibana allows log messages to be filtered on a range of fields. These fields can be added as display columns using the Settings option, and filtering by field value can be done directly in the Search box or by selecting the View details menu:

kibana log stream, selecting "view details" for a particular CI pipeline

In the example above, to retrieve only the log messages relevant to the skampi development pipeline ci-skampi-st-605-mid, one should select the corresponding kubernetes.namespace field value.

Kibana Log event document details, selecting the kubernetes.namespace

There are many other field options using kubernetes information, for example kubernetes.node.name and kubernetes.pod.name, that can be used for efficient filtering.
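The same filters can be typed into the Search box as a query. The sketch below assembles a query string in Kibana Query Language (KQL) form, `field : "value"` clauses joined with `and`, from a mapping of field names to values. The helper name `kql_filter` and the pod name are hypothetical. The namespace is the one used in the example above.

```python
def kql_filter(fields: dict[str, str]) -> str:
    """Join field/value pairs into a KQL expression that must match all of them."""
    clauses = []
    for name, value in fields.items():
        # Quote the value so it is matched as a phrase; escape backslashes and quotes.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{name} : "{escaped}"')
    return " and ".join(clauses)


query = kql_filter({
    "kubernetes.namespace": "ci-skampi-st-605-mid",
    "kubernetes.pod.name": "example-pod-0",   # hypothetical pod name
})
print(query)
```

Quoting every value keeps characters such as `/` or `-` from being read as query syntax.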

Because the SKA logging format allows simple key-value pairs (SKA Tags) to be included in log messages, the filtering can be refined further. Tags are parsed into a field named ska_tags, which can hold one or more device properties separated by commas.

logs for the specified namespace

The ska_tags field is also parsed so that each key is appended to the ska_tags_field prefix, producing a field that stores the corresponding value. For the example above, this means filtering the messages by the value of the ska_tags_field.tango-device field.

selecting ska-tags to look at tango-device log messages

Making the selection illustrated above means that only messages with the value ska_mid/tm_leaf_node/d0003 for the ska_tags_field.tango-device field would be displayed.
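The tag handling described above can be pictured with a small sketch that splits a ska_tags string on commas and turns each key into a ska_tags_field.<key> entry. It is only an illustration, not the parser used in the logging pipeline. It assumes that key and value are separated by a colon, and the second tag in the example is hypothetical.

```python
def parse_ska_tags(tags: str, prefix: str = "ska_tags_field") -> dict[str, str]:
    """Split comma-separated key:value tags into '<prefix>.<key>' fields."""
    fields: dict[str, str] = {}
    for pair in tags.split(","):
        # Assumption: each tag has the form "key:value"; split on the first colon only.
        key, sep, value = pair.strip().partition(":")
        if not sep or not key:
            continue  # skip empty or malformed entries
        fields[f"{prefix}.{key.strip()}"] = value.strip()
    return fields


# The first tag matches the example above; "example-key" is hypothetical.
fields = parse_ska_tags("tango-device:ska_mid/tm_leaf_node/d0003,example-key:example-value")
print(fields["ska_tags_field.tango-device"])
```

Splitting on the first colon only means that values which themselves contain colons are preserved intact.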