Centralised Monitoring, Logging & Debugging

A centralised monitoring and logging solution was designed to eliminate the need for many dashboards and services spread across different datacentres. It aggregates data and brings monitoring dashboards, logging and alerting together in one place, communicating securely with the monitoring systems according to zero-trust principles. All SKAO datacentres used for testing, staging, integration or production are fully integrated into this solution. Services that facilitate debugging of applications and infrastructure are also available in some datacentres.

Note

The usability of the logging system relies heavily on the quality of the logged data. Please make sure your application adheres to the SKAO Logging Format.

Note that not all environments use the same logging or monitoring service, nor do they all have the same services deployed. Some of these services are not, and will not be, available in production environments. The relevant URLs are listed below:

Datacentre           Monitoring                   Logging
stfc-techops (cicd)  https://monitoring.skao.int  https://k8s.stfc.skao.int/kibana
stfc-dp (cicd)       https://monitoring.skao.int  https://k8s.stfc.skao.int/kibana
aws-*                https://monitoring.skao.int  https://k8s.stfc.skao.int/kibana
mid-itf/low-itf      https://monitoring.skao.int  https://k8s.stfc.skao.int/kibana
mid-aa               https://monitoring.skao.int  https://k8s.mid.internal.skao.int/kibana
low-aa               https://monitoring.skao.int  https://k8s.low.internal.skao.int/kibana

Note

If your environment requires local access, please follow this Confluence page.

Datacentre           Headlamp                             Coder
stfc-techops (cicd)  https://k8s.stfc.skao.int/headlamp   https://coder.k8s.stfc.skao.int/login
stfc-dp (cicd)       https://sdhp.stfc.skao.int/headlamp  N/A
aws-*                N/A                                  N/A
mid-itf/low-itf      Local access                         N/A
mid-aa               Local access                         N/A
low-aa               Local access                         N/A

Developer-centred tools

To address the evolving needs of developers, logs, real-time cluster access and custom Grafana dashboards are available. These tools provide detailed insights into Kubernetes resource usage, CI/CD pipeline statuses, and namespace management. Developers can now easily monitor their deployments, resource utilisation and logs through these comprehensive dashboards.

Pipeline job artefacts

Test jobs are the only scenario in the pipeline with a clear start and end point in time. To provide test-scoped information, we therefore also publish logs and Pod descriptions in the pipeline job artefacts.

Pipeline job artefacts

These contain:

  • Kubernetes events: A dump of all the events in the relevant namespace

  • Kubernetes Pod descriptions: The output of kubectl describe for every Pod in the namespace, useful to understand why the system is not becoming ready (e.g. image errors, missing secrets or volumes)

  • Kubernetes Pod logs: Logs of all the pods

To download and extract all the files, run:

$ # curl -L <gitlab job url>/artifacts/download -o job.zip
$ curl -L https://gitlab.com/ska-telescope/ska-ser-oci-daemon/-/jobs/9819118142/artifacts/download -o job.zip
$ mkdir -p artefacts
$ cd artefacts
$ unzip ../job.zip
$ ls -R

This yields the following files locally, available to developers for further investigation:

.:
k8s-logs  pip_list.txt  status

./k8s-logs:
describe  k8s-events.log  logs

./k8s-logs/describe:
test-makefile-runner-9819118142-describe.txt                 test-ska-ser-oci-daemon-node-4kgrk-describe.txt
test-mirror-registry-a-777cfff67d-grzht-describe.txt         test-ska-ser-oci-daemon-node-4q7p6-describe.txt
test-mirror-registry-b-df496f665-5wfxn-describe.txt          test-ska-ser-oci-daemon-node-tbppr-describe.txt
test-ska-ser-oci-daemon-cache-6fc8f7549c-ntltg-describe.txt  test-ska-ser-oci-daemon-node-zvp45-describe.txt

./k8s-logs/logs:
test-makefile-runner-9819118142-logs.txt                 test-ska-ser-oci-daemon-node-4kgrk-logs.txt
test-mirror-registry-a-777cfff67d-grzht-logs.txt         test-ska-ser-oci-daemon-node-4q7p6-logs.txt
test-mirror-registry-b-df496f665-5wfxn-logs.txt          test-ska-ser-oci-daemon-node-tbppr-logs.txt
test-ska-ser-oci-daemon-cache-6fc8f7549c-ntltg-logs.txt  test-ska-ser-oci-daemon-node-zvp45-logs.txt
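
Once extracted, the Pod logs can be triaged with standard tools. The following is a minimal sketch, not part of the pipeline machinery: the helper name count_log_matches and the default pattern ERROR|CRITICAL are assumptions for illustration, so adjust the pattern to the log levels your application actually emits. It ranks the files under k8s-logs/logs by the number of matching lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rank extracted Pod log files by number of matching lines.
# Arguments: <logs directory> (default: k8s-logs/logs), <extended regex> (default: ERROR|CRITICAL).
count_log_matches() {
    local logs_dir="${1:-k8s-logs/logs}"
    local pattern="${2:-ERROR|CRITICAL}"   # assumed pattern, adapt to your log format
    local file count
    for file in "$logs_dir"/*-logs.txt; do
        [ -e "$file" ] || continue         # skip when the glob matches nothing
        # grep -c prints the count even when it is 0 (and then exits non-zero)
        count=$(grep -cE -- "$pattern" "$file" || true)
        printf '%s\t%s\n' "$count" "$(basename "$file")"
    done | sort -rn                        # noisiest files first
}
```

Run from inside the artefacts directory, count_log_matches with no arguments prints one line per Pod log file, with the files containing the most matches at the top.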

Understand what is deployed

Together with the pipeline links, we also provide information on what is being deployed into Kubernetes clusters. Again, this requires integration with the pipeline machinery.

***Gathering information for namespace: ci-ska-tango-examples-7cabaa1f***
OCI images for pod databaseds-tangodb-tango-databaseds-0:
     artefact.skao.int/ska-tango-images-tango-db:11.0.2
OCI images for pod ska-tango-base-itango-console:
     artefact.skao.int/ska-tango-images-tango-itango:9.5.0
OCI images for pod theexample-admin-test-6477f9cdb5-lcktq:
     docker.io/alpine:3.12

Installed Helm charts:
   test:
      Chart: ska-tango-examples-test-parent-0.1.16
      App Version: 0.1.16
   Dependencies:
      * ska-dashboard-repo @ 0.1.9
      * ska-tango-base @ 0.4.16 | 0.4.18 Available!
      * ska-tango-examples @ 0.5.1 | Local (file://../ska-tango-examples)
      * ska-tango-taranta @ 2.8.3 | 2.14.1 Available!
      * ska-tango-taranta-auth @ 0.2.2 | 0.2.5 Available!
      * ska-tango-util @ 0.4.16 | 0.4.18 Available!

Note that we output the images for each Pod, as well as the installed Helm chart and its dependencies. If available, we also display the latest version of a given chart (SKAO charts only). The same information is also available in Headlamp’s namespace view, as shown below:

Headlamp namespace release information

The chart version and outdated-dependency information makes it easy for developers to see when newer versions of their dependencies are ready for consumption.
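
The outdated dependencies can also be pulled out of the text output with a short filter. This is a sketch under stated assumptions: the helper name list_outdated_charts is hypothetical, the output is assumed to have been saved to a file first, and dependency lines are assumed to follow the "* <chart> @ <current> | <latest> Available!" shape seen in the example output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list dependencies flagged "Available!" in saved namespace-info output.
# Assumed line shape: "* <chart> @ <current> | <latest> Available!"
list_outdated_charts() {
    local info_file="$1"
    # Fields: $1="*", $2=chart, $3="@", $4=current, $5="|", $6=latest, $7="Available!"
    awk '$1 == "*" && $NF == "Available!" { printf "%s %s -> %s\n", $2, $4, $6 }' "$info_file"
}

# Example (file name is an assumption):
#   KUBE_NAMESPACE=<namespace> HELM_RELEASE=<release> make k8s-namespace-info > namespace-info.txt
#   list_outdated_charts namespace-info.txt
```

Dependencies without a newer version, or those resolved from a local path, are not matched and therefore not listed.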

Understand deployment status

Sometimes we deploy an application but forget to monitor its health, especially when it is a long-running environment that is not under active use. Nonetheless, that environment is running and should be healthy; otherwise it is taking resources away from other deployments. To address this, we have introduced the SKA Namespace Manager, a service that actively monitors namespaces. It evaluates the health of each namespace every minute and, if ownership metadata is available, notifies its users.

Marvin Namespace Manager

As shown above, Marvin can alert namespace owners via Slack direct messages about changes in the health of their namespaces. The alerts list the affected resources and, when applicable, suggestions and runbooks to help you resolve the issue. Critically, each alert links to the job that deployed the namespace, making it easy to find the other related links shown previously.

Headlamp namespace status

The namespace status and some related information are also shown in Headlamp, our cluster access solution.

Make targets

To power some of the solutions mentioned previously, some new make targets were added or improved:

  • KUBE_NAMESPACE=<namespace> HELM_RELEASE=<release> make k8s-namespace-info -> Outputs all Pods and the images for all their containers, together with the Helm chart release dependencies, as shown under Understand what is deployed

  • make k8s-namespace-links -> Creates and outputs the prebuilt URLs. Requires some runner-specific environment variables to operate properly

  • VERBOSE_WAIT=true KUBE_NAMESPACE=<namespace> make k8s-podlogs -> Outputs the logs for all init-containers and pods in a namespace, similar to the pipeline artefacts. To use it in the pipeline to show logs when k8s-wait fails, set VERBOSE_WAIT=true

Note that, when integrated with the pipeline machinery, these targets are already called and their output is placed where it is most convenient.
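
Outside the pipeline, the same targets can be combined by hand. The sketch below assumes it is run from a repository that includes the SKAO make machinery; the function name namespace_report and the output file names are hypothetical, while the targets and variables are the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: save the output of the make targets listed above to files.
# Arguments: <namespace> <helm release> [output directory]
namespace_report() {
    local namespace="$1"
    local release="$2"
    local out_dir="${3:-namespace-report}"   # assumed default directory name
    mkdir -p "$out_dir" &&
    # Pods, container images and Helm chart dependencies
    KUBE_NAMESPACE="$namespace" HELM_RELEASE="$release" \
        make k8s-namespace-info > "$out_dir/namespace-info.txt" &&
    # Logs for all init-containers and Pods in the namespace
    VERBOSE_WAIT=true KUBE_NAMESPACE="$namespace" \
        make k8s-podlogs > "$out_dir/podlogs.txt"
}

# Example: namespace_report <namespace> <release>
```

Writing to files rather than the terminal keeps the result comparable with the pipeline job artefacts.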

CI/CD pipeline Dashboards

The dashboards provide detailed insights into the CI/CD pipeline statuses. Developers can monitor pipeline health and job statuses, and identify any issues. This helps in optimising pipelines and keeping them healthy.

They can be found by following the URLs below. Note that you need to filter these for your specific namespace and time range.

CI/CD usage Dashboards

Using Kibana, we provide dashboards that show cluster usage across the project, covering all clusters that communicate with the central logging solution.

The last dashboard in particular was instrumental in finding a software component (mainly by analysing the Line locations view) that was extremely verbose and causing load issues in Elasticsearch.

Namespace Management Dashboards

The dashboards provide detailed insights into namespace management. Developers can monitor namespace health and resource utilisation, and identify any issues. This helps in optimising namespaces and keeping them healthy. The dashboards also provide an overview of namespace usage and of the resources allocated to each namespace per GitLab project, team or user.

They can be found by following the URLs below.

Logging solution

Logging in SKA is handled with Elasticsearch, bundled with Kibana as a frontend. This frontend is better suited to creating visualisations than to searching logs. As mentioned earlier, we provide prepopulated URLs for various useful queries, such as test Pod logs, Device Server configuration logs, or the logs of the whole namespace.

The usefulness of the tools comes from the queries we can make. Here are some examples:

Some visualisations of project namespace usage across our Kubernetes clusters are also available.

How to

You can learn more about how to work with the logging solution to parse logs efficiently when debugging.

Monitoring solution

Monitoring in SKA is handled with Prometheus, bundled with Grafana as a frontend. The metrics collected by Prometheus can be used to:

  • Create dashboards that provide detailed insights into the health and status of infrastructure and deployments

  • Create alarms to monitor health and status of infrastructure and deployments

Several dashboards were created so that developers can monitor deployment health and CPU, memory and storage usage, and identify any resource constraints. This helps in optimising resource allocation and ensuring that applications run smoothly. Some of these dashboards are also enriched with log integration to provide a comprehensive view of the deployments, which allows for a more in-depth analysis of application behaviour.

They can be found by following the URLs in the GitLab CI job output. Some of the dashboards are shown below. Note that you need to filter these for your specific namespace and time range.

How to

You can learn more about how to work with the monitoring tools to understand and improve the behaviour of your applications.

Headlamp: Real-Time dashboards & Cluster access

To provide web-based real-time access to the Kubernetes clusters, we deploy Headlamp. It is a graphical user interface specifically tailored to simplify the monitoring of Kubernetes clusters. It allows real-time monitoring of deployments, including Pod status and custom resources such as TangoDBs, and lets you investigate deployment events, logs and metadata. Headlamp is not available in the same form on all clusters, so some of the dashboards mentioned might not be accessible.

Headlamp Home

This tool is similar to K9s in that it provides a graphical visualisation of the resources available in the Kubernetes API. On top of the default tools, we provide views that link with the rest of the provided solutions.

Headlamp namespace cicd metadata

This table on the namespace page provides metadata about the namespace, such as the project or team that deployed it. It also links to the job that deployed the namespace, which in turn leads you to all of the prebuilt logging and monitoring links.

We also provide summary views of the SKAO usage (teams, projects and users) of the given Kubernetes cluster:

Headlamp team namespace usage

Coder: Remote debugging solution

When we deploy applications to Kubernetes clusters, debugging is hard if there is no access to the cluster or no way to run a debugging session. Coder addresses this: it is a self-hosted cloud development environment that integrates with multiple IDEs, providing secure access to remote development environments.

After logging in with GitLab, you can create a workspace in the Kubernetes cluster using the “Kubernetes” template:

Coder workspace creation

After giving it a name and compute requirements (i.e. CPU, RAM and disk), you can launch it. Once it has launched, you can access it in multiple ways:

Coder connection options

You can access it with:

  • VS Code Desktop app

  • Browser-based JupyterLab

  • Browser-based VS Code

  • Browser-based terminal

  • SSH

Coder Jupyterlab terminal

Coder Jupyterlab terminal

From the terminal (using any connection option) you have limited access to the Kubernetes cluster, but it is possible to view Pod logs and to describe Pods and other resources. Other tools like Helm, K9s or tango_admin are also available.

Note that, as you are accessing the cluster directly, you have access to all the logs available and are NOT limited by the retention policies of the logging or monitoring data.
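
As an illustration, the layout of the pipeline job artefacts can be reproduced from a workspace terminal with plain kubectl. This is a sketch, not an official tool: the function name dump_namespace is hypothetical, and it assumes the limited access granted to your workspace allows reading events, Pods and logs in the target namespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper: dump events, Pod descriptions and Pod logs for a namespace,
# mirroring the k8s-logs layout of the pipeline job artefacts.
# Arguments: <namespace> [output directory]
dump_namespace() {
    local namespace="$1"
    local out_dir="${2:-k8s-logs}"
    local pod name
    mkdir -p "$out_dir/describe" "$out_dir/logs"
    # All events in the namespace
    kubectl get events -n "$namespace" > "$out_dir/k8s-events.log"
    # "-o name" prints one "pod/<name>" entry per line
    for pod in $(kubectl get pods -n "$namespace" -o name); do
        name="${pod#pod/}"
        kubectl describe "$pod" -n "$namespace" > "$out_dir/describe/${name}-describe.txt"
        kubectl logs "$pod" -n "$namespace" --all-containers=true \
            > "$out_dir/logs/${name}-logs.txt" 2>&1
    done
}

# Example: dump_namespace <namespace>
```

Because this reads from the cluster directly, it captures whatever logs the Pods still hold at the time it is run.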

How to

You can learn more about how to leverage Coder to debug your application while it is running in the Kubernetes cluster.