Centralised Monitoring, Logging & Debugging
A centralised monitoring and logging solution was designed to eliminate the need for many dashboards and services spread across different datacentres. It aggregates data and centralises monitoring dashboards, logging and alerting, and it communicates securely with the monitoring systems following zero trust principles. All SKAO datacentres used for testing, staging, integration or production have been fully integrated into this solution. Services that facilitate debugging of applications and infrastructure are also available in some datacentres.
Note
The usability of the logging system relies heavily on the quality of the logged data. Please make sure your application adheres to the SKAO Logging Format.
Note that not all environments use the same logging or monitoring service nor have the same services deployed. Some of these are not and will not be available in production environments. Below you can find the relevant URLs:
Datacentre | Monitoring | Logging
stfc-techops (cicd) | |
stfc-dp (cicd) | |
aws-* | |
mid-itf/low-itf | |
mid-aa | |
low-aa | |
Note
If your environment requires local access, please follow this Confluence page.
Datacentre | Headlamp | Coder
stfc-techops (cicd) | |
stfc-dp (cicd) | N/A |
aws-* | N/A | N/A
mid-itf/low-itf | | N/A
mid-aa | | N/A
low-aa | | N/A
Developer centered tools
To address the evolving needs of developers, logs, real-time cluster access and custom Grafana dashboards are available. These tools provide detailed insights into Kubernetes resource usage, CI/CD pipeline statuses, and namespace management. Developers can now easily monitor their deployments, resource utilisation and logs through these comprehensive dashboards.
Pipeline links
Links to these resources are displayed in Gitlab CI Job output (example). These are prepopulated with the metadata and time range of the job so that the user can easily monitor the deployment. Note that for some long-running namespaces (like integration and staging), the end date in the dashboards is probably later than what is set.
Note
To be able to have these prepopulated links in your project you need to integrate with the pipeline machinery.
Gitlab pipeline general links
Gitlab pipeline logs links
Gitlab pipeline monitoring links
In this list, we provide:
Headlamp URL for the namespace
Pipeline log files for all namespace’s pods, as well as pod descriptions
Kibana links for the namespace, test pods, Device Server configuration logs, etc
Grafana links for namespace, Device Servers and other compute-related dashboards
When using these URLs, note:
Monitoring and logging data are only available for a certain retention period. After that period, the queries will return nothing
Kibana URLs must be copied and pasted into the browser, as clicking them doesn’t work properly due to their length
Merge Request links
The aforementioned links are also provided automatically in your merge requests. Using an ska-tango-examples merge request as an example:
Marvin MR links
This table neatly scopes the available links we display in the job’s logs. Note that you have an entry per job, with the relevant deployment type, namespace and the target cluster.
Pipeline job artefacts
In order to provide test-scoped information - the only scenario in the pipeline where we have a clear start and end point in time - we are also providing logs and Pod descriptions in the pipeline’s jobs’ artefacts.
Pipeline job artefacts
These contain:
Kubernetes events: A dump of all the events in the relevant namespace
Kubernetes Pod descriptions: The output of kubectl describe for every Pod in the namespace, useful to understand why the system is not becoming ready (e.g., image errors, missing secrets or volumes, etc.)
Kubernetes Pod logs: Logs of all the pods
To pull all the files, do:
$ # Generic form: curl -L <gitlab job url>/artifacts/download -o job.zip
$ curl -L https://gitlab.com/ska-telescope/ska-ser-oci-daemon/-/jobs/9819118142/artifacts/download -o job.zip
$ mkdir -p artefacts
$ cd artefacts
$ unzip ../job.zip
$ ls -R
This makes the following files available locally for further investigation:
.:
k8s-logs pip_list.txt status
./k8s-logs:
describe k8s-events.log logs
./k8s-logs/describe:
test-makefile-runner-9819118142-describe.txt test-ska-ser-oci-daemon-node-4kgrk-describe.txt
test-mirror-registry-a-777cfff67d-grzht-describe.txt test-ska-ser-oci-daemon-node-4q7p6-describe.txt
test-mirror-registry-b-df496f665-5wfxn-describe.txt test-ska-ser-oci-daemon-node-tbppr-describe.txt
test-ska-ser-oci-daemon-cache-6fc8f7549c-ntltg-describe.txt test-ska-ser-oci-daemon-node-zvp45-describe.txt
./k8s-logs/logs:
test-makefile-runner-9819118142-logs.txt test-ska-ser-oci-daemon-node-4kgrk-logs.txt
test-mirror-registry-a-777cfff67d-grzht-logs.txt test-ska-ser-oci-daemon-node-4q7p6-logs.txt
test-mirror-registry-b-df496f665-5wfxn-logs.txt test-ska-ser-oci-daemon-node-tbppr-logs.txt
test-ska-ser-oci-daemon-cache-6fc8f7549c-ntltg-logs.txt test-ska-ser-oci-daemon-node-zvp45-logs.txt
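Once the artefacts are unpacked, a quick first pass is to search the Pod descriptions for well-known Kubernetes failure reasons. The function below is a minimal sketch: the name scan_describe and the list of reasons are illustrative assumptions, not part of the pipeline machinery, and it only assumes the *-describe.txt naming shown in the listing above.

```shell
# Hypothetical helper: list describe files that mention common Pod failure reasons.
# Usage: scan_describe <path to k8s-logs/describe>
scan_describe() {
  local dir="${1:-k8s-logs/describe}"
  # -E: extended regex, -H: always print the file name, -n: print line numbers
  grep -EHn 'ImagePullBackOff|ErrImagePull|CrashLoopBackOff|FailedMount|FailedScheduling' \
    "$dir"/*-describe.txt
}
```

For example, running scan_describe k8s-logs/describe from inside the artefacts directory prints each matching line prefixed by its file name and line number. It returns a non-zero status when nothing matches.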
Understand what is deployed
Together with the pipeline links, we also provide information on what is being deployed into Kubernetes clusters. Again, this requires the pipeline machinery integration.
***Gathering information for namespace: ci-ska-tango-examples-7cabaa1f***
OCI images for pod databaseds-tangodb-tango-databaseds-0:
artefact.skao.int/ska-tango-images-tango-db:11.0.2
OCI images for pod ska-tango-base-itango-console:
artefact.skao.int/ska-tango-images-tango-itango:9.5.0
OCI images for pod theexample-admin-test-6477f9cdb5-lcktq:
docker.io/alpine:3.12
Installed Helm charts:
test:
Chart: ska-tango-examples-test-parent-0.1.16
App Version: 0.1.16
Dependencies:
* ska-dashboard-repo @ 0.1.9
* ska-tango-base @ 0.4.16 | 0.4.18 Available!
* ska-tango-examples @ 0.5.1 | Local (file://../ska-tango-examples)
* ska-tango-taranta @ 2.8.3 | 2.14.1 Available!
* ska-tango-taranta-auth @ 0.2.2 | 0.2.5 Available!
* ska-tango-util @ 0.4.16 | 0.4.18 Available!
Note that we output the images for pods, as well as the installed Helm chart and its dependencies. If available we also display the latest version of a given chart (only for SKAO charts). The same information is also available in Headlamp’s namespace view, as shown below:
Headlamp namespace release information
The chart version and update information is very useful: it makes it easy for developers to know when newer versions of their dependencies are ready for consumption.
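To pull only the outdated dependencies out of the namespace information, the text output can be filtered. This is a minimal sketch that assumes the output keeps the exact line layout of the sample above ("* <chart> @ <current> | <latest> Available!"); the function name outdated_deps is hypothetical.

```shell
# Hypothetical helper: read the namespace information on stdin and print
# one line per dependency flagged with "Available!".
outdated_deps() {
  # Fields: "*" <chart> "@" <current> "|" <latest> "Available!"
  awk '$1 == "*" && $NF == "Available!" { print $2, $4, "->", $6 }'
}
```

Piping the sample output above through outdated_deps would list ska-tango-base, ska-tango-taranta, ska-tango-taranta-auth and ska-tango-util, each with its current and latest version. Dependencies marked as Local are skipped.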
Understand deployment status
Sometimes we deploy an application but forget to monitor its health, especially when it is a long-running environment that is not under active use. Nonetheless, that environment is running and should be healthy, otherwise it is taking resources away from other deployments. To overcome that, we’ve introduced the SKA Namespace Manager - a service that actively monitors namespaces. This service evaluates the health of each namespace every minute and notifies its users, if ownership metadata is available.
Marvin Namespace Manager
As you can see, Marvin can alert the owner of a namespace about changes in its health via Slack direct messages. We can see various alerts, the affected resources and - when applicable - suggestions and runbooks to help you resolve the issue. Critically, it links to the job that deployed the namespace, making it extremely easy to find other related links, like those previously shown.
Headlamp namespace status
The status and some related information are also shown in Headlamp, our cluster access solution.
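For a quick manual check of the same kind, the Pod list of a namespace can be filtered for Pods that are not fully ready. This is only a sketch and is not how the SKA Namespace Manager evaluates health; the function name unhealthy_pods is hypothetical, and it assumes the default kubectl get pods column order (NAME, READY, STATUS).

```shell
# Hypothetical filter: reads `kubectl get pods --no-headers` output on stdin
# and prints the name, readiness and status of Pods that look unhealthy.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")            # READY column is "<ready>/<total>"
    if ($3 == "Completed") next      # finished Pods (e.g. from Jobs) are fine
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}
```

It can be used as kubectl get pods -n <namespace> --no-headers | unhealthy_pods, and prints nothing when every Pod is running with all containers ready.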
Make targets
To power some of the solutions mentioned previously, some new make targets were added or improved:
KUBE_NAMESPACE=<namespace> HELM_RELEASE=<release> make k8s-namespace-info -> Outputs all Pods and images for all the containers, together with Helm chart release dependencies, as shown here
make k8s-namespace-links -> Creates and outputs the prebuilt URLs. Requires some runner-specific environment variables to operate properly
VERBOSE_WAIT=true KUBE_NAMESPACE=<namespace> make k8s-podlogs -> Outputs the logs for all init-containers and pods in a namespace, similar to the pipeline artefacts. To use it in the pipeline to show logs when k8s-wait fails, set VERBOSE_WAIT=true
Note that when a project is integrated with the pipeline machinery, these targets are already called and their output is placed where it is most convenient.
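When running the targets by hand, it can be handy to capture their output in files for later inspection. The wrapper below is a minimal sketch: the name collect_ns_info and the output file names are hypothetical, and it assumes it is run from a repository where the make targets above are available.

```shell
# Hypothetical wrapper: run the namespace-info and podlogs targets for one
# namespace and store their output under a local directory.
# Usage: collect_ns_info <namespace> <release> [output dir]
collect_ns_info() {
  local ns="$1" release="$2" out="${3:-ns-debug}"
  mkdir -p "$out"
  # Pods, images and Helm chart dependencies
  KUBE_NAMESPACE="$ns" HELM_RELEASE="$release" make k8s-namespace-info \
    > "$out/namespace-info.txt" 2>&1
  # Logs of all init-containers and pods
  VERBOSE_WAIT=true KUBE_NAMESPACE="$ns" make k8s-podlogs \
    > "$out/podlogs.txt" 2>&1
}
```

The k8s-namespace-links target is left out on purpose, since it depends on runner-specific environment variables.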
CI/CD pipeline Dashboards
The dashboards provide detailed insights into the CI/CD pipeline statuses. Developers can monitor the pipeline health, job statuses, and identify any issues. This helps in optimising the pipeline and ensuring that the pipelines are healthy.
They can be found by following the URLs below. Note that you need to filter these for your specific namespace and time range.
CI/CD usage Dashboards
Using Kibana, we provide dashboards that show cluster usage across the project, covering all clusters that communicate with the central logging.
The last dashboard in particular was instrumental in finding a software component (mainly by analysing the Line locations view) that was extremely verbose and causing load issues in Elasticsearch.
Namespace Management Dashboards
The dashboards provide detailed insights into namespace management. Developers can monitor namespace health and resource utilisation, and identify any issues. This helps in optimising namespaces and ensuring that they are healthy. The dashboards also provide an overview of namespace usage and of the resources allocated to a namespace per Gitlab project, team or user.
They can be found by following the URLs below.
Logging solution
Logging in SKA is handled with Elasticsearch, bundled with Kibana as a frontend. This frontend is more suitable for creating visualisations than for actually searching logs. As mentioned earlier, we provide prepopulated URLs to various useful queries like test pod logs, Device Server configuration logs, or the whole namespace.
The usefulness of the tools comes from the queries we can make. Here are some examples:
There are also some visualisations of project namespace usage across our Kubernetes clusters.
How to
You can learn more about how to work with the logging solution to parse logs efficiently for debugging purposes.
Monitoring solution
Monitoring in SKA is handled with Prometheus, bundled with Grafana as a frontend. The metrics collected by Prometheus can be used to:
Create dashboards that provide detailed insights into the health and status of infrastructure and deployments
Create alarms to monitor health and status of infrastructure and deployments
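As an illustration of what such metrics look like, the sketch below asks Prometheus for the memory working set of each Pod in a namespace through the standard Prometheus HTTP API. It assumes you can reach a Prometheus endpoint directly, which this page does not confirm; PROM_URL and the function name ns_memory_by_pod are hypothetical, and container_memory_working_set_bytes is the usual cAdvisor metric rather than a confirmed SKAO one.

```shell
# Hypothetical helper: query Prometheus for per-Pod memory usage in a namespace.
# PROM_URL is an assumption; set it to a Prometheus endpoint you can reach.
ns_memory_by_pod() {
  local ns="$1"
  local query="sum by (pod) (container_memory_working_set_bytes{namespace=\"$ns\"})"
  # -G sends the URL-encoded data as the query string of a GET request
  curl -sG --data-urlencode "query=$query" "${PROM_URL:?set PROM_URL}/api/v1/query"
}
```

The response is JSON, so it can be piped into a JSON tool of your choice. In most cases the Grafana dashboards remain the more convenient way to explore the same data.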
Some extremely useful dashboards were created so that developers can monitor deployment health, CPU, memory, and storage usage, and identify any resource constraints. This helps in optimising resource allocation and ensuring that applications run smoothly. Some of these dashboards are also enriched with log integration to provide a comprehensive view of the deployments, which allows for a more in-depth analysis of application behaviour.
They can be found by following the URLs in the Gitlab CI Job output. Some of the dashboards are shown below. Note that you need to filter these for your specific namespace and time range.
How to
You can learn more about how to work with the monitoring tools to understand and improve the behaviour of your applications.
Headlamp: Real-Time dashboards & Cluster access
To provide web-based real-time access to the Kubernetes clusters, we deploy Headlamp, a graphical user interface specifically tailored for simplifying the monitoring of Kubernetes clusters. It allows real-time monitoring of deployments, such as Pod status and custom resource definitions such as TangoDBs, and lets you investigate deployment events, logs and metadata. It is not available in equal form for all clusters, so the mentioned dashboards might not be accessible.
Tango DeviceServers: These are specific to the Tango Controls Applications
DatabaseDS: These are specific to the Tango Controls Applications
Headlamp Home
This tool is similar to K9s in the sense that it provides a graphical visualisation of the resources available in the Kubernetes API. On top of the default tools, we provide views that link with the rest of the provided solutions.
Headlamp namespace cicd metadata
This table in the namespace page provides metadata on the namespace, such as the project or team that deployed it. Links to the job that deployed this namespace are also available, which in turn lead you to all of the prebuilt logging and monitoring links.
Also, we provide some summary views on the SKAO usage (teams, projects and users) of the given Kubernetes cluster:
Headlamp team namespace usage
Coder: Remote debugging solution
When we deploy applications to Kubernetes clusters, it is hard to do debugging when there is no access to the cluster or we can’t run a debugging session. Luckily, we have Coder. It is a self-hosted cloud development environment that integrates with multiple IDEs, providing secure access to remote development environments.
After logging in with Gitlab, you can create a workspace in the Kubernetes cluster using the “Kubernetes” template:
Coder workspace creation
After giving it a name and compute requirements (i.e. CPU, RAM and disk), you can launch it. When it launches, you can access it in multiple ways:
Coder connection options
You can access it with:
VS Code Desktop app
Browser-based JupyterLab
Browser-based VS Code
Browser-based terminal
SSH
Coder Jupyterlab terminal
From the terminal (using any connection option) you have limited access to the Kubernetes cluster, but it is possible to view Pod logs and describe Pods and other resources. Other tools like Helm, K9s or tango_admin are also available.
Note that, as you are accessing the cluster directly, you have access to all the logs available and are NOT limited by the retention policies of the logging or monitoring data.
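From a Coder terminal, a typical first look at a misbehaving Pod could be sketched as below. The function name pod_debug is hypothetical, and whether each command is permitted depends on the limited access granted to your workspace.

```shell
# Hypothetical helper for a Coder terminal: describe a Pod, show its recent
# logs and list the events of its namespace, oldest first.
# Usage: pod_debug <namespace> <pod>
pod_debug() {
  local ns="$1" pod="$2"
  kubectl -n "$ns" describe pod "$pod"
  # Last 100 log lines of every container in the Pod
  kubectl -n "$ns" logs "$pod" --all-containers --tail=100
  kubectl -n "$ns" get events --sort-by=.lastTimestamp
}
```

Because these commands talk to the Kubernetes API directly, they only cover Pods that still exist in the cluster. For Pods that are already gone, the pipeline artefacts and Kibana remain the places to look.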
How to
You can learn more about how to leverage Coder to debug your application while it is running in the Kubernetes cluster.