Kubernetes-based Runners Architecture

GitLab runners are orchestrated by a Kubernetes cluster. They can be deployed to any Kubernetes cluster using the gitlab_runner Ansible collection. The main architecture is illustrated below.

[Figure: architecture of the Kubernetes-based GitLab runners]

Features

  • The main runner pod is deployed from the repository with a Helm chart, in the gitlab namespace.

  • The main runner pod is registered with the ska-telescope group as a shared runner, with configurable tags.

  • The main pod picks up GitLab jobs and creates pods on demand. This is configured using the Helm chart values file or the config.toml file of the GitLab runners.

  • Runners are scaled according to configuration.

  • Runners have resource requests and limits, e.g. cpuRequests, memoryRequests, memoryLimit.

  • Runners run on nodes that are specifically labelled for CI/CD jobs.

  • Runners share a cache that is used to speed up jobs.

  • Dedicated BuildX worker pools provide multi-platform container builds through Docker BuildX (ska-buildx tag).

  • Specific runners exist for Kubernetes deployment and integration testing (ska-k8s tag).

  • Docker support

  • Kubernetes support
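As a rough sketch of how the configuration described in the features above might look in a Helm values file for the GitLab Runner chart: the request, limit and concurrency figures are taken from the ska-default runner, while the node label `ci-node` is a hypothetical placeholder and the overall layout is an assumption, not the project's actual values file.

```yaml
# Illustrative values for the GitLab Runner Helm chart (not the real SKAO file).
concurrent: 100                  # maximum number of jobs run at the same time
runners:
  # Embedded config.toml template used by the runner's Kubernetes executor.
  config: |
    [[runners]]
      [runners.kubernetes]
        namespace = "gitlab"     # job pods are created in this namespace
        cpu_request = "2"
        memory_request = "8G"
        memory_limit = "16Gi"
        # Schedule job pods only on nodes labelled for CI/CD jobs.
        [runners.kubernetes.node_selector]
          "ci-node" = "true"     # hypothetical label name
```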

With this approach, GitLab Runners have proven to be a viable option for use in a cluster, with auto-scaling and easy management.

Docker Support

Docker can be used in CI/CD jobs as with normal runners. Note that docker-compose cannot be used in conjunction with Kubernetes. Follow the instructions on the developer portal to set up your repository.

To alleviate some of the security concerns listed below with using Docker in Docker, another Docker daemon is deployed on the nodes. This daemon is then used as the default Docker daemon in the runner pods.

Kubernetes Support

Kubernetes clusters can be created in CI/CD jobs. These clusters are created on the ci-worker nodes and destroyed at the end of the job.

Note: in order to deploy clusters, the account permissions need to be set up correctly for the runner services.

STFC Cloud Kubernetes Clusters

For development purposes, STFC-backed clusters are the preferred method of deployment and testing, using GitLab to deploy workloads into the clusters. Currently we have two clusters, both with the same capabilities (GitLab and Vault integrations, BinderHub, etc.):

  • techops - Main cluster used by the whole project for CI/CD. It has limited support for GPUs and is mainly used to build artefacts that require GPUs.

  • dp - Cluster used by the DP ART that provides more GPUs to run actual workloads.

Note

The runner tags below are set automatically for SKAO-provided templates (Pipeline Machinery) using variables. We therefore advise against setting them manually; use the variables below when needed:

  • SKA_DEFAULT_RUNNER: The default runner to be used. Defaults to ska-default.

  • SKA_K8S_RUNNER: The default k8s runner to be used. Defaults to ska-k8s.

  • SKA_GPU_RUNNER: The default GPU runner to be used. Defaults to ska-gpu-a100.
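A minimal sketch of how these variables might be referenced in a custom job. The job names and script commands are hypothetical, and the sketch assumes the variables are available in the pipeline (for example through the SKAO templates).

```yaml
# Hypothetical jobs; only the SKA_* variable names come from the list above.
unit-test:
  tags:
    - $SKA_DEFAULT_RUNNER    # resolves to ska-default unless overridden
  script:
    - echo "running on the default runner"   # placeholder command

gpu-test:
  tags:
    - $SKA_GPU_RUNNER        # resolves to ska-gpu-a100 unless overridden
  script:
    - echo "running on the GPU runner"       # placeholder command
```

Referencing the variable rather than the literal tag means a job keeps working if the default runner tag changes.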

STFC Techops

Nodes

| Type | Amount | CPU | Memory | GPU | Kubernetes Version | OS Version | Kernel Version | GPU Driver Version |
|---|---|---|---|---|---|---|---|---|
| stfc-techops-production-cicd-md-0* | 51 | 16 | 64GiB | 0 | v1.32.2 | Ubuntu 22.04.1 LTS | 5.15.0-48-generic | - |
| stfc-techops-production-cicd-md-1* | 10 | 32 | 124GiB | 0 | v1.32.2 | Ubuntu 22.04.1 LTS | 5.15.0-48-generic | - |
| stfc-techops-production-cicd-gpu-* | 1 | 30 | 100GiB | 1 (NVIDIA A100-PCIE-40GB) | v1.32.2 | Ubuntu 22.04.1 LTS | 5.15.0-48-generic | Cuda: 12.7, Driver: 565.57.01 |
| stfc-techops-production-cicd-md-2-* | 1 | 64 | 230GiB | 0 | v1.32.2 | Ubuntu 22.04.1 LTS | 5.15.0-48-generic | - |
| stfc-techops-production-cicd-buildx-md-0-* | 2 | 60 | 250GiB | 0 | v1.32.2 | Ubuntu 22.04.1 LTS | 5.15.0-48-generic | - |

Runners

| Runner | Tag | CPU | Memory | CPU Limit | Memory Limit | GPUs Available | Concurrent Jobs |
|---|---|---|---|---|---|---|---|
| ska-default-runner | ska-default | 2 | 8G | Unlimited (~16) | 16Gi | 0 | 100 |
| ska-k8s-runner | ska-k8s | 2 | 8G | Unlimited (~16) | 16Gi | 0 | 42 |
| ska-default-large-runner | ska-default-large | 4 | 16G | Unlimited (~30) | 32Gi | 0 | 5 |
| ska-default-xlarge-runner | ska-default-xlarge | 16 | 64G | Unlimited (~30) | 128Gi | 0 | 2 |
| ska-gpu-a100-runner | ska-gpu-a100 | 2 | 8Gi | Unlimited (~30) | 16Gi | 1 | 1 |
| ska-buildx-runner | ska-buildx | 16 | 32Gi | Unlimited (~30) | 64Gi | 0 | 4 |

STFC DP

Nodes

| Type | Amount | CPU | Memory | GPU | Kubernetes Version | OS Version | Kernel Version | GPU Driver Version |
|---|---|---|---|---|---|---|---|---|
| stfc-dp-production-md-0 | 3 | 16 | 64GiB | 0 | v1.32.2 | Ubuntu 22.04.1 LTS | 5.15.0-48-generic | - |
| stfc-dp-production-md-1 | 6 | 30 | 128GiB | 0 | v1.32.2 | Ubuntu 22.04.1 LTS | 5.15.0-48-generic | - |
| stfc-dp-production-gpu-md-0 | 1 | 30 | 100GiB | 1 (NVIDIA A100-PCIE-40GB) | v1.32.2 | Ubuntu 22.04.1 LTS | 5.15.0-88-generic | Cuda: 12.7, Driver: 565.57.01 |
| stfc-dp-production-gpu-md-1 | 1 | 28 | 210GiB | 2 (NVIDIA A100-PCIE-40GB) | v1.32.2 | Ubuntu 22.04.1 LTS | 5.15.0-88-generic | Cuda: 12.7, Driver: 565.57.01 |

Runners

| Runner | Tag | CPU | Memory | CPU Limit | Memory Limit | GPUs Available | Concurrent Jobs |
|---|---|---|---|---|---|---|---|
| ska-dp-default-runner | ska-dp-default | 2 | 8Gi | Unlimited (~16) | 16Gi | 0 | 20 |
| ska-dp-gpu-a100-runner | ska-dp-gpu-a100 | 2 | 8Gi | Unlimited (~16) | 16Gi | 3 | 20 |

Deploy to GPU nodes

Using the GPU Runner

To run a job on a GPU runner, set the tag on your GitLab job to one of the available GPU tags:

  • techops - ska-gpu-a100

  • dp - ska-dp-gpu-a100
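For illustration, a job that targets the dp cluster's GPU runner by setting the tag explicitly could look like the sketch below. The job name is hypothetical, and the script assumes the job image ships the NVIDIA command-line tools.

```yaml
gpu-smoke-test:              # hypothetical job name
  tags:
    - ska-dp-gpu-a100        # use ska-gpu-a100 to target the techops cluster instead
  script:
    - nvidia-smi             # lists the GPUs visible to the job, if the tool is installed
```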

Deploy pods to GPU nodes

If you have pods that need to run on GPU nodes, they must have special configurations:

Pod configurations:

  • nodeSelector: This must be set to force the pods to be scheduled on GPU nodes.

nodeSelector:
  "nvidia.com/gpu": "true"

  • tolerations: The GPU nodes are tainted, so the pods must tolerate that taint.

tolerations:
  - key: "nvidia.com/gpu"
    value: "true"
    effect: "NoSchedule"

Container configurations:

  • resource limits and requests: Needed to claim a given number of GPU instances, just like any other resource.

resources:
  limits:
    cpu: ...
    memory: ...
    nvidia.com/gpu: "<number of GPUs>"
  requests:
    cpu: ...
    memory: ...
    nvidia.com/gpu: "<number of GPUs>"

  • runtimeClassName: This must be set (in the pod spec) so that the container runtime knows which profile to use to bind GPUs to a pod container.

runtimeClassName: "nvidia"
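Putting the fragments above together, a complete Pod manifest might look like the following sketch. The pod name, container name, image and the CPU and memory figures are illustrative assumptions; only the GPU-related keys and values come from the snippets above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example                    # hypothetical pod name
spec:
  runtimeClassName: "nvidia"           # pod-level field: selects the NVIDIA runtime profile
  nodeSelector:
    "nvidia.com/gpu": "true"           # schedule only on GPU nodes
  tolerations:
    - key: "nvidia.com/gpu"            # tolerate the taint on GPU nodes
      value: "true"
      effect: "NoSchedule"
  containers:
    - name: main                       # hypothetical container name
      image: registry.example.com/my-gpu-app:latest   # placeholder image
      resources:
        limits:
          cpu: "2"                     # illustrative value
          memory: 8Gi                  # illustrative value
          nvidia.com/gpu: "1"          # claim one GPU
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"          # GPU requests must equal GPU limits
```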

Build multi-platform images

Using the BuildX runner

The OCI build template now also supports ARM architectures. It is provided in gitlab-ci/includes/oci-image.gitlab-ci.yml. It creates dedicated oci-image-build-armv5 and oci-image-build-armv8 jobs (alongside the existing oci-image-build job) that execute on the ska-buildx runners when the matching variables are enabled.

To build for ARMv5, add the include and the variable to your .gitlab-ci.yml:

include:
  - project: 'ska-telescope/templates-repository'
    file: 'gitlab-ci/includes/oci-image.gitlab-ci.yml'

variables:
  OCI_USE_PLATFORM_ARMV5: "true"

Switch to ARMv8 builds by setting OCI_USE_PLATFORM_ARMV8: "true" instead. Only one of the flags should be enabled at a time. When either flag is present, the template schedules the relevant job on the BuildX runner pool; the runner exposes the Docker BuildX plugins and pre-registers QEMU emulation for the requested platform.

Alternatively, if you are maintaining custom jobs, add the ska-buildx tag so the job lands on the BuildX-enabled pool:

tags:
  - ska-buildx
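As a hedged sketch, a custom job using that tag to build an ARM64 image with Docker BuildX could look like this. The job name is hypothetical, and the example assumes the job is already authenticated against the target registry and that the image is pushed to the GitLab project registry, which may differ from the registry your project actually uses.

```yaml
build-arm64-image:           # hypothetical job name
  stage: build
  tags:
    - ska-buildx             # lands the job on the BuildX-enabled pool
  script:
    # --platform selects the target architecture; --push uploads the built image.
    - docker buildx build --platform linux/arm64 --tag "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" --push .
```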

The BuildX worker-group manifests and GitLab runner configuration can be reused from ska-ser-infra-machinery/resources/templates/buildx-* when onboarding another cluster or environment.