GPU Pipelines and Workloads
This section describes requirements and guidelines for deployment and testing of a new Python project using GPUs on GitLab. The basic guidelines build upon those of the Python Coding Guidelines, but are specific to the GPU environment and describe how to specify a GPU runner for the pipeline jobs and how to deploy a workload on a GPU node in the cluster using a Kubernetes chart deployment.
Running pipeline jobs on a GPU node
A template for a pipeline job on a GPU node is provided at gitlab-ci/includes/gpu.gitlab-ci.yml.
This template adds a new test stage to the pipeline, which runs the workload on the GPU node. To use this template, add the following to your .gitlab-ci.yml file:
```yaml
include:
  # GPU
  - project: 'ska-telescope/templates-repository'
    file: 'gitlab-ci/includes/gpu.gitlab-ci.yml'
```
You will probably also want to add the following to your .gitlab-ci.yml file, specifying that the non-GPU pipeline tests should not run when you are not using a GPU:
```yaml
include:
  # Python
  - project: 'ska-telescope/templates-repository'
    file: 'gitlab-ci/includes/python.gitlab-ci.yml'
```
Alternatively, if you don’t want to use the provided GPU template, any job in your pipeline can be configured to run on the GPU node by adding the following to the job definition:
```yaml
tags:
  - k8srunner-gpu-v100
```
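For example, a complete job using that tag might look like the following sketch; the job name and script line are illustrative, not part of the template:

```yaml
# Illustrative job definition; adjust the name and script to your project.
gpu-test:
  stage: test
  tags:
    - k8srunner-gpu-v100
  script:
    - pytest -m gputest   # run only the tests carrying the gputest marker
```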
The unit tests themselves should be marked with the gputest marker:
```python
import pytest

import dummy  # module providing the CUDA function under test


@pytest.mark.gputest
def test_cuda():
    """A dummy test for a CUDA function."""
    test = dummy.cuda_dummy_function()
    assert test == "cuda-function"
```
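Because such tests can only pass on a GPU node, it can be useful to skip them gracefully elsewhere. One possible approach is a small helper like the sketch below; the name gpu_available is hypothetical, and it assumes any usable GPU node exposes the NVIDIA driver tools on the PATH:

```python
import shutil


def gpu_available() -> bool:
    """Heuristically report whether a GPU is usable.

    Assumption: a node with a working NVIDIA driver has the
    nvidia-smi command-line tool on its PATH.
    """
    return shutil.which("nvidia-smi") is not None
```

A test could then combine `@pytest.mark.gputest` with `@pytest.mark.skipif(not gpu_available(), reason="no GPU available")` so it is skipped, rather than failed, on CPU-only runners.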
Deploying a workload on a GPU node
The STENCIL project provides a template deployment chart that can be used to deploy a workload on a GPU node.
All that’s needed to deploy the existing chart is to issue the command:
```shell
make k8s-install-chart
```
If you want to create your own chart that deploys a workload to a GPU node, you need to define the following in addition to the usual steps for a CPU workload:
In the values.yaml file:
```yaml
# [...]
image:
  repository: nvidia/cuda  # The image to use
  tag: "11.0-base"         # The tag to use, if needed; otherwise leave it empty (i.e. "")
# [...]
resources:
  limits:
    nvidia.com/gpu: 1      # Maximum number of GPUs (an integer; reserves a full physical device)
  requests:
    nvidia.com/gpu: 1      # Minimum number of GPUs (an integer; reserves a full physical device)
# [...]
# The GPU nodes carry a taint that keeps purely CPU workloads off them.
# The following toleration allows this workload to be scheduled despite that taint:
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
```
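For context, a toleration is matched against the taints set on the node itself. The toleration shown for the GPU nodes would match a node taint of the following (illustrative) form, which is managed by the cluster administrators:

```yaml
# Illustrative node spec excerpt; the actual taint is applied by the cluster administrators.
spec:
  taints:
    - key: "nvidia.com/gpu"
      value: "true"
      effect: "NoExecute"
```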
NOTE: The GPU resources are scarce. Reserving 1 GPU uses a full physical device for your workload and can quickly exhaust the available GPU resources.
In the deployment.yaml file:
```yaml
# [...]
spec:
  template:
    spec:
      runtimeClassName: "nvidia"
```
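Putting the pieces together, the pod spec rendered by the chart would contain roughly the following. This is a sketch assuming the values shown above; the container name is illustrative, and field names follow the standard Kubernetes Deployment schema:

```yaml
# Sketch of the relevant parts of the rendered Deployment.
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      runtimeClassName: "nvidia"
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "true"
          effect: "NoExecute"
      containers:
        - name: gpu-workload          # illustrative name
          image: nvidia/cuda:11.0-base
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
```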
Under normal circumstances, the container is deleted after the workload finishes. If you need to remove the deployed chart manually, issue the following command:
```shell
make k8s-uninstall-chart
```
Summary
This basic template project is available on GitLab and demonstrates the following:
- Functions and unit tests that run on a GPU worker-node runner by including the GPU GitLab CI/CD template.
- An example chart that deploys a workload to a GPU node.