GPU Pipelines and Workloads
This section describes requirements and guidelines for deployment and testing of a new Python project using GPUs on GitLab. The basic guidelines build upon those of the Python Coding Guidelines, but are specific to the GPU environment and describe how to specify a GPU runner for the pipeline jobs and how to deploy a workload on a GPU node in the cluster using a Kubernetes chart deployment.
Running pipeline jobs on a GPU node
A template for a pipeline job on a GPU node is provided in the ska-telescope/templates-repository project (gitlab-ci/includes/gpu.gitlab-ci.yml).
This template adds a new test stage to the pipeline, which runs the workload on the GPU node.
To use this template, add the following to your .gitlab-ci.yml file:
    include:
      # GPU
      - project: 'ska-telescope/templates-repository'
        file: 'gitlab-ci/includes/gpu.gitlab-ci.yml'
You will probably also want to add the following to your .gitlab-ci.yml file, specifying that the non-GPU pipeline tests should not be run if you are not using a GPU:
    include:
      # Python
      - project: 'ska-telescope/templates-repository'
        file: 'gitlab-ci/includes/python.gitlab-ci.yml'
Alternatively, if you don't want to use the provided GPU template, any job in your pipeline can be configured to run on the GPU node by adding the following to the job definition:
    tags:
      - k8srunner-gpu-v100
The unit tests themselves should be marked with the gputest marker:
    @pytest.mark.gputest
    def test_cuda():
        """A dummy test for a cuda function"""
        test = dummy.cuda_dummy_function()
        assert test == "cuda-function"
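If the same test suite is also run on machines without a GPU, it can help to register the gputest marker and skip the marked tests there. The following conftest.py is a minimal sketch under assumptions: the helper name gpu_available is hypothetical, and detecting a GPU by looking for nvidia-smi on the PATH is only a rough heuristic. Whether the GPU template already provides equivalent behaviour is not stated here.

```python
# conftest.py: a sketch, not the configuration shipped with the GPU template
import shutil


def gpu_available(which=shutil.which):
    """Return True if nvidia-smi is on the PATH (a rough heuristic only)."""
    return which("nvidia-smi") is not None


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "gputest: test requires a GPU node")


def pytest_collection_modifyitems(config, items):
    # Imported lazily so the helper above can be used without pytest installed.
    import pytest

    if gpu_available():
        return
    skip_gpu = pytest.mark.skip(reason="no GPU detected (nvidia-smi not found)")
    for item in items:
        if item.get_closest_marker("gputest") is not None:
            item.add_marker(skip_gpu)
```

With this in place, running pytest -m gputest selects only the GPU tests, and running pytest -m "not gputest" excludes them.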
Deploying a workload on a GPU node
The STENCIL project provides a template deployment chart that can be used to deploy a workload on a GPU node.
All that’s needed to deploy the existing chart is to issue the command:
If you want to create your own chart that deploys a workload to a GPU node, you need to define the following in addition to the usual steps needed for a CPU workload:
    # [...]
    image:
      repository: nvidia/cuda  # The image to use
      tag: "11.0-base"  # The tag to use if needed. Otherwise, leave the tag empty (i.e. "")
    # [...]
    resources:
      limits:
        nvidia.com/gpu: 1  # The maximum number of GPUs to use (this number is an integer and reserves a full physical device)
      requests:
        nvidia.com/gpu: 1  # The minimum number of GPUs to use (this number is an integer and reserves a full physical device)
    # [...]
    # The GPU nodes have a taint that prevents purely CPU workloads from being scheduled on the GPU nodes.
    # The following toleration allows this workload to be scheduled despite that taint:
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Equal"
        value: "true"
        effect: "NoExecute"
NOTE: The GPU resources are scarce. Reserving 1 GPU uses a full physical device for your workload and can quickly exhaust the available GPU resources.
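Because a mistake in the GPU resources block either fails at deployment time or ties up a scarce device, it may be worth checking the values before deploying. The sketch below assumes the resources mapping has already been loaded into a Python dict (for example from the chart's values file); the function name check_gpu_resources is hypothetical. It encodes the Kubernetes rules for extended resources such as GPUs: whole numbers only, a request is not allowed without a limit, and a request must equal the limit when both are given. The requirement of at least one GPU is this sketch's own assumption for a GPU workload.

```python
def check_gpu_resources(resources, key="nvidia.com/gpu"):
    """Return a list of problems found in a container `resources` mapping."""
    problems = []
    limit = resources.get("limits", {}).get(key)
    request = resources.get("requests", {}).get(key)

    if limit is None:
        # Kubernetes does not accept a GPU request without a limit.
        if request is not None:
            problems.append(f"{key}: a request needs a matching limit")
        return problems

    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 1:
        problems.append(f"{key}: limit must be a whole number >= 1, got {limit!r}")
    if request is not None and request != limit:
        problems.append(f"{key}: request ({request!r}) must equal limit ({limit!r})")
    return problems


# The values from the chart snippet above pass the check.
example = {"limits": {"nvidia.com/gpu": 1}, "requests": {"nvidia.com/gpu": 1}}
print(check_gpu_resources(example))
```

An empty list means no problem was found. A fractional value such as 0.5 is reported, since a GPU cannot be shared by requesting a fraction of it.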
    # [...]
    spec:
      template:
        spec:
          runtimeClassName: "nvidia"
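One way to confirm that the resource limit, toleration and runtime class took effect is to have the workload list the GPUs it can see when it starts. The following is a sketch under assumptions: the function name list_gpus is hypothetical, and it relies on the nvidia-smi tool being present inside the container, which depends on the image and runtime in use.

```python
import shutil
import subprocess


def list_gpus(run=subprocess.run, which=shutil.which):
    """Return one description line per visible GPU, or [] if none are found."""
    if which("nvidia-smi") is None:
        return []
    # `nvidia-smi -L` prints one line per GPU visible to the container.
    result = run(["nvidia-smi", "-L"], capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = list_gpus()
    if not gpus:
        raise SystemExit("No GPU visible: check resources, tolerations and runtimeClassName")
    for gpu in gpus:
        print(gpu)
```

Exiting with an error when no GPU is visible makes a misconfigured deployment fail early instead of silently running on the CPU.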
Under normal circumstances, the container should be deleted after the workload has finished. If you need to remove the deployed chart manually, issue the following command:
This basic template project is available on GitLab and demonstrates the following:
Provides functions and unit tests that run on a GPU worker node runner by calling the GPU GitLab CI/CD template.
Defines an example chart that deploys a workload to a GPU node.