Functionality and Usage

Environment Configuration

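To connect to a Slurm cluster, the deployer reads the following environment variables. Which of the authentication variables are required depends on the authentication method described under Authentication.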

Environment Variable             Description
-------------------------------  -----------
SDP_SLURMDEPLOY_SLURM_URL        API URL of the Slurm cluster to connect to
SDP_SLURMDEPLOY_CLIENT_ID        JWKS username used to access the Slurm cluster
SDP_SLURMDEPLOY_AZURE_URL        Address of the JWKS which issues JWTs
SDP_SLURMDEPLOY_CLIENT_SECRET    Secret used to request a JWT from the JWKS, which is then used to access the Slurm cluster
SDP_SLURMDEPLOY_SLURM_JWT        User-generated JWT

Key Components

The following modules make up the core of the SDP Slurm Deployer:

Module                 Description
---------------------  -----------
slurmdeploy.py         Entrypoint. Loads configuration, initializes components, and starts the async event loop to handle deployment events.
configdbmanager.py     Watches the SDP Configuration Database for new or updated Slurm deployments and triggers events to the DeploymentManager.
deploymentmanager.py   Processes deployment events, constructs jobs tagged with mcs_label, submits them via SlurmService, and handles the job lifecycle.
slurmservice.py        Interfaces with the SLURM REST API for job submission, status queries, and cancellation. Filters jobs by mcs_label and handles authentication. Supports SLURM REST API versions 0.0.39 and 0.0.40.
token.py               Interfaces with Azure Entra to obtain a JWT, which serves as the authentication credential for the Slurm REST API.

Authentication

In order to submit a slurm job to the Slurm REST API, a recognised Username and JWT (JSON Web Token) must be passed with the request.

This can be achieved in two ways: directly with a JWT, or indirectly via a JSON Web Key Set (JWKS). The Slurm deployer tries the JWT method first; if the required environment variables are not supplied, it falls back to the JWKS method.

The JWT approach requires a username (the client ID) and a JWT associated with that username, provided via the following environment variables:

  • SDP_SLURMDEPLOY_CLIENT_ID

  • SDP_SLURMDEPLOY_SLURM_JWT

For the JWKS approach, the username is the client ID of the Slurm deployer app. The JWT is issued by a JWKS in Azure Entra. The credentials used to request this JWT are the client ID and client secret, which are set via the following environment variables:

  • SDP_SLURMDEPLOY_AZURE_URL

  • SDP_SLURMDEPLOY_CLIENT_ID

  • SDP_SLURMDEPLOY_CLIENT_SECRET
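The selection between the two methods can be sketched as follows. This is a minimal illustration, not the code in token.py or slurmservice.py: the names AuthChoice, choose_auth and slurm_auth_headers are hypothetical, and raising an error when neither set of variables is complete is an assumption. The header names X-SLURM-USER-NAME and X-SLURM-USER-TOKEN are the ones slurmrestd reads for JWT authentication; whether the deployer builds its headers exactly this way is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class AuthChoice:
    """Hypothetical result type: which method applies and what is known so far."""
    method: str            # "jwt" (direct) or "jwks" (token requested from Azure Entra)
    username: str
    token: Optional[str]   # only known up front for the direct JWT method


def choose_auth(env: Mapping[str, str]) -> AuthChoice:
    """Prefer the direct JWT method; otherwise fall back to JWKS."""
    client_id = env.get("SDP_SLURMDEPLOY_CLIENT_ID")
    jwt = env.get("SDP_SLURMDEPLOY_SLURM_JWT")
    if client_id and jwt:
        return AuthChoice("jwt", client_id, jwt)

    # JWKS fallback: all three variables are needed to request a token.
    required = (
        "SDP_SLURMDEPLOY_AZURE_URL",
        "SDP_SLURMDEPLOY_CLIENT_ID",
        "SDP_SLURMDEPLOY_CLIENT_SECRET",
    )
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Assumption: fail loudly rather than submit unauthenticated requests.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return AuthChoice("jwks", env["SDP_SLURMDEPLOY_CLIENT_ID"], None)


def slurm_auth_headers(username: str, token: str) -> Dict[str, str]:
    """Build the HTTP headers slurmrestd reads for JWT authentication."""
    return {
        "X-SLURM-USER-NAME": username,
        "X-SLURM-USER-TOKEN": token,
    }
```

In a running process this would typically be called as choose_auth(os.environ). For the JWKS method the token still has to be requested from Azure Entra before the headers can be built.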

Job Configuration

Slurm jobs are configured via environment variables and the SDP deployment submitted by the processing script. Deployments can override the defaults set in the Slurm deployer.

By default, the following environment variables are passed to the slurm jobs:

Job environment variable  Default
------------------------  -------
PATH                      /usr/local/bin:/usr/bin:/bin (loaded from the SDP_SLURMDEPLOY_PATH environment variable)
LD_LIBRARY_PATH           /lib/:/lib64/:/usr/local/lib
SLURM_SUBMIT_DIR          current_working_directory set in the SDP deployment, otherwise None
HOME                      /home/{self.username} (shown as /home/None when the username is unset)

In addition, any project environment variables starting with “SDP_SLURM_ENV_” are also loaded into the job environment, with the “SDP_SLURM_ENV_” prefix stripped from the variable name, e.g.:

KUBECONFIG = os.getenv("SDP_SLURM_ENV_KUBECONFIG")
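A minimal sketch of how such a job environment could be assembled is shown below. The function name collect_job_env is hypothetical, only the PATH and LD_LIBRARY_PATH defaults are included, and letting prefixed variables take precedence over the defaults is an assumption rather than documented behaviour.

```python
from typing import Dict, Mapping

PREFIX = "SDP_SLURM_ENV_"
DEFAULT_PATH = "/usr/local/bin:/usr/bin:/bin"
DEFAULT_LD_LIBRARY_PATH = "/lib/:/lib64/:/usr/local/lib"


def collect_job_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Build a job environment from defaults plus prefixed variables."""
    job_env = {
        # PATH comes from SDP_SLURMDEPLOY_PATH when set, else the default.
        "PATH": env.get("SDP_SLURMDEPLOY_PATH", DEFAULT_PATH),
        "LD_LIBRARY_PATH": DEFAULT_LD_LIBRARY_PATH,
    }
    for name, value in env.items():
        # Skip the bare prefix itself, which would yield an empty name.
        if name.startswith(PREFIX) and len(name) > len(PREFIX):
            # Assumption: prefixed variables override the defaults above.
            job_env[name[len(PREFIX):]] = value
    return job_env
```

Called as collect_job_env(os.environ), a process started with SDP_SLURM_ENV_KUBECONFIG set would pass KUBECONFIG to the job with the same value.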

Job Labelling

To ensure jobs can be tracked per deployment instance, each job submitted by the deployer is tagged with a unique mcs_label. This label is automatically generated using the following environment variables:

  • SDP_SLURMDEPLOY_RELEASE_NAME

  • SDP_SLURMDEPLOY_NAMESPACE

Label format:

sdp_slurm_deployer_{SDP_SLURMDEPLOY_RELEASE_NAME}_{SDP_SLURMDEPLOY_NAMESPACE}

This labeling ensures that only jobs associated with the current deployment are queried or cancelled, avoiding conflicts with other users or deployments.
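The label format and the filtering it enables can be sketched as follows. The function names are hypothetical, and the job records are simplified dictionaries carrying an mcs_label key; the exact shape of the records returned by the SLURM REST API is not reproduced here.

```python
from typing import Dict, List, Mapping


def build_mcs_label(env: Mapping[str, str]) -> str:
    """Compose the label from the release name and namespace variables."""
    release = env["SDP_SLURMDEPLOY_RELEASE_NAME"]
    namespace = env["SDP_SLURMDEPLOY_NAMESPACE"]
    return f"sdp_slurm_deployer_{release}_{namespace}"


def jobs_for_label(jobs: List[Dict[str, str]], label: str) -> List[Dict[str, str]]:
    """Keep only the jobs tagged with this deployer instance's label."""
    return [job for job in jobs if job.get("mcs_label") == label]
```

With a release name of test and a namespace of sdp, the label is sdp_slurm_deployer_test_sdp, and jobs carrying any other label (or none) are ignored.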

Job Monitoring

The SDP Slurm Deployer tracks the status of jobs it submits to SLURM by querying at regular intervals.

Job states such as PENDING, RUNNING, COMPLETED, and FAILED are fetched and matched to deployments defined in the SDP Configuration Database. This mapping allows the deployer to:

  • Update internal deployment state

  • Detect failed or missing jobs

  • Cancel jobs if a deployment is removed
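Detecting missing jobs and jobs to cancel amounts to comparing two sets. The sketch below is illustrative only: the function name reconcile is hypothetical, and it assumes each job can be matched to a deployment through a shared identifier, which is not specified here.

```python
from typing import Iterable, Set, Tuple


def reconcile(
    deployment_ids: Iterable[str], job_ids: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Compare deployments in the configuration database with known jobs.

    Returns (missing, orphaned):
      missing  - deployments that have no corresponding job
      orphaned - jobs whose deployment no longer exists (candidates to cancel)
    """
    wanted = set(deployment_ids)
    present = set(job_ids)
    return wanted - present, present - wanted
```

Running such a comparison on each polling interval would give the deployer the two lists it needs to act on.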

Below is the mapping of SLURM job states to their corresponding SDP deployment states:

SLURM Job State   SDP Deployment State
---------------   --------------------
RUNNING           RUNNING
PENDING           PENDING
COMPLETED         FINISHED
BOOT_FAIL         FAILED
DEADLINE          FAILED
FAILED            FAILED
NODE_FAIL         FAILED
OUT_OF_MEMORY     FAILED
PREEMPTED         FAILED
SUSPENDED         FAILED
TIMEOUT           FAILED
CANCELLED         CANCELLED

Monitoring is handled in the DeploymentManager via the fetch_deployments_and_slurm_jobs() method, with API interactions abstracted in SlurmService.
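The state mapping can be expressed as a lookup table. This is a sketch rather than the DeploymentManager's code: the names SLURM_TO_SDP_STATE and sdp_state are hypothetical, and returning None for a state outside the table is an assumption.

```python
from typing import Optional

# SLURM job state -> SDP deployment state, as listed in the table above.
SLURM_TO_SDP_STATE = {
    "RUNNING": "RUNNING",
    "PENDING": "PENDING",
    "COMPLETED": "FINISHED",
    "BOOT_FAIL": "FAILED",
    "DEADLINE": "FAILED",
    "FAILED": "FAILED",
    "NODE_FAIL": "FAILED",
    "OUT_OF_MEMORY": "FAILED",
    "PREEMPTED": "FAILED",
    "SUSPENDED": "FAILED",
    "TIMEOUT": "FAILED",
    "CANCELLED": "CANCELLED",
}


def sdp_state(slurm_state: str) -> Optional[str]:
    """Translate a SLURM job state; None means the state is not in the table."""
    return SLURM_TO_SDP_STATE.get(slurm_state.upper())
```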

SDP Deployment

The SDP Slurm Deployer can be deployed in the SDP system by enabling it during the installation. This allows the deployer to submit slurm jobs to the SLURM cluster as needed.

To enable the Slurm Deployer during installation, append the following --set argument to your helm upgrade command:

--set slurmdeploy.enabled=true

For detailed instructions on installing the SDP, refer to the SDP Installation Guide.