Slurm Deployer API

Main

Slurm Deployer Module

This module provides the main deployment service for managing Slurm jobs in the SDP system. It handles the continuous monitoring and management of deployments between the Configuration Database and Slurm, including job submission, state tracking, and lifecycle management.

Environment Variables

SDP_LOG_LEVEL : str, optional

Logging level for the deployer (default: ‘DEBUG’)

SDP_SLURMDEPLOY_LOOP_INTERVAL : int, optional

Time interval in seconds between deployment checks (default: 60)

SDP_SLURMDEPLOY_LIVENESS_FILE : str, optional

Path to the liveness probe file (default: ‘/tmp/alive’)

SDP_SLURMDEPLOY_SLURM_URL : str

URL of the Slurm REST API

SDP_SLURMDEPLOY_AZURE_URL : str

URL to fetch JWT from Azure

SDP_SLURMDEPLOY_CLIENT_ID : str

Identifier for slurm deployer for Azure and slurm user mapping

SDP_SLURMDEPLOY_CLIENT_SECRET : str

Secret for accessing Azure
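As a minimal sketch of how the optional variables above could be resolved, assuming plain os.environ lookups: the load_settings helper and its dictionary keys are hypothetical and are not part of the module.

```python
import os

# Defaults copied from the variable list above.
DEFAULTS = {
    "SDP_LOG_LEVEL": "DEBUG",
    "SDP_SLURMDEPLOY_LOOP_INTERVAL": "60",
    "SDP_SLURMDEPLOY_LIVENESS_FILE": "/tmp/alive",
}


def load_settings(env=None):
    """Hypothetical helper: resolve the optional variables with their defaults."""
    env = os.environ if env is None else env
    resolved = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return {
        "log_level": resolved["SDP_LOG_LEVEL"],
        # The interval is documented as an integer number of seconds.
        "loop_interval": int(resolved["SDP_SLURMDEPLOY_LOOP_INTERVAL"]),
        "liveness_file": resolved["SDP_SLURMDEPLOY_LIVENESS_FILE"],
    }
```

Passing a plain dictionary instead of os.environ makes the lookup easy to exercise in tests.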

exception ska_sdp_slurmdeploy.slurmdeploy.MissingAuthEnvVar

Exception raised when the authentication method for configuring the slurm service is not defined.

This exception means that either the given auth method is not supported, or it was not defined at all.

ska_sdp_slurmdeploy.slurmdeploy.configure_slurmservice()

The Slurm service can be configured to authenticate with ‘JWT’ or ‘JWKS’. Each choice requires further environment variables to be set.

  • JWT requires: SDP_SLURMDEPLOY_SLURM_URL, SDP_SLURMDEPLOY_CLIENT_ID, SDP_SLURMDEPLOY_SLURM_JWT

  • JWKS requires the following to create a JWT: SDP_SLURMDEPLOY_AZURE_URL, SDP_SLURMDEPLOY_CLIENT_ID, SDP_SLURMDEPLOY_CLIENT_SECRET

  • JWKS then requires the following to configure the service: SDP_SLURMDEPLOY_SLURM_URL, SDP_SLURMDEPLOY_CLIENT_ID
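The requirements above could be checked before start-up roughly as follows. This is a sketch only: the missing_vars helper is hypothetical, the way the auth method itself is selected is not described in this reference, and ValueError stands in for MissingAuthEnvVar.

```python
import os

# Variables required by each authentication method, as listed above.
REQUIRED_VARS = {
    "JWT": (
        "SDP_SLURMDEPLOY_SLURM_URL",
        "SDP_SLURMDEPLOY_CLIENT_ID",
        "SDP_SLURMDEPLOY_SLURM_JWT",
    ),
    "JWKS": (
        "SDP_SLURMDEPLOY_AZURE_URL",
        "SDP_SLURMDEPLOY_CLIENT_ID",
        "SDP_SLURMDEPLOY_CLIENT_SECRET",
        "SDP_SLURMDEPLOY_SLURM_URL",
    ),
}


def missing_vars(auth_method, env=None):
    """Hypothetical helper: names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    if auth_method not in REQUIRED_VARS:
        # Stand-in for MissingAuthEnvVar, which the real module raises.
        raise ValueError(f"unsupported auth method: {auth_method!r}")
    return [name for name in REQUIRED_VARS[auth_method] if not env.get(name)]
```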

ska_sdp_slurmdeploy.slurmdeploy.main(backend=None)

Entry point for the Slurm deployer service.

This function initializes all necessary components and starts the main deployment service loop. It handles:

  1. Signal handler setup for graceful termination

  2. Configuration Database client initialization

  3. Slurm service setup with authentication

  4. Deployment manager initialization

  5. Main service loop execution

  6. Proper cleanup on exit

Parameters

backend : object, optional

Backend configuration for the Configuration Database client. If None, uses the default backend.

Notes

Required environment variables:

  • SDP_SLURMDEPLOY_SLURM_URL

  • SDP_SLURMDEPLOY_AZURE_URL

  • SDP_SLURMDEPLOY_CLIENT_ID

  • SDP_SLURMDEPLOY_CLIENT_SECRET

ska_sdp_slurmdeploy.slurmdeploy.slurm_deployer(config: Config)

Main deployment service loop for managing Slurm jobs.

This function runs continuously, monitoring the Configuration Database for new deployments and managing existing ones. For each iteration:

  1. Takes ownership of the Configuration Database transaction

  2. Creates a connection to the Slurm service

  3. Updates the liveness probe file

  4. Refreshes the connection to the Configuration Database

  5. Fetches current deployments and Slurm jobs

  6. Processes new deployments by submitting them to Slurm

  7. Monitors and updates states of existing deployments

Parameters

config : Config

Configuration Database client instance

Notes

  • Loop interval controlled by SDP_SLURMDEPLOY_LOOP_INTERVAL env var

  • Liveness probe file is touched in each iteration for health monitoring

  • Continues running until terminated by a SIGTERM signal
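The loop shape described in these notes can be sketched as below. The run_loop function and its step, should_stop and sleep parameters are hypothetical names used for illustration; the real service stops on SIGTERM rather than through a callback.

```python
import time
from pathlib import Path


def run_loop(step, liveness_file, interval, should_stop, sleep=time.sleep):
    """Hypothetical loop skeleton: touch the liveness file, run one pass, wait."""
    iterations = 0
    while not should_stop():
        # Health probe: create the file or refresh its modification time.
        Path(liveness_file).touch()
        # One pass over the deployments (fetch, submit new, update existing).
        step()
        iterations += 1
        sleep(interval)
    return iterations
```

Injecting the sleep function keeps the sketch testable without waiting for the real interval.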

ska_sdp_slurmdeploy.slurmdeploy.terminate(_signame, _frame)

Signal handler for graceful termination.

This function is called when the process receives a SIGTERM signal. It logs the termination request and exits the process cleanly.

Parameters

_signame : str

Name of the signal (unused)

_frame : frame

Current stack frame (unused)
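A handler of this kind is typically registered with the standard signal module. The sketch below is an assumption about the wiring, not the module's code: install_handler is a hypothetical name, and the exit code 0 and log message are illustrative. Note that Python passes the signal number as the first argument to a handler.

```python
import logging
import signal
import sys

LOG = logging.getLogger(__name__)


def terminate(_signame, _frame):
    """Log the request and exit; sys.exit raises SystemExit so cleanup can run."""
    LOG.info("Asked to terminate")
    sys.exit(0)


def install_handler():
    """Register the handler for SIGTERM (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, terminate)
```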

Slurm service

Slurm REST Service Module

This module provides a service layer for interacting with the Slurm workload manager through its REST API. It handles job submissions, status queries, and job information retrieval while abstracting the underlying HTTP communication details.

Environment Variables

SDP_SLURMDEPLOY_RELEASE_NAME : str, optional

The Helm release name (default: ‘slurm-deployer’)

SDP_SLURMDEPLOY_NAMESPACE : str, optional

The namespace where the application is deployed (default: ‘sdp’)

class ska_sdp_slurmdeploy.slurmservice.Slurm39(url: str, username: str, token: Token | str, timeout: int = 5)

Slurm API v0.0.39 usage.

class ska_sdp_slurmdeploy.slurmservice.Slurm40(url: str, username: str, token: Token | str, timeout: int = 5)

Slurm API v0.0.40 usage.

exception ska_sdp_slurmdeploy.slurmservice.SlurmException(errors: str | list, response)

Slurm Exception for API errors.

class ska_sdp_slurmdeploy.slurmservice.SlurmGeneric(url: str, username: str, token: Token | str, timeout: int = 5)

Basic layout for the Slurm API, based on API 0.0.39

cancel_job(job_id: int) → bool

Cancel a Slurm job by its job ID.

Param:

job_id: The Slurm Job ID

Returns:

True if the job was cancelled.

Raises:

SlurmException if command fails

filter_jobs_by_mcs_label(jobs: list[SlurmJob], mcs_label: str) → list[SlurmJob]

Filter a list of SlurmJob objects by mcs_label.

Parameters

jobs : list of SlurmJob

A list of SlurmJob objects.

mcs_label : str

The mcs_label to filter by.

Returns

list of SlurmJob

A list of SlurmJob objects that match the mcs_label.
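A minimal sketch of such a filter, assuming an exact string match on the label. The Job dataclass is a reduced stand-in for SlurmJob, and the function is written as a free function here rather than a method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    """Stand-in for SlurmJob with only the fields needed here."""

    name: str
    job_id: int
    job_state: str
    mcs_label: Optional[str] = None


def filter_jobs_by_mcs_label(jobs, mcs_label):
    """Keep only the jobs whose mcs_label equals the requested label."""
    return [job for job in jobs if job.mcs_label == mcs_label]
```

Jobs without a label (mcs_label is None) never match a string label under this assumption.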

get_job(job_id: int) → SlurmJob | None

Retrieve information about a specific Slurm job.

Param:

job_id: The Slurm Job ID

Returns:

Slurm job if found

Raises:

SlurmException if command fails

list_jobs() → list[SlurmJob] | None

Retrieve all jobs from the Slurm cluster.

Returns:

List of slurm jobs

Raises:

SlurmException if command fails

ping() → dict

Ping the server.

Returns:

A dictionary representing the server state

Raises:

SlurmException if command fails

post_job_submit(job_name: str, args: dict) → int | None

Submit a new job to the Slurm cluster.

Param:

job_name: The name to assign to the job.

args: The job configuration parameters

Returns:

The Job ID

Raises:

SlurmException if command fails

class ska_sdp_slurmdeploy.slurmservice.SlurmJob(name: str, job_id: int, job_state: str, job_state_reason: str | None = None, state_reason: str | None = None, state_description: str | None = None, mcs_label: str | None = None)

A dataclass representing a Slurm job and its current state.

job_id: int

Unique identifier assigned by Slurm

job_state: str

Current state of the job (e.g., ‘PENDING’, ‘RUNNING’, ‘COMPLETED’)

job_state_reason: str | None = None

This will almost always be None, but in some cases it might not be

mcs_label: str | None = None

The mcs label added to the Slurm job to identify jobs associated with a particular deployment of SDP

name: str

Name of the Slurm job

state_description: str | None = None

Optional details for state_reason

state_reason: str | None = None

The reason given for the current state

class ska_sdp_slurmdeploy.slurmservice.SlurmService(slurm_url: str, slurm_username: str, slurm_jwt: Token | str, timeout: int = 5)

The interface to manage the Slurm service.

cancel_job(job_id: int) → bool

Cancel a Slurm job by its job ID.

Param:

job_id: The Slurm Job ID

Returns:

True if the job was cancelled.

Raises:

SlurmException if command fails

get_job(job_id) → SlurmJob | None

Retrieve information about a specific Slurm job.

Parameters:

job_id (int) – The Slurm Job ID

Returns:

Slurm job if found and matches the deployment label, otherwise None.

Return type:

SlurmJob | None

get_jobs() → list[SlurmJob]

Retrieve all jobs from the Slurm cluster.

Returns:

List of slurm jobs

Raises:

SlurmException if command fails

ping()

Ping the server.

Returns:

A dictionary representing the server state

Raises:

SlurmException if command fails

post_job_submit(job_name: str, args: dict) → int

Submit a new job to the Slurm cluster.

Param:

job_name: The name to assign to the job.

args: The job configuration parameters

Returns:

The Job ID

Raises:

SlurmException if command fails

Deployment manager

SDP Slurm Deployment Manager Module

This module provides functionality to manage Slurm deployments and their states in the SDP Configuration Database. It handles the synchronization between Slurm job states and their corresponding deployment states in the Configuration Database.

class ska_sdp_slurmdeploy.deploymentmanager.DeploymentManager(slurmservice, configdb_manager)

Manager class for handling Slurm deployments and Configuration Database synchronization.

This class is responsible for managing the lifecycle of Slurm deployments and their corresponding states in the Configuration Database. It handles job submission, state updates, and synchronization between Slurm and the Configuration Database.

Attributes

slurmservice : object

Service object for interacting with Slurm

configdb_manager : ConfigDBManager

Manager for Configuration Database operations

slurm_jobs : set

Set of job names currently in Slurm

configdb_jobs : set

Set of deployment keys from Configuration Database

JOB_PREFIX = 'sdp'

cancel_slurm_job(deployment)

Cancel the Slurm job associated with a deployment.

Parameters

deployment : SlurmCancelDeployment

The deployment which needs to be cancelled.

static create_slurm_deployment_state(key, job: SlurmJob)

Create a deployment state object for a Slurm job.

Maps Slurm job states to SDP deployment states using the state mapping.

Parameters

key : str

Deployment key

job : SlurmJob

Slurm job object containing job information

Returns

SlurmDeploymentState

New deployment state object
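The state mapping itself is not reproduced in this reference. The sketch below shows the general idea under loud assumptions: every entry in STATE_MAP, the State stand-in class, the create_state helper, and the fallback to FAILED for unknown Slurm states are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical mapping from Slurm job states to SDP deployment states.
# The module's real table is not shown in this reference.
STATE_MAP = {
    "PENDING": "PENDING",
    "RUNNING": "RUNNING",
    "COMPLETED": "FINISHED",
    "CANCELLED": "CANCELLED",
    "FAILED": "FAILED",
}


@dataclass
class State:
    """Stand-in for SlurmDeploymentState with the core fields only."""

    key: str
    job_id: int
    status: str


def create_state(key, job_id, job_state):
    """Translate a Slurm job state; unknown states fall back to FAILED (assumed)."""
    return State(key=key, job_id=job_id, status=STATE_MAP.get(job_state, "FAILED"))
```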

fetch_deployments_and_slurm_jobs()

Refresh the lists of deployments from Configuration Database and jobs from Slurm.

This method updates the internal sets of jobs from both systems for tracking and synchronization purposes.

Raises

SlurmFetchJobException

If unable to retrieve job information from Slurm

get_deployments_to_cancel() → Generator[SlurmWatchDeployment, None, None]

Get deployments to cancel.

get_deployments_to_watch() → Generator[SlurmWatchDeployment, None, None]

Generate deployments that need to be monitored.

Yields deployments that exist in both Slurm and the Configuration Database for state monitoring.

Yields

SlurmDeployment

Deployment objects that need to be monitored

get_new_deployments() → Generator[SlurmWatchDeployment, None, None]

Generate new Slurm deployments that need to be processed.

Yields deployments that exist in the Configuration Database but not yet in Slurm, filtering for only those of type ‘slurm’.

Yields

SlurmDeployment

New deployment objects that need to be processed
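The selection rules for new and watched deployments amount to set operations on the two tracked sets. A sketch, assuming that Slurm job names and deployment keys are directly comparable (any handling of JOB_PREFIX is left out); split_deployments is a hypothetical helper.

```python
def split_deployments(configdb_jobs, slurm_jobs):
    """Partition deployment keys by where they are currently known."""
    # In the Configuration Database but not yet in Slurm: to be submitted.
    new = configdb_jobs - slurm_jobs
    # Known to both systems: to be monitored.
    watch = configdb_jobs & slurm_jobs
    return new, watch
```

For example, with database keys {"a", "b"} and Slurm jobs {"b", "c"}, "a" is new and "b" is watched.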

refresh_connection(watcher)

Refresh the Configuration Database connection using the provided watcher.

Parameters

watcher : object

Watcher object for maintaining Configuration Database connection

submit_job_to_slurm(slurm_deployment) → None

Submit a deployment to Slurm and create its initial state.

If the submission fails, creates a failed state in the Configuration Database.

Parameters

slurm_deployment : SlurmDeployment

The deployment to be submitted to Slurm
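The submit-or-record-failure behaviour can be sketched as follows. Everything here is illustrative: SubmitError stands in for SlurmException, a plain dictionary stands in for the Configuration Database, and the initial PENDING status and the stored field names are assumptions.

```python
class SubmitError(Exception):
    """Stand-in for SlurmException."""


def submit_job(service, states, key, args):
    """Submit one deployment and record its initial or failed state."""
    try:
        job_id = service.post_job_submit(key, args)
    except SubmitError as exc:
        # Submission failed: record a FAILED state instead of a job ID.
        states[key] = {"status": "FAILED", "state_reason": str(exc)}
        return None
    # Submission succeeded: record the job ID with an assumed initial status.
    states[key] = {"status": "PENDING", "slurm_job_id": job_id}
    return job_id
```

Any object with a post_job_submit(job_name, args) method can be passed as service, which keeps the sketch independent of a live Slurm cluster.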

update_deployment_state(slurm_deployment: SlurmDeployment) → None

Update the state of a Slurm deployment if it is not “FINISHED”, “CANCELLED” or “FAILED”.

Parameters

slurm_deployment : SlurmDeployment

The deployment whose state needs to be updated
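The guard on terminal states can be expressed as a simple membership test. The names TERMINAL_STATES and needs_update are hypothetical; only the three state names come from the description above.

```python
# States after which a deployment is no longer updated, per the description above.
TERMINAL_STATES = frozenset({"FINISHED", "CANCELLED", "FAILED"})


def needs_update(status):
    """Return True while the deployment has not reached a terminal state."""
    return status not in TERMINAL_STATES
```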

class ska_sdp_slurmdeploy.deploymentmanager.SlurmCancelDeployment(key: str, job_id: int)

A dataclass representing a Slurm cancel deployment.

Parameters

job_id : int

Unique id for the deployment

job_id: int

class ska_sdp_slurmdeploy.deploymentmanager.SlurmDeployment(key: str)

A dataclass representing a Slurm deployment with its associated transaction.

Parameters

key : str

Unique identifier for the deployment

key: str

exception ska_sdp_slurmdeploy.deploymentmanager.SlurmFetchJobException

Exception raised when the deployment manager fails to list jobs from Slurm.

This exception indicates a communication or operational issue with Slurm when attempting to retrieve job information.

class ska_sdp_slurmdeploy.deploymentmanager.SlurmNewDeployment(key: str, args: dict, txn: DbTransaction)

A dataclass representing a Slurm new deployment.

Parameters

args : dict

Arguments for the Slurm deployment

txn : DbTransaction

Database transaction associated with this deployment

args: dict

txn: DbTransaction

class ska_sdp_slurmdeploy.deploymentmanager.SlurmWatchDeployment(key: str, txn: DbTransaction)

A dataclass representing a Slurm watch deployment.

Parameters

txn : DbTransaction

Database transaction associated with this deployment

txn: DbTransaction

SDP Configuration DB manager

Configuration Database Manager Module

This module provides functionality for managing deployment states in the Configuration Database. It handles state creation, updates, and queries for Slurm deployments.

class ska_sdp_slurmdeploy.configdbmanager.ConfigDBManager

Manager class for Configuration Database operations.

This class provides methods for managing deployment states in the Configuration Database, including creation, updates, and queries.

Attributes

target_jobs : set

Set of job keys being managed

connection : object

Connection to the SDP Configuration Database

create_deployment_state(txn: Transaction, slurm_deployment_state: SlurmDeploymentState)

Create a new Slurm deployment state in the Configuration Database.

Parameters

txn : Transaction

Current database transaction

slurm_deployment_state : SlurmDeploymentState

State object containing deployment information to be created

fetch_jobs_from_configdb(txn)

Fetch all job keys from the Configuration Database.

Parameters

txn : Transaction

Current database transaction

Returns

set

Set of job keys present in the database

static get_deployment(txn, dpl_key)

Fetch a deployment configuration from the Configuration Database.

Parameters

txn : Transaction

Current database transaction

dpl_key : str

Unique identifier for the deployment

Returns

object or None

Deployment object if found, None otherwise

static get_deployment_state(txn: Transaction, dpl_key: str) → SlurmDeploymentState | None

Fetch the state of a deployment from the Configuration Database.

Returns None if the deployment state is not found, or if the expected ‘jobs’ structure or the specific deployment key is missing within the retrieved state.

Parameters

txn : Transaction

Current database transaction

dpl_key : str

Key identifying the deployment

Returns

SlurmDeploymentState or None

State of the deployment if found, None otherwise

refresh_connection(txn)

Refresh the connection ownership to maintain active session.

Parameters

txn : Transaction

Current database transaction

set_connection(connection)

Set or update the SDP Configuration Database connection.

Parameters

connection : Config

Instance of the SDP Configuration Database connection

txn()

Start a new Configuration Database transaction.

Returns

Transaction

A new transaction object for database operations

update_deployment_configdb_state(txn: Transaction, new_dpl_state: SlurmDeploymentState) → None

Update an existing Slurm deployment state in the Configuration Database.

Parameters

txn : Transaction

Current database transaction

new_dpl_state : SlurmDeploymentState

New state to update the deployment to

class ska_sdp_slurmdeploy.configdbmanager.SlurmDeploymentState(key: str, job_id: int, status: str, state_reason: str | None = None, state_description: str | None = None)

A dataclass representing the state of a Slurm deployment.

Parameters

key : str

Unique identifier for the deployment

job_id : int

Slurm job ID associated with the deployment

status : str

Current status of the deployment (e.g., ‘RUNNING’, ‘FINISHED’, ‘FAILED’)

state_reason : str, optional

The reason given for the current state. Defaults to None.

state_description : str, optional

Optional details for state_reason. Defaults to None.

job_id: int

key: str

state_description: str | None = None

state_reason: str | None = None

status: str

to_dict() → dict[str, any]

Serialize the deployment state to a dictionary format.

Returns

dict[str, any]

A dictionary representing the serialized deployment state. It will always contain ‘num_job’ and ‘jobs’. The ‘jobs’ dictionary will contain an entry for this deployment state using its key, with values including ‘slurm_job_id’, ‘status’, and optionally ‘state_reason’ and ‘state_description’ if they are not None. It will also include ‘error_state’ if the deployment state is FAILED.
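The serialized shape described above could look roughly like this. The sketch rests on several assumptions that the description does not settle: ‘num_job’ is taken to be 1, ‘error_state’ is placed at the top level, and its value is a placeholder copy of the status. DeploymentState is a stand-in for SlurmDeploymentState.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DeploymentState:
    """Stand-in for SlurmDeploymentState."""

    key: str
    job_id: int
    status: str
    state_reason: Optional[str] = None
    state_description: Optional[str] = None

    def to_dict(self) -> dict[str, Any]:
        """Serialize to the 'num_job' / 'jobs' layout described above."""
        job = {"slurm_job_id": self.job_id, "status": self.status}
        # Optional fields are included only when they are set.
        if self.state_reason is not None:
            job["state_reason"] = self.state_reason
        if self.state_description is not None:
            job["state_description"] = self.state_description
        # Assumption: one state object describes exactly one job.
        result: dict[str, Any] = {"num_job": 1, "jobs": {self.key: job}}
        if self.status == "FAILED":
            # Assumption: placement and value of 'error_state' are placeholders.
            result["error_state"] = self.status
        return result
```

A running deployment with key ‘dpl-1’ and job ID 42 would serialize with a single entry under ‘jobs’ and no ‘error_state’ key.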