Slurm Deployer API
Main
Slurm Deployer Module
This module provides the main deployment service for managing Slurm jobs in the SDP system. It handles the continuous monitoring and management of deployments between the Configuration Database and Slurm, including job submission, state tracking, and lifecycle management.
Environment Variables
- SDP_LOG_LEVEL (str, optional)
Logging level for the deployer (default: ‘DEBUG’)
- SDP_SLURMDEPLOY_LOOP_INTERVAL (int, optional)
Time interval in seconds between deployment checks (default: 60)
- SDP_SLURMDEPLOY_LIVENESS_FILE (str, optional)
Path to the liveness probe file (default: ‘/tmp/alive’)
- SDP_SLURMDEPLOY_SLURM_URL (str)
URL of the Slurm REST API
- SDP_SLURMDEPLOY_AZURE_URL (str)
URL from which to fetch a JWT from Azure
- SDP_SLURMDEPLOY_CLIENT_ID (str)
Identifier of the Slurm deployer, used for Azure and for Slurm user mapping
- SDP_SLURMDEPLOY_CLIENT_SECRET (str)
Secret for accessing Azure
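The variables and defaults listed above could be read as in the following minimal sketch. `DeployerSettings` and `load_settings` are hypothetical names used only for illustration; they are not part of ska_sdp_slurmdeploy, and the module may read its environment differently.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeployerSettings:
    """Hypothetical container for illustration; not part of the package."""

    log_level: str
    loop_interval: int
    liveness_file: str
    slurm_url: Optional[str]
    client_id: Optional[str]


def load_settings(env=None) -> DeployerSettings:
    """Read the documented variables, applying the documented defaults."""
    env = os.environ if env is None else env
    return DeployerSettings(
        log_level=env.get("SDP_LOG_LEVEL", "DEBUG"),
        # Environment values are strings, so the interval is converted to int.
        loop_interval=int(env.get("SDP_SLURMDEPLOY_LOOP_INTERVAL", "60")),
        liveness_file=env.get("SDP_SLURMDEPLOY_LIVENESS_FILE", "/tmp/alive"),
        # Variables without a documented default are left as None when unset.
        slurm_url=env.get("SDP_SLURMDEPLOY_SLURM_URL"),
        client_id=env.get("SDP_SLURMDEPLOY_CLIENT_ID"),
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the sketch easy to exercise in isolation.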
- exception ska_sdp_slurmdeploy.slurmdeploy.MissingAuthEnvVar
Exception raised when the authentication method for configuring the slurm service is not defined.
This exception means that either the given auth method is not supported, or it was not defined at all.
- ska_sdp_slurmdeploy.slurmdeploy.configure_slurmservice()
The Slurm service can be configured to authenticate with ‘JWT’ or ‘JWKS’. Each choice requires further environment variables to be set.
JWT requires: SDP_SLURMDEPLOY_SLURM_URL, SDP_SLURMDEPLOY_CLIENT_ID, SDP_SLURMDEPLOY_SLURM_JWT.
JWKS requires the following to create a JWT: SDP_SLURMDEPLOY_AZURE_URL, SDP_SLURMDEPLOY_CLIENT_ID, SDP_SLURMDEPLOY_CLIENT_SECRET.
JWKS then requires the following to configure the service: SDP_SLURMDEPLOY_SLURM_URL, SDP_SLURMDEPLOY_CLIENT_ID.
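The per-method requirements above can be checked with a small lookup table. This is only a sketch: `REQUIRED_VARS` and `missing_vars` are hypothetical helpers, the exception class is a local stand-in for the documented one, and the auth method is passed as an argument because the way the deployer selects it is not described here.

```python
class MissingAuthEnvVar(Exception):
    """Local stand-in for the documented exception of the same name."""


# Variable names are taken from the description above.
REQUIRED_VARS = {
    "JWT": (
        "SDP_SLURMDEPLOY_SLURM_URL",
        "SDP_SLURMDEPLOY_CLIENT_ID",
        "SDP_SLURMDEPLOY_SLURM_JWT",
    ),
    "JWKS": (
        "SDP_SLURMDEPLOY_AZURE_URL",
        "SDP_SLURMDEPLOY_CLIENT_ID",
        "SDP_SLURMDEPLOY_CLIENT_SECRET",
        "SDP_SLURMDEPLOY_SLURM_URL",
    ),
}


def missing_vars(auth_method, env):
    """Return the required variables that are absent from env."""
    if auth_method not in REQUIRED_VARS:
        # Covers both an unsupported method and no method at all (None).
        raise MissingAuthEnvVar(f"unsupported or undefined auth method: {auth_method!r}")
    return [name for name in REQUIRED_VARS[auth_method] if name not in env]
```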
- ska_sdp_slurmdeploy.slurmdeploy.main(backend=None)
Entry point for the Slurm deployer service.
This function initializes all necessary components and starts the main deployment service loop. It handles:
1. Signal handler setup for graceful termination
2. Configuration Database client initialization
3. Slurm service setup with authentication
4. Deployment manager initialization
5. Main service loop execution
6. Proper cleanup on exit
Parameters
- backend (object, optional)
Backend configuration for the Configuration Database client. If None, uses the default backend.
Notes
Required environment variables: SDP_SLURMDEPLOY_SLURM_URL, SDP_SLURMDEPLOY_AZURE_URL, SDP_SLURMDEPLOY_CLIENT_ID, SDP_SLURMDEPLOY_CLIENT_SECRET.
- ska_sdp_slurmdeploy.slurmdeploy.slurm_deployer(config: Config)
Main deployment service loop for managing Slurm jobs.
This function runs continuously, monitoring the Configuration Database for new deployments and managing existing ones. For each iteration:
1. Takes ownership of the Configuration Database transaction
2. Creates a connection to the Slurm service
3. Updates the liveness probe file
4. Refreshes the connection to the Configuration Database
5. Fetches current deployments and Slurm jobs
6. Processes new deployments by submitting them to Slurm
7. Monitors and updates states of existing deployments
Parameters
- config (Config)
Configuration Database client instance
Notes
The loop interval is controlled by the SDP_SLURMDEPLOY_LOOP_INTERVAL environment variable.
The liveness probe file is touched in each iteration for health monitoring.
The loop continues running until terminated by a SIGTERM signal.
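The shape of such a loop can be sketched as follows. This is a simplified illustration, not the service's code: `run_loop`, `step` and `should_stop` are hypothetical names, the Configuration Database and Slurm work is collapsed into a single `step` callable, and a stop predicate replaces SIGTERM so the loop can be exercised in a test.

```python
import time
from pathlib import Path


def run_loop(step, liveness_file, interval, should_stop):
    """Call step() until should_stop() is true; return the number of passes."""
    iterations = 0
    while not should_stop():
        # Touching the file updates its modification time, which a health
        # probe can inspect to confirm the loop is still running.
        Path(liveness_file).touch()
        step()
        iterations += 1
        time.sleep(interval)
    return iterations
```

A probe that checks the age of the liveness file can then detect a stalled loop, provided its threshold is longer than the loop interval.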
- ska_sdp_slurmdeploy.slurmdeploy.terminate(_signame, _frame)
Signal handler for graceful termination.
This function is called when the process receives a SIGTERM signal. It logs the termination request and exits the process cleanly.
Parameters
- _signame (str)
Name of the signal (unused)
- _frame (frame)
Current stack frame (unused)
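A handler of this kind is typically registered with the standard `signal` module. The sketch below assumes that a clean exit means exit status 0 and uses a hypothetical `install_handler` helper; the log message is illustrative.

```python
import logging
import signal
import sys

LOG = logging.getLogger(__name__)


def terminate(_signame, _frame):
    """Log the termination request and exit (status 0 is an assumption)."""
    LOG.info("Asked to terminate")
    sys.exit(0)


def install_handler():
    """Register terminate() for SIGTERM; must run in the main thread."""
    signal.signal(signal.SIGTERM, terminate)
```

Because `sys.exit` raises `SystemExit`, any `finally` blocks in the interrupted code still run, which leaves room for cleanup on exit.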
Slurm service
Slurm REST Service Module
This module provides a service layer for interacting with the Slurm workload manager through its REST API. It handles job submissions, status queries, and job information retrieval while abstracting the underlying HTTP communication details.
Environment Variables
- SDP_SLURMDEPLOY_RELEASE_NAME (str, optional)
The Helm release name (default: ‘slurm-deployer’)
- SDP_SLURMDEPLOY_NAMESPACE (str, optional)
The namespace where the application is deployed (default: ‘sdp’)
- class ska_sdp_slurmdeploy.slurmservice.Slurm39(url: str, username: str, token: Token | str, timeout: int = 5)
Slurm API v0.0.39 usage.
- class ska_sdp_slurmdeploy.slurmservice.Slurm40(url: str, username: str, token: Token | str, timeout: int = 5)
Slurm API v0.0.40 usage.
- exception ska_sdp_slurmdeploy.slurmservice.SlurmException(errors: str | list, response)
Slurm Exception for API errors.
- class ska_sdp_slurmdeploy.slurmservice.SlurmGeneric(url: str, username: str, token: Token | str, timeout: int = 5)
Basic layout for the Slurm API, based on API 0.0.39
- cancel_job(job_id: int) → bool
Cancel slurm job by job id.
- Param:
job_id: The Slurm Job ID
- Returns:
True if the job was cancelled.
- Raises:
SlurmException – if command fails
- filter_jobs_by_mcs_label(jobs: list[SlurmJob], mcs_label: str) → list[SlurmJob]
Filter a list of Slurm jobs by mcs_label.
Parameters
- jobs (list of SlurmJob)
A list of SlurmJob objects.
- mcs_label (str)
The mcs_label to filter by.
Returns
- list of SlurmJob
A list of SlurmJob objects that match the mcs_label.
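The filtering described above can be illustrated with a stand-in dataclass. The sketch assumes an exact string match on `mcs_label`; `Job` is a hypothetical class carrying only a subset of the SlurmJob fields, and the function here is a free function rather than a method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    """Stand-in with a subset of the SlurmJob fields, for illustration only."""

    name: str
    job_id: int
    job_state: str
    mcs_label: Optional[str] = None


def filter_jobs_by_mcs_label(jobs, mcs_label):
    """Keep only the jobs whose mcs_label equals the given label."""
    return [job for job in jobs if job.mcs_label == mcs_label]
```

Jobs without a label (`mcs_label` of None) are dropped whenever a non-None label is requested.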
- get_job(job_id: int) → SlurmJob | None
Retrieve information about a specific Slurm job.
- Param:
job_id: The Slurm Job ID
- Returns:
Slurm job if found
- Raises:
SlurmException – if command fails
- list_jobs() → list[SlurmJob] | None
Retrieve all jobs from the Slurm cluster.
- Returns:
List of slurm jobs
- Raises:
SlurmException – if command fails
- ping() → dict
Ping the server.
- Returns:
A dictionary representing the server state
- Raises:
SlurmException – if command fails
- post_job_submit(job_name: str, args: dict) → int | None
Submit a new job to the Slurm cluster.
- Param:
job_name: The name to assign to the job.
- Args:
The Job configuration parameters
- Returns:
The Job ID
- Raises:
SlurmException – if command fails
- class ska_sdp_slurmdeploy.slurmservice.SlurmJob(name: str, job_id: int, job_state: str, job_state_reason: str | None = None, state_reason: str | None = None, state_description: str | None = None, mcs_label: str | None = None)
A dataclass representing a Slurm job and its current state.
- job_id: int
Unique identifier assigned by Slurm
- job_state: str
Current state of the job (e.g., ‘PENDING’, ‘RUNNING’, ‘COMPLETED’)
- job_state_reason: str | None = None
This will almost always be None, but in some cases it might not be
- mcs_label: str | None = None
The mcs label added to the Slurm job to identify jobs associated with a particular deployment of SDP
- name: str
Name of the Slurm job
- state_description: str | None = None
Optional details for state_reason
- state_reason: str | None = None
The reason given for the current state
- class ska_sdp_slurmdeploy.slurmservice.SlurmService(slurm_url: str, slurm_username: str, slurm_jwt: Token | str, timeout: int = 5)
The interface to manage the Slurm service.
- cancel_job(job_id: int) → bool
Cancel slurm job by job id.
- Param:
job_id: The Slurm Job ID
- Returns:
True if the job was cancelled.
- Raises:
SlurmException – if command fails
- get_job(job_id) → SlurmJob | None
Retrieve information about a specific Slurm job.
- Parameters:
job_id (int) – The Slurm Job ID
- Returns:
Slurm job if found and matches the deployment label, otherwise None.
- Return type:
SlurmJob | None
- get_jobs() → list[SlurmJob]
Retrieve all jobs from the Slurm cluster.
- Returns:
List of slurm jobs
- Raises:
SlurmException – if command fails
- ping()
Ping the server.
- Returns:
A dictionary representing the server state
- Raises:
SlurmException – if command fails
- post_job_submit(job_name: str, args: dict) → int
Submit a new job to the Slurm cluster.
- Param:
job_name: The name to assign to the job.
- Args:
The Job configuration parameters
- Returns:
The Job ID
- Raises:
SlurmException – if command fails
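A caller would typically submit a job and then look it up by the returned ID. The sketch below shows that call order against an in-memory stand-in. `FakeSlurmService` and `submit_and_look_up` are hypothetical; the fake stores plain dictionaries instead of SlurmJob objects, makes no HTTP calls, and never raises SlurmException, and the key in the example `args` dictionary is invented for illustration.

```python
class FakeSlurmService:
    """In-memory stand-in exposing two of the documented method names."""

    def __init__(self):
        self._jobs = {}
        self._next_id = 1

    def post_job_submit(self, job_name, args):
        """Record the job and return a new job ID."""
        job_id = self._next_id
        self._next_id += 1
        self._jobs[job_id] = {"name": job_name, "job_state": "PENDING", "args": args}
        return job_id

    def get_job(self, job_id):
        """Return the stored job, or None if the ID is unknown."""
        return self._jobs.get(job_id)


def submit_and_look_up(service, job_name, args):
    """Submit a job, then fetch it back by the returned ID."""
    job_id = service.post_job_submit(job_name, args)
    return job_id, service.get_job(job_id)


service = FakeSlurmService()
# "hypothetical_option" is a made-up key, not a documented Slurm parameter.
job_id, job = submit_and_look_up(service, "sdp-example", {"hypothetical_option": "value"})
```

With the real SlurmService, the same two calls should be wrapped in handling for SlurmException, and `get_job` may return None when the job does not match the deployment label.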
Deployment manager
SDP Slurm Deployment Manager Module
This module provides functionality to manage Slurm deployments and their states in the SDP Configuration Database. It handles the synchronization between Slurm job states and their corresponding deployment states in the Configuration Database.
- class ska_sdp_slurmdeploy.deploymentmanager.DeploymentManager(slurmservice, configdb_manager)
Manager class for handling Slurm deployments and Configuration Database synchronization.
This class is responsible for managing the lifecycle of Slurm deployments and their corresponding states in the Configuration Database. It handles job submission, state updates, and synchronization between Slurm and the Configuration Database.
Attributes
- slurmservice (object)
Service object for interacting with Slurm
- configdb_manager (ConfigDBManager)
Manager for Configuration Database operations
- slurm_jobs (set)
Set of job names currently in Slurm
- configdb_jobs (set)
Set of deployment keys from Configuration Database
- JOB_PREFIX = 'sdp'
- cancel_slurm_job(deployment)
Cancel a Slurm job.
Parameters
- deployment (SlurmCancelDeployment)
The deployment which needs to be cancelled.
- static create_slurm_deployment_state(key, job: SlurmJob)
Create a deployment state object for a Slurm job.
Maps Slurm job states to SDP deployment states using the state mapping.
Parameters
- key (str)
Deployment key
- job (SlurmJob)
Slurm job object containing job information
Returns
- SlurmDeploymentState
New deployment state object
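The state mapping itself is not listed here, so the table below is purely hypothetical and shows only the general technique of a dictionary lookup with a fallback. The Slurm state names and SDP status names are examples drawn from elsewhere in this page; the actual pairs used by ska_sdp_slurmdeploy may differ.

```python
# Hypothetical mapping for illustration; the real table may differ.
STATE_MAP = {
    "PENDING": "PENDING",
    "RUNNING": "RUNNING",
    "COMPLETED": "FINISHED",
    "CANCELLED": "CANCELLED",
    "FAILED": "FAILED",
}


def to_deployment_status(job_state, default="FAILED"):
    """Translate a Slurm job state; unknown states fall back to default."""
    return STATE_MAP.get(job_state, default)
```

Treating unknown Slurm states as failures is one possible choice; a real implementation might prefer to leave the previous status untouched.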
- fetch_deployments_and_slurm_jobs()
Refresh the lists of deployments from Configuration Database and jobs from Slurm.
This method updates the internal sets of jobs from both systems for tracking and synchronization purposes.
Raises
- SlurmFetchJobException
If unable to retrieve job information from Slurm
- get_deployments_to_cancel() → Generator[SlurmWatchDeployment, None, None]
Get deployments to cancel.
- get_deployments_to_watch() → Generator[SlurmWatchDeployment, None, None]
Generate deployments that need to be monitored.
Yields deployments that exist in both Slurm and the Configuration Database for state monitoring.
Yields
- SlurmDeployment
Deployment objects that need to be monitored
- get_new_deployments() → Generator[SlurmWatchDeployment, None, None]
Generate new Slurm deployments that need to be processed.
Yields deployments that exist in the Configuration Database but not yet in Slurm, filtering for only those of type ‘slurm’.
Yields
- SlurmDeployment
New deployment objects that need to be processed
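The selection rules for new and watched deployments amount to set operations on the two tracked sets. A minimal sketch, assuming both sets hold directly comparable keys (the real class may need to account for JOB_PREFIX, and it also filters for type ‘slurm’, which is omitted here); `split_deployments` is a hypothetical helper.

```python
def split_deployments(configdb_jobs, slurm_jobs):
    """Return (new, watch) as sorted lists of deployment keys."""
    # In the Configuration Database but not yet in Slurm: to be submitted.
    new = sorted(configdb_jobs - slurm_jobs)
    # Present in both systems: to be monitored.
    watch = sorted(configdb_jobs & slurm_jobs)
    return new, watch
```

Sorting is only there to make the result deterministic; the documented methods are generators and make no ordering promise.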
- refresh_connection(watcher)
Refresh the Configuration Database connection using the provided watcher.
Parameters
- watcher (object)
Watcher object for maintaining Configuration Database connection
- submit_job_to_slurm(slurm_deployment) → None
Submit a deployment to Slurm and create its initial state.
If the submission fails, creates a failed state in the Configuration Database.
Parameters
- slurm_deployment (SlurmDeployment)
The deployment to be submitted to Slurm
- update_deployment_state(slurm_deployment: SlurmDeployment) → None
Update the state of a Slurm deployment if it is not “FINISHED”, “CANCELLED” or “FAILED”.
Parameters
- slurm_deployment (SlurmDeployment)
The deployment whose state needs to be updated
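The guard described above can be expressed as a membership test against the three terminal statuses. `TERMINAL_STATUSES` and `should_update` are hypothetical names, and the sketch assumes the check is made on the status currently stored for the deployment.

```python
# The three statuses named in the description above.
TERMINAL_STATUSES = frozenset({"FINISHED", "CANCELLED", "FAILED"})


def should_update(current_status):
    """Return True when the deployment has not reached a terminal status."""
    return current_status not in TERMINAL_STATUSES
```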
- class ska_sdp_slurmdeploy.deploymentmanager.SlurmCancelDeployment(key: str, job_id: int)
A dataclass representing a Slurm cancel deployment.
Parameters
- job_id (int)
Unique id for the deployment
- job_id: int
- class ska_sdp_slurmdeploy.deploymentmanager.SlurmDeployment(key: str)
A dataclass representing a Slurm deployment with its associated transaction.
Parameters
- key (str)
Unique identifier for the deployment
- key: str
- exception ska_sdp_slurmdeploy.deploymentmanager.SlurmFetchJobException
Exception raised when the deployment manager fails to list jobs from Slurm.
This exception indicates a communication or operational issue with Slurm when attempting to retrieve job information.
- class ska_sdp_slurmdeploy.deploymentmanager.SlurmNewDeployment(key: str, args: dict, txn: DbTransaction)
A dataclass representing a Slurm new deployment.
Parameters
- args (dict)
Arguments for the Slurm deployment
- txn (DbTransaction)
Database transaction associated with this deployment
- args: dict
- txn: DbTransaction
SDP Configuration DB manager
Configuration Database Manager Module
This module provides functionality for managing deployment states in the Configuration Database. It handles state creation, updates, and queries for Slurm deployments.
- class ska_sdp_slurmdeploy.configdbmanager.ConfigDBManager
Manager class for Configuration Database operations.
This class provides methods for managing deployment states in the Configuration Database, including creation, updates, and queries.
Attributes
- target_jobs (set)
Set of job keys being managed
- connection (object)
Connection to the SDP Configuration Database
- create_deployment_state(txn: Transaction, slurm_deployment_state: SlurmDeploymentState)
Create a new Slurm deployment state in the Configuration Database.
Parameters
- txn (Transaction)
Current database transaction
- slurm_deployment_state (SlurmDeploymentState)
State object containing deployment information to be created
- fetch_jobs_from_configdb(txn)
Fetch all job keys from the Configuration Database.
Parameters
- txn (Transaction)
Current database transaction
Returns
- set
Set of job keys present in the database
- static get_deployment(txn, dpl_key)
Fetch a deployment configuration from the Configuration Database.
Parameters
- txn (Transaction)
Current database transaction
- dpl_key (str)
Unique identifier for the deployment
Returns
- object or None
Deployment object if found, None otherwise
- static get_deployment_state(txn: Transaction, dpl_key: str) → SlurmDeploymentState | None
Fetch the state of a deployment from the Configuration Database.
Returns None if the deployment state is not found, or if the expected ‘jobs’ structure or the specific deployment key is missing within the retrieved state.
Parameters
- txn (Transaction)
Current database transaction
- dpl_key (str)
Key identifying the deployment
Returns
- SlurmDeploymentState or None
State of the deployment if found, None otherwise
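The three None cases described above (no state, no ‘jobs’ structure, no entry for the key) can be sketched as a defensive lookup on a plain dictionary. `extract_job_entry` is a hypothetical helper; it returns the raw entry rather than a SlurmDeploymentState and does not touch the Configuration Database.

```python
def extract_job_entry(raw_state, dpl_key):
    """Return the 'jobs' entry for dpl_key, or None if anything is missing."""
    if not raw_state:
        # No state stored for this deployment.
        return None
    jobs = raw_state.get("jobs")
    if not isinstance(jobs, dict):
        # The expected 'jobs' structure is absent or malformed.
        return None
    # dict.get returns None when the deployment key is missing.
    return jobs.get(dpl_key)
```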
- refresh_connection(txn)
Refresh the connection ownership to maintain active session.
Parameters
- txn (Transaction)
Current database transaction
- set_connection(connection)
Set or update the SDP Configuration Database connection.
Parameters
- connection (Config)
Instance of the SDP Configuration Database connection
- txn()
Start a new Configuration Database transaction.
Returns
- Transaction
A new transaction object for database operations
- update_deployment_configdb_state(txn: Transaction, new_dpl_state: SlurmDeploymentState) → None
Update an existing Slurm deployment state in the Configuration Database.
Parameters
- txn (Transaction)
Current database transaction
- new_dpl_state (SlurmDeploymentState)
New state to update the deployment to
- class ska_sdp_slurmdeploy.configdbmanager.SlurmDeploymentState(key: str, job_id: int, status: str, state_reason: str | None = None, state_description: str | None = None)
A dataclass representing the state of a Slurm deployment.
Parameters
- key (str)
Unique identifier for the deployment
- job_id (int)
Slurm job ID associated with the deployment
- status (str)
Current status of the deployment (e.g., ‘RUNNING’, ‘FINISHED’, ‘FAILED’)
- state_reason (str)
The reason given for the current state. Defaults to None.
- state_description (str)
Optional details for state_reason. Defaults to None.
- job_id: int
- key: str
- state_description: str | None = None
- state_reason: str | None = None
- status: str
- to_dict() → dict[str, any]
Serialize the deployment state to a dictionary format.
Returns
- dict[str, any]
A dictionary representing the serialized deployment state. It will always contain ‘num_job’ and ‘jobs’. The ‘jobs’ dictionary will contain an entry for this deployment state using its key, with values including ‘slurm_job_id’, ‘status’, and optionally ‘state_reason’ and ‘state_description’ if they are not None. It will also include ‘error_state’ if the deployment state is FAILED.
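The serialized shape described above can be sketched with a stand-in dataclass. Two points are assumptions: ‘num_job’ is set to 1 on the basis that one state object describes one job, and the ‘error_state’ entry is left out entirely because neither its position nor its value is specified here. `DeploymentStateSketch` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DeploymentStateSketch:
    """Stand-in mirroring the documented fields of SlurmDeploymentState."""

    key: str
    job_id: int
    status: str
    state_reason: Optional[str] = None
    state_description: Optional[str] = None

    def to_dict(self) -> dict[str, Any]:
        """Serialize to the documented layout (without 'error_state')."""
        entry = {"slurm_job_id": self.job_id, "status": self.status}
        # Optional fields are emitted only when they are not None.
        if self.state_reason is not None:
            entry["state_reason"] = self.state_reason
        if self.state_description is not None:
            entry["state_description"] = self.state_description
        # num_job = 1 is an assumption, not taken from the package.
        return {"num_job": 1, "jobs": {self.key: entry}}
```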