Functionality and Usage
=======================

Environment Configuration
-------------------------

The connection to a SLURM cluster is configured through the following
environment variables. Which of the credential variables are required depends
on the authentication method (see Authentication below).

.. list-table::
   :widths: auto
   :header-rows: 1

   * - Environment Variable
     - Description
   * - ``SDP_SLURMDEPLOY_SLURM_URL``
     - API URL of the Slurm cluster to connect to
   * - ``SDP_SLURMDEPLOY_CLIENT_ID``
     - JWKS username used to access the Slurm cluster
   * - ``SDP_SLURMDEPLOY_AZURE_URL``
     - Address of the JWKS which issues JWTs
   * - ``SDP_SLURMDEPLOY_CLIENT_SECRET``
     - Secret used to request a JWT from the JWKS, which is then used to
       access the Slurm cluster
   * - ``SDP_SLURMDEPLOY_SLURM_JWT``
     - User-generated JWT

Key Components
--------------

The following modules make up the core of the SDP Slurm Deployer:

.. list-table::
   :widths: 20 80
   :header-rows: 1

   * - Module
     - Description
   * - ``slurmdeploy.py``
     - Entrypoint. Loads configuration, initializes components, and starts
       the async event loop to handle deployment events.
   * - ``configdbmanager.py``
     - Watches the SDP Configuration Database for new or updated ``slurm``
       deployments and triggers events to the DeploymentManager.
   * - ``deploymentmanager.py``
     - Processes deployment events, constructs jobs with ``mcs_label``,
       submits jobs via SlurmService, and handles the job lifecycle.
   * - ``slurmservice.py``
     - Interfaces with the SLURM REST API for job submission, status queries,
       and cancellation. Filters jobs by ``mcs_label`` and handles
       authentication. Supports SLURM REST API versions ``0.0.39`` and
       ``0.0.40``.
   * - ``token.py``
     - Interfaces with Azure Entra to obtain a JWT used to authenticate
       against the Slurm REST API.

Authentication
--------------

To submit a job to the Slurm REST API, a recognised username and a JWT (JSON
Web Token) must be passed with the request. This can be done in two ways:
directly with a user-supplied JWT, or indirectly via a JSON Web Key Set (JWKS).
The Slurm deployer tries the JWT method first; if the required environment
variables are not supplied, it falls back to the JWKS method.

The JWT approach requires providing a username (client ID) and a JWT
associated with that username via the following environment variables:

- ``SDP_SLURMDEPLOY_CLIENT_ID``
- ``SDP_SLURMDEPLOY_SLURM_JWT``

For the JWKS approach, the username is the 'client id' of the slurm deployer
app. The JWT is issued from a JWKS in Azure Entra. The credentials used to
request this JWT are the client id and client secret, which are set via the
following environment variables:

- ``SDP_SLURMDEPLOY_AZURE_URL``
- ``SDP_SLURMDEPLOY_CLIENT_ID``
- ``SDP_SLURMDEPLOY_CLIENT_SECRET``

Job Configuration
-----------------

Slurm jobs are configured via environment variables and the SDP deployment
submitted by the processing script. Deployments can override the defaults set
in the slurm deployer.

By default, the following environment variables are passed to the slurm jobs:

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Job environment variable
     - Default
   * - ``PATH``
     - ``/usr/local/bin:/usr/bin:/bin`` (loaded from the
       ``SDP_SLURMDEPLOY_PATH`` environment variable)
   * - ``LD_LIBRARY_PATH``
     - ``/lib/:/lib64/:/usr/local/lib``
   * - ``SLURM_SUBMIT_DIR``
     - ``None`` or ``current_working_directory`` set in the SDP Deployment
   * - ``HOME``
     - ``/home/None`` (``/home/{self.username}``)

In addition, any project environment variable whose name starts with
``SDP_SLURM_ENV_`` is also passed to the job environment, with that prefix
removed from the variable name, e.g.:

.. code-block:: python

   import os

   KUBECONFIG = os.getenv("SDP_SLURM_ENV_KUBECONFIG")

Job Labelling
-------------

To ensure jobs can be tracked per deployment instance, each job submitted by
the deployer is tagged with a unique ``mcs_label``.
This label is automatically generated from the following environment
variables:

- ``SDP_SLURMDEPLOY_RELEASE_NAME``
- ``SDP_SLURMDEPLOY_NAMESPACE``

Label format::

   sdp_slurm_deployer_{SDP_SLURMDEPLOY_RELEASE_NAME}_{SDP_SLURMDEPLOY_NAMESPACE}

This labelling ensures that only jobs associated with the current deployment
are queried or cancelled, avoiding conflicts with other users or deployments.

Job Monitoring
--------------

The SDP Slurm Deployer tracks the status of the jobs it submits to SLURM by
querying the cluster at regular intervals. Job states such as ``PENDING``,
``RUNNING``, ``COMPLETED``, and ``FAILED`` are fetched and matched to
deployments defined in the SDP Configuration Database.

This mapping allows the deployer to:

- Update internal deployment state
- Detect failed or missing jobs
- Cancel jobs if a deployment is removed

Below is the mapping of SLURM job states to their corresponding SDP deployment
states:

.. list-table::
   :header-rows: 1
   :widths: 30 30

   * - SLURM Job State
     - SDP Deployment State
   * - ``RUNNING``
     - ``RUNNING``
   * - ``PENDING``
     - ``PENDING``
   * - ``COMPLETED``
     - ``FINISHED``
   * - ``BOOT_FAIL``
     - ``FAILED``
   * - ``DEADLINE``
     - ``FAILED``
   * - ``FAILED``
     - ``FAILED``
   * - ``NODE_FAIL``
     - ``FAILED``
   * - ``OUT_OF_MEMORY``
     - ``FAILED``
   * - ``PREEMPTED``
     - ``FAILED``
   * - ``SUSPENDED``
     - ``FAILED``
   * - ``TIMEOUT``
     - ``FAILED``
   * - ``CANCELLED``
     - ``CANCELLED``

Monitoring is handled in the ``DeploymentManager`` via the
``fetch_deployments_and_slurm_jobs()`` method, with API interactions
abstracted in ``SlurmService``.

SDP Deployment
--------------

The SDP Slurm Deployer can be deployed in the SDP system by enabling it during
installation. This allows the deployer to submit ``slurm`` jobs to the SLURM
cluster as needed.

To enable the Slurm Deployer during installation, append the following
``--set`` argument to your ``helm upgrade`` command:

.. code-block:: bash

   --set slurmdeploy.enabled=true

For detailed instructions on installing the SDP, refer to the `SDP Installation Guide `_.
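
Worked Example
--------------

The rules described in the Authentication, Job Configuration, Job Labelling,
and Job Monitoring sections can be summarised in a short Python sketch. It is
an illustration only, not the deployer's actual code: the function and
constant names are hypothetical, the authentication check assumes that "the
correct environment variables" means both JWT variables are set and non-empty,
and returning ``None`` for a SLURM state missing from the table is likewise an
assumption.

.. code-block:: python

   # Illustrative sketch; all names below are hypothetical, not the deployer's API.

   # SLURM job state -> SDP deployment state, copied from the mapping table.
   SLURM_TO_SDP_STATE = {
       "RUNNING": "RUNNING",
       "PENDING": "PENDING",
       "COMPLETED": "FINISHED",
       "BOOT_FAIL": "FAILED",
       "DEADLINE": "FAILED",
       "FAILED": "FAILED",
       "NODE_FAIL": "FAILED",
       "OUT_OF_MEMORY": "FAILED",
       "PREEMPTED": "FAILED",
       "SUSPENDED": "FAILED",
       "TIMEOUT": "FAILED",
       "CANCELLED": "CANCELLED",
   }

   # Prefix of variables forwarded to the job environment.
   JOB_ENV_PREFIX = "SDP_SLURM_ENV_"


   def make_mcs_label(env):
       """Build the mcs_label from the release name and namespace."""
       release = env["SDP_SLURMDEPLOY_RELEASE_NAME"]
       namespace = env["SDP_SLURMDEPLOY_NAMESPACE"]
       return f"sdp_slurm_deployer_{release}_{namespace}"


   def select_auth_method(env):
       """Return "jwt" if both JWT variables are set, otherwise "jwks"."""
       # Assumption: an empty value counts as "not supplied".
       if env.get("SDP_SLURMDEPLOY_CLIENT_ID") and env.get("SDP_SLURMDEPLOY_SLURM_JWT"):
           return "jwt"
       return "jwks"


   def extra_job_env(env):
       """Collect SDP_SLURM_ENV_* variables with the prefix removed."""
       return {
           name[len(JOB_ENV_PREFIX):]: value
           for name, value in env.items()
           if name.startswith(JOB_ENV_PREFIX)
       }


   def to_sdp_state(slurm_state):
       """Map a SLURM job state to an SDP deployment state."""
       # Assumption: states missing from the table map to None.
       return SLURM_TO_SDP_STATE.get(slurm_state)

Each function takes a plain mapping, so ``os.environ`` can be passed in a real
process and a small dictionary in tests. For example, a mapping with
``SDP_SLURMDEPLOY_RELEASE_NAME`` set to ``test`` and
``SDP_SLURMDEPLOY_NAMESPACE`` set to ``sdp`` yields the label
``sdp_slurm_deployer_test_sdp``.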