How to run on the AWS DP HPC cluster using SLURM

This page describes how to run the Continuum Imaging Pipeline (CIMG) on one or more nodes of the AWS DP HPC cluster using SLURM.

If you want to run inside a container on your local machine instead, see the quickstart guide.
If you want to monitor the job using the Prefect UI, see the Prefect instructions.

The scripts have been tested on the AWS DP HPC cluster.

Prerequisites

Log into the DP HPC headnode.
Change directory to the repository root folder, or set the REPO_DIR environment variable to point to it:
```
export REPO_DIR=~/path/to/repo/ska-sdp-cimg
```
Edit the SLURM script scripts/prod/aws-run-cimg.sbatch if needed (paths, job parameters, number of nodes, etc.).

If you require a specific version of the ska-sdp-spack environment, set the SPACK_TAG environment variable:
```
export SPACK_TAG="2026.03.2"
```
Submit the SLURM job. By default, the script sets up the compute environment using Spack and runs the pipeline on a single compute node:
```
sbatch scripts/prod/aws-run-cimg.sbatch
```
Output:
```
Submitted batch job <job_id>
```
To run on multiple nodes, override the SLURM directives on the command line when submitting the job. Set the number of tasks (--ntasks) equal to the number of nodes (--nodes):
```
sbatch --nodes=3 --ntasks=3 --cpus-per-task=96 scripts/prod/aws-run-cimg.sbatch
```
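To avoid a mismatch between the two values, the command can be generated from a single node count. The following is a minimal sketch; `cimg_sbatch_cmd` is a hypothetical helper and not part of the repository, and the default of 96 CPUs per task is simply copied from the example above.

```shell
# Hypothetical helper (not part of the repository): print the sbatch command
# for a run on N nodes, always keeping --ntasks equal to --nodes.
cimg_sbatch_cmd() {
    local nodes="${1:-1}"   # number of nodes (default: 1)
    local cpus="${2:-96}"   # CPUs per task (assumed default, taken from the example above)
    printf 'sbatch --nodes=%s --ntasks=%s --cpus-per-task=%s %s\n' \
        "$nodes" "$nodes" "$cpus" "scripts/prod/aws-run-cimg.sbatch"
}

# Print the command for a 3-node run so it can be inspected before running it.
cimg_sbatch_cmd 3
```

The helper only prints the command; copy the printed line into the shell to submit the job.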
Check job status:
```
squeue
sacct
```
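On a shared cluster it is often more convenient to query a single job than to list everything. The sketch below captures the job ID from the `Submitted batch job <job_id>` line and passes it to `squeue` (pending and running jobs) and `sacct` (accounting records, including finished jobs). The function names are hypothetical, and the `--format` field list is only one possible choice.

```shell
# Hypothetical helper: extract the job ID from sbatch's
# "Submitted batch job <job_id>" line read on standard input.
job_id_from_sbatch_output() {
    awk '/^Submitted batch job/ {print $4}'
}

# Hypothetical helper: show queue state and accounting info for one job.
cimg_job_status() {
    local job_id="$1"
    squeue -j "$job_id"
    sacct -j "$job_id" --format=JobID,JobName,State,Elapsed,ExitCode
}

# Usage on the cluster (commented out so that nothing is submitted by accident):
# JOB_ID=$(sbatch scripts/prod/aws-run-cimg.sbatch | job_id_from_sbatch_output)
# cimg_job_status "$JOB_ID"
```

Alternatively, `sbatch --parsable` prints the job ID without the surrounding text, which removes the need for the parsing step.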

Once the SLURM job has finished, the data products will be available in the latest time-stamped output directory under $PWD/runs for inspection.
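
As a convenience, the most recent run directory can be located from the shell. This is a minimal sketch that assumes "latest" means the most recently modified directory under `runs`; `latest_run_dir` is a hypothetical helper and not part of the repository.

```shell
# Hypothetical helper: print the most recently modified directory under
# the given runs folder (default: $PWD/runs). Prints an empty line if none exist.
# Note: parsing ls output assumes directory names contain no newlines.
latest_run_dir() {
    local runs_dir="${1:-$PWD/runs}"
    local latest
    latest=$(ls -1dt "$runs_dir"/*/ 2>/dev/null | head -n 1)
    printf '%s\n' "${latest%/}"   # strip the trailing slash added by the glob
}

echo "Latest run: $(latest_run_dir)"
```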

Logs are written to the file paths specified in the SLURM script. These include:

slurm-<job_name>-<job_id>.out: standard output and error from the job (including output from WSClean, which is run by the image task in tasks.image.py)
versions-<job_name>-<job_id>.txt: a list of versions of key software used in the job, including spack modules and python packages
runs/<job_name>-<job_id>-<timestamp>/monitor-<job_name>-<job_id>/: directory containing per-node benchmarking traces and logs when running on multiple nodes.
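
Because the file names embed the job ID, the logs for a given job can be found with a glob. The sketch below assumes the log files are written to a single directory (the current directory by default); the actual location depends on the paths set in the SLURM script. `find_job_logs` is a hypothetical helper.

```shell
# Hypothetical helper: list the SLURM output log and the versions file
# for a given job ID in a log directory (default: current directory).
find_job_logs() {
    local job_id="$1"
    local log_dir="${2:-.}"
    ls -1 "$log_dir"/slurm-*-"$job_id".out \
          "$log_dir"/versions-*-"$job_id".txt 2>/dev/null
}

# Usage (commented out): follow the job's output log while the job is running.
# tail -f "$(find_job_logs <job_id> | head -n 1)"
```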