How to run on the AWS DP HPC cluster using SLURM
================================================

This page describes how to run the Continuum Imaging Pipeline (CIMG) on one or more nodes of the AWS DP HPC cluster using SLURM.

- If you want to run inside a container on your local machine instead, see the `quickstart guide <../quickstart.html>`_.
- If you want to monitor the job using the Prefect UI, see the `Prefect instructions <../usage/aws_prefect.html>`_.

The scripts have been tested on the AWS DP HPC cluster.

Prerequisites
-------------

- An account on the AWS DP HPC cluster
- This repository cloned to a directory on the AWS DP HPC cluster

Steps
-----

1. Submit a SLURM job to run the pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Log into the DP HPC head node.

2. Change directory to the repository root folder, OR set the ``REPO_DIR`` environment variable:

   .. code-block:: bash

      export REPO_DIR=~/path/to/repo/ska-sdp-cimg

3. Edit the SLURM script `scripts/prod/aws-run-cimg.sbatch `_ if needed (paths, job parameters, number of nodes, etc.).

   If you require a specific version of the ska-sdp-spack environment, set ``SPACK_TAG``:

   .. code-block:: bash

      export SPACK_TAG="2026.03.2"

4. Submit the SLURM job:

   The script sets up the compute environment using Spack and runs the pipeline on a single compute node by default.

   .. code-block:: bash

      sbatch scripts/prod/aws-run-cimg.sbatch

   Output:

   .. code-block:: console

      Submitted batch job

   To run on multiple nodes, override the SLURM directives when submitting the job. The number of tasks must equal the total number of nodes.

   .. code-block:: bash

      sbatch --nodes=3 --ntasks=3 --cpus-per-task=96 scripts/prod/aws-run-cimg.sbatch

5. Check job status:

   .. code-block:: bash

      squeue
      sacct

2. Finishing up
~~~~~~~~~~~~~~~

Once the SLURM job has finished, the data products are available for inspection in the latest time-stamped output directory under ``$PWD/runs``.

Logs are written to the file paths specified in the SLURM script.
These include:

- ``slurm--.out``: standard output and error from the job (including output from WSClean, which is run by the ``image`` task in ``tasks.image.py``)
- ``versions--.txt``: a list of versions of key software used in the job, including spack modules and python packages
- ``runs/--/monitor--/``: directory containing per-node benchmarking traces and logs when running on multiple nodes.
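
The finishing-up step can be sketched as a small shell helper that locates the newest run directory once the job has completed. This is a minimal sketch rather than part of the repository: the function name ``latest_run_dir`` is hypothetical, and it assumes that the latest time-stamped directory under ``runs`` is also the most recently modified one.

.. code-block:: bash

   #!/usr/bin/env bash

   # Hypothetical helper (not part of the repository): print the most
   # recently modified sub-directory of a runs directory.
   # Assumption: the newest run is also the most recently modified directory.
   latest_run_dir() {
       # Default to ./runs under the current directory when no argument is given
       local runs_dir="${1:-$PWD/runs}"
       # -d lists the directories themselves, -t sorts newest first;
       # the trailing slash in the glob matches directories only
       ls -1dt "$runs_dir"/*/ 2>/dev/null | head -n 1
   }

   # Example usage from the repository root (shown as comments, not executed):
   #   job_id=$(sbatch --parsable scripts/prod/aws-run-cimg.sbatch)
   #   sacct -j "$job_id" --format=JobID,State,Elapsed
   #   ls "$(latest_run_dir)"

In the commented usage, ``sbatch --parsable`` prints only the job ID, which makes it easy to pass to ``sacct -j``. If the run directory names sort chronologically, sorting by name would be an alternative to sorting by modification time.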