How to run on the AWS DP HPC cluster using SLURM

This page describes how to run the Continuum Imaging Pipeline (CIMG) on one or more nodes on the AWS DP HPC cluster using SLURM.

The scripts have been tested on the AWS DP HPC cluster.

Prerequisites

  • An account on the AWS DP HPC cluster

  • This repository cloned to a directory on the AWS DP HPC cluster

Steps

1. Submit a SLURM job to run the pipeline

  1. Log in to the AWS DP HPC headnode.

  2. Change directory to the repository root folder, OR set the REPO_DIR environment variable:

    export REPO_DIR=~/path/to/repo/ska-sdp-cimg
    
  3. Edit the SLURM script scripts/prod/aws-run-cimg.sbatch if needed (paths, job parameters, number of nodes, etc.).

    If you require a specific version of the ska-sdp-spack environment, set SPACK_TAG to the required tag, for example:

    export SPACK_TAG="2026.03.2"
    
  4. Submit the SLURM job:

    The script sets up the compute environment using spack and runs the pipeline on a single compute node by default.

    sbatch scripts/prod/aws-run-cimg.sbatch
    

    Output:

    Submitted batch job <job_id>
    

    To run on multiple nodes, override the SLURM directives on the sbatch command line when submitting the job. It is important to set the number of tasks (--ntasks) equal to the total number of nodes (--nodes).

    sbatch --nodes=3 --ntasks=3 --cpus-per-task=96 scripts/prod/aws-run-cimg.sbatch
    
  5. Check job status:

    squeue
    sacct
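
Steps 4 and 5 can be combined in a small wrapper. The following is a minimal sketch, not part of the repository: the function names cimg_sbatch_args and cimg_submit are hypothetical, and it assumes the commands are run from the repository root. It relies on the standard sbatch --parsable option, which prints only the job ID, so that the ID can be passed to squeue and sacct.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names below are hypothetical, not part of the repository.

# Print sbatch options for a multi-node run, keeping --ntasks equal to --nodes.
cimg_sbatch_args() {
    local nodes="${1:?usage: cimg_sbatch_args NODES CPUS_PER_TASK}"
    local cpus="${2:?usage: cimg_sbatch_args NODES CPUS_PER_TASK}"
    echo "--nodes=${nodes} --ntasks=${nodes} --cpus-per-task=${cpus}"
}

# Submit the pipeline from the repository root and print the job ID.
cimg_submit() {
    local args job_id
    args=$(cimg_sbatch_args "$@") || return 1
    # --parsable makes sbatch print "<job_id>" (or "<job_id>;<cluster>").
    # $args is left unquoted on purpose so that each option is a separate word.
    job_id=$(sbatch --parsable $args scripts/prod/aws-run-cimg.sbatch) || return 1
    echo "${job_id%%;*}"
}

# Example, run on the headnode from the repository root:
#   job_id=$(cimg_submit 3 96)
#   squeue --jobs "$job_id"     # queue state while the job is pending or running
#   sacct --jobs "$job_id" --format=JobID,JobName,State,Elapsed,ExitCode
```

Deriving --ntasks from the node count in one place avoids the two values drifting apart when the command is edited by hand.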
    

2. Finishing up

Once the SLURM job has finished, the data products will be available in the latest time-stamped output directory under $PWD/runs for inspection.
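
The most recent output directory can be located from the shell. This is a minimal sketch under two assumptions: run directories sit directly under runs/, and the directory modification time reflects the order of the runs. The function name latest_run_dir is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: latest_run_dir is a hypothetical helper, not part of the repository.

# Print the most recently modified directory under runs/ (default: $PWD/runs).
latest_run_dir() {
    local runs_dir="${1:-$PWD/runs}"
    # -t sorts newest first; -d lists the directories themselves, not their contents.
    # Assumes directory names contain no newlines.
    ls -1td -- "${runs_dir}"/*/ 2>/dev/null | head -n 1
}

# Example:
#   ls -l "$(latest_run_dir)"
```

The function prints nothing if runs/ does not exist or is empty.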

Logs are written to the file paths specified in the SLURM script. These include:

  • slurm-<job_name>-<job_id>.out: standard output and error from the job (including output from WSClean, which is run by the image task in tasks.image.py)

  • versions-<job_name>-<job_id>.txt: a list of versions of key software used in the job, including spack modules and Python packages

  • runs/<job_name>-<job_id>-<timestamp>/monitor-<job_name>-<job_id>/: directory containing per-node benchmarking traces and logs when running on multiple nodes.
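
The naming scheme above can be turned into paths for a given job. This is a minimal sketch: the function name cimg_log_paths is hypothetical, and it assumes that all three paths are relative to one base directory (the current directory by default), which depends on how the SLURM script is configured.

```shell
#!/usr/bin/env bash
# Sketch only: cimg_log_paths is a hypothetical helper, not part of the repository.

# Print the expected log locations for a job, given its name and ID.
cimg_log_paths() {
    local job_name="${1:?usage: cimg_log_paths JOB_NAME JOB_ID [BASE_DIR]}"
    local job_id="${2:?usage: cimg_log_paths JOB_NAME JOB_ID [BASE_DIR]}"
    local base="${3:-$PWD}"   # assumption: logs are written relative to this directory
    local run_dir

    echo "${base}/slurm-${job_name}-${job_id}.out"
    echo "${base}/versions-${job_name}-${job_id}.txt"

    # The timestamp is not known in advance, so match it with a glob and
    # print the monitor directory only if it exists.
    for run_dir in "${base}/runs/${job_name}-${job_id}-"*/; do
        if [ -d "${run_dir}monitor-${job_name}-${job_id}" ]; then
            echo "${run_dir}monitor-${job_name}-${job_id}/"
        fi
    done
}

# Example: follow the job output while it runs
#   tail -f "$(cimg_log_paths <job_name> <job_id> | head -n 1)"
```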