User Guide

The recommended way to run rapthor on the SKAO AWS development cluster is to use the rapthor spack module that is pre-installed (you can see details of the spack package here). Loading this module will also load all of rapthor’s dependencies, including wsclean and dp3.

$ module load ska-sdp-spack
$ module load py-rapthor

Rapthor is now ready to run.

Note

We recommend running rapthor as a SLURM job submitted from the headnode. Example SLURM scripts that set up the required environment variables, then run and benchmark rapthor using SKA tools, are available in the scripts/ directory.

Starting a Rapthor run interactively

If you want to run rapthor directly from the command line (for example for testing or debugging), log into a compute node first.

Important

Do not attempt to run rapthor from the headnode: it does not have sufficient resources for compute-intensive jobs. If you launch rapthor on the headnode, it will likely crash the cluster and disrupt other users.

To log into a compute node interactively, run the following command from the headnode:

$ srun --nodes=1 --partition=any-7i-24xl-spt --cpus-per-task=96 --ntasks-per-node=1 --time=8:00:00 --pty bash -i

Rapthor can then be run from the command line using:

$ rapthor rapthor.parset

where rapthor.parset is the path to your parset file (see the rapthor documentation for details).

Warning

Rapthor attempts to resume from a previous state if output files from a previous run are left in the working directory (see resuming a rapthor run). This means that changes to your parset may not be respected unless you remove or rename the previous output folder and delete the contents of your scratch/temporary directories.
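As a minimal sketch of the clean-up described in the warning above, the helper below renames a previous output folder and empties a scratch directory before a fresh run. The function name archive_previous_run and the timestamp suffix are illustrative assumptions; neither is part of rapthor or the ICAL scripts.

```shell
# Hypothetical helper, not part of rapthor or the ICAL scripts.
# Usage: archive_previous_run <work_dir> <scratch_dir>
archive_previous_run() {
    work_dir=$1
    scratch_dir=$2

    # Rename the old output folder so rapthor finds no state to resume from.
    if [ -d "$work_dir" ]; then
        mv "$work_dir" "${work_dir}.$(date +%Y%m%dT%H%M%S)"
    fi

    # Empty the scratch directory but keep the directory itself.
    # If it lived inside work_dir it has already been moved away.
    if [ -d "$scratch_dir" ]; then
        find "$scratch_dir" -mindepth 1 -delete
    fi
}
```

Renaming rather than deleting keeps the earlier results available for comparison; delete the archived folder yourself once it is no longer needed.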

Running the ska-sdp-ical pipeline scripts

The recommended method of running Rapthor on the SKAO cluster is to submit a SLURM job from the headnode. The ICAL pipeline repo contains an example SLURM script in scripts/run_ical.sbatch, as well as accompanying parset and strategy files in the config/ folder.

Changing the ICAL / rapthor version

In the run_ical.sbatch script, there are a few environment variables you may wish to adjust to control the version of rapthor and its dependencies.

Environment Variables controlling pipeline installation

| Variable | Description |
| --- | --- |
| SPACK_TAG | The ska-sdp-spack release to use, e.g., 2025.12.4. If this is not set, and RAPTHOR_PATH is not set either, the latest ska-sdp-spack deployment will be used. |
| RAPTHOR_PATH | Path to rapthor repository for development installs. If this variable is set, rapthor will be installed from this path. If not set, rapthor will be loaded from the ska-sdp-spack module as described above. |
| RAPTHOR_BRANCH | Branch of rapthor repo to install from if RAPTHOR_PATH is set. If not provided, the master branch is used. This variable is ignored if RAPTHOR_PATH is not set. |
| RAPTHOR_VENV | Path to virtual environment to use for rapthor. If RAPTHOR_VENV is set and points to an existing virtual environment, this environment will be used and the installation will be skipped. If the path does not exist, a new virtual environment will be created at this location and used for the installation. |
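The way these variables interact can be summarised in a few lines of shell. This is only a sketch of the rules stated in the table, not the actual logic in run_ical.sbatch; the function name describe_install is hypothetical, and how SPACK_TAG is treated when RAPTHOR_PATH is set is not specified above, so the sketch does not model it.

```shell
# Illustrative only: mirrors the table above, not the real run_ical.sbatch code.
describe_install() {
    if [ -n "${RAPTHOR_PATH:-}" ]; then
        # Development install; RAPTHOR_BRANCH falls back to master.
        echo "install from ${RAPTHOR_PATH} (branch ${RAPTHOR_BRANCH:-master})"
    elif [ -n "${SPACK_TAG:-}" ]; then
        # Pinned ska-sdp-spack release.
        echo "load py-rapthor from ska-sdp-spack ${SPACK_TAG}"
    else
        # Nothing set: use the latest deployment.
        echo "load py-rapthor from the latest ska-sdp-spack deployment"
    fi
}
```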

Changing the input and output locations

One important way in which the ICAL pipeline scripts differ from running rapthor directly is how the parset file is used. ICAL uses a template parset file, which contains placeholders for environment variables that you can define in your SLURM script instead of hard-coding paths in the parset file. Before rapthor is launched, the parset file is created by substituting the environment variables into the template parset file. This enables easier testing against different configurations without needing to maintain multiple parset files. The location of the template parset file can be controlled by setting the PARSET_PATH environment variable in your SLURM script. The parset that is used by the pipeline will be written to the working directory as defined by the WORK_PATH variable with the name rapthor.parset.
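The substitution step can be pictured with a small, self-contained example. The ${NAME} placeholder syntax, the three-line stand-in template and the example paths are assumptions made for illustration; the real template in the ICAL repository and the tool the pipeline uses to fill it in may differ.

```shell
# Hypothetical example values; real runs would set these in the SLURM script.
WORK_PATH=$(mktemp -d)
INPUT_MS_FULL_PATH=/path/to/input.ms
STRATEGY_PATH=/path/to/strategy.py

# A tiny stand-in template. The ${NAME} placeholder syntax is an assumption.
cat > "$WORK_PATH/template.parset" <<'EOF'
[global]
dir_working = ${WORK_PATH}
input_ms = ${INPUT_MS_FULL_PATH}
strategy = ${STRATEGY_PATH}
EOF

# Replace each placeholder with the value of the matching variable.
# '|' is used as the sed delimiter because the values contain '/'.
sed -e "s|\${WORK_PATH}|${WORK_PATH}|g" \
    -e "s|\${INPUT_MS_FULL_PATH}|${INPUT_MS_FULL_PATH}|g" \
    -e "s|\${STRATEGY_PATH}|${STRATEGY_PATH}|g" \
    "$WORK_PATH/template.parset" > "$WORK_PATH/rapthor.parset"

# Show the parset that rapthor would be given.
cat "$WORK_PATH/rapthor.parset"
```

Because the paths live in variables rather than in the file, the same template can be reused for a different data set by changing only INPUT_MS_FULL_PATH.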

The location of the input data set and strategy file, as well as the temporary directories for your run can all be configured through environment variables.

Note

If you change the PARSET_PATH variable to point to your own custom parset file, which does not include environment variable placeholders (i.e. is not a template file), the environment variables you set in your SLURM script are ignored.

Environment Variables controlling data input and output locations

| Variable | Description | Default |
| --- | --- | --- |
| CODE_PATH | Path to the ska-sdp-ical repository containing the configuration files and scripts. | $HOME/repos/ska-sdp-ical |
| OUTPUT_ROOT | Root output folder for all pipeline runs. This should be a target on the shared FSx storage, rather than in /home which has limited space. | /shared/fsx1/shared/ical-runs/$USER |
| WORK_PATH | Directory where pipeline outputs will be written. Generated using job name and job ID if not provided. | $OUTPUT_ROOT/${SLURM_JOB_NAME}-${SLURM_JOB_ID}/work |
| INPUT_MS_FULL_PATH | Path to the input measurement set to process. | /shared/fsx1/shared/ical_workshop/pi28/data/visibility.scan-400_applybeam.ms |
| PARSET_PATH | Path to the template configuration file (parset) containing all required paths and parameters. | $CODE_PATH/config/example_template.parset |
| STRATEGY_PATH | Path to the strategy file defining the imaging and calibration strategy for the run. | $CODE_PATH/config/example_strategy.py |
| SKYMODEL_PATH | Path to the input skymodel to use for calibration and/or subtraction. This should be a skymodel that is appropriate for the input data and science case you are testing. | None |
| TMPDIR | Temporary directory used by toil for intermediate files created during the run. This will be used to set local_scratch_dir and global_scratch_dir in the parset file. | $WORK_PATH/tmp |
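Several of these defaults are built from one another, so overriding one variable changes the others. The sketch below reproduces the defaults from the table with standard shell parameter expansion; the function name apply_defaults is hypothetical and the actual script may derive the values differently.

```shell
# Illustrative defaults copied from the table above; the real script may differ.
# ${VAR:=value} assigns value only if VAR is unset or empty.
apply_defaults() {
    : "${CODE_PATH:=$HOME/repos/ska-sdp-ical}"
    : "${OUTPUT_ROOT:=/shared/fsx1/shared/ical-runs/$USER}"
    : "${WORK_PATH:=$OUTPUT_ROOT/${SLURM_JOB_NAME}-${SLURM_JOB_ID}/work}"
    : "${PARSET_PATH:=$CODE_PATH/config/example_template.parset}"
    : "${STRATEGY_PATH:=$CODE_PATH/config/example_strategy.py}"
    : "${TMPDIR:=$WORK_PATH/tmp}"
}
```

For example, setting only OUTPUT_ROOT moves WORK_PATH and TMPDIR with it, while setting WORK_PATH directly leaves OUTPUT_ROOT unused.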

Warning

Due to storage limits on the default /tmp directory on AWS, it is best to create a new temporary folder on the shared /shared/fsx1 directory. This is necessary because rapthor uses toil/cwl, which creates intermediate files in TMPDIR during the run; for long-running jobs these files may exceed the space available on /tmp.

Note

The filter_skymodel step will always set /tmp as the temporary directory. This is a workaround for the length limit on socket file paths (107 bytes on Linux), which causes problems with long path names during multiprocessing (used by pybdsf). Because Toil builds the names of its temporary storage directories from random hexadecimal strings, paths under global_scratch_dir and local_scratch_dir can exceed this limit, resulting in errors.
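To judge whether a given scratch location is at risk, you can measure the byte length of a candidate socket path. The helper below is a hypothetical sketch that uses the 107-byte figure quoted in the note; it is not part of rapthor or Toil.

```shell
# Hypothetical helper: succeeds if the path fits within the 107-byte limit.
socket_path_fits() {
    # wc -c counts bytes, which is what the limit applies to.
    # The arithmetic expansion strips any padding some wc versions add.
    n=$(( $(printf '%s' "$1" | wc -c) ))
    [ "$n" -le 107 ]
}
```

A call such as socket_path_fits "$TMPDIR/some/long/toil/path/socket" || echo "path too long" shows how deep a scratch directory can be before the limit is hit.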

Once you are satisfied with the configuration, submit the script as a SLURM job using sbatch. You may specify the job name and partition, and any other SLURM parameters you wish. The command below will allocate a single compute node and run all workflows on this node.

$ sbatch --job-name=my-ical-run \
    --partition=any-7i-48xl-spt --nodes=1 --cpus-per-task=192 \
    run_ical.sbatch

You may also adjust any of the environment variables described above to control the run by passing them to SLURM using the --export argument.

$ sbatch --job-name=my-ical-run \
    --partition=any-7i-48xl-spt --nodes=1 --cpus-per-task=192 \
    --export=PARSET_PATH=/path/to/your/rapthor.parset,WORK_PATH=/path/to/your/work_dir \
    run_ical.sbatch

You may also choose to keep your environment variables in a separate file and pass this file to SLURM using the --export-file argument. For example, for a file ical_test_config.sh containing:

CODE_PATH=/home/user/repos/ska-sdp-ical
WORK_PATH=/shared/fsx1/user/work/pipelines/ical/test_run
PARSET_PATH=/path/to/my/rapthor.parset
STRATEGY_PATH=/path/to/my/strategy.py
TMPDIR=/dev/shm

Note that this file is interpreted as plain text containing key-value pairs; it does not support variable expansion or other shell features. You may pass this configuration file to SLURM when you submit the job as follows:

$ sbatch --job-name=my-ical-run \
    --partition=any-7i-48xl-spt --nodes=1 --cpus-per-task=192 \
    --export-file=ical_test_config.sh \
    run_ical.sbatch

Running rapthor on multiple nodes

To run rapthor on multiple nodes, pass the --nodes argument to the sbatch command. The pipeline script picks up the node count from SLURM and sets the environment variable RAPTHOR_BATCH_SYSTEM to slurm_static for multi-node runs, or to single_machine for single-node runs. With slurm_static, the compute resources are assigned when the SLURM job is submitted, based on the parameters passed to the sbatch command or read from the script header. Rapthor then uses these pre-allocated nodes for all downstream workflows during the run.

$ sbatch --job-name=ical-multi-node-run \
    --partition=any-7i-48xl-spt --nodes=4 --cpus-per-task=192 \
    run_ical.sbatch
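The selection between the two batch systems can be sketched as follows. SLURM sets SLURM_JOB_NUM_NODES inside a job, but whether the pipeline script reads this particular variable is an assumption, and the function name choose_batch_system is hypothetical.

```shell
# Sketch only: pick the batch system from the number of allocated nodes.
# SLURM_JOB_NUM_NODES is set by SLURM inside a job; assume 1 node otherwise.
choose_batch_system() {
    if [ "${SLURM_JOB_NUM_NODES:-1}" -gt 1 ]; then
        echo slurm_static
    else
        echo single_machine
    fi
}

RAPTHOR_BATCH_SYSTEM=$(choose_batch_system)
export RAPTHOR_BATCH_SYSTEM
```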

Benchmarking your runs

To activate resource monitoring via benchmon, set the REPORT_PATH environment variable in the SLURM script to the directory where the benchmarking reports should be written. If this variable is not set (the default), no benchmarking is performed.

Environment Variables controlling benchmark monitoring

| Variable | Description |
| --- | --- |
| REPORT_PATH | Specify directory for benchmarking monitoring output. If this variable is defined, benchmarking will be enabled. If not defined, no benchmarking will be done. |
| BENCHMON_PARAMS | Optionally specify parameters to pass to benchmon for monitoring. Defaults to "--sys --sys-freq 1 --call --call-prof-freq 1 --save-dir $REPORT_PATH" |

Note

When benchmarking rapthor, you must use the slurm_static batch system, since it is currently not possible to monitor dynamically allocated nodes.

Note

It is recommended to use the “metal” partitions when benchmarking rapthor runs.

The following command will submit a multi-node rapthor run with benchmarking enabled:

$ sbatch --job-name=ical-multi-node-run \
    --partition=any-7i-metal-48xl-spt --nodes=4 --cpus-per-task=192 \
    --export=WORK_PATH=/path/to/work_dir,REPORT_PATH=/path/to/work_dir/benchmarks \
    run_ical.sbatch

Known issues

  • When using batch_system = slurm, the “leader” node will be idle for most of the rapthor run. Toil uses this node to orchestrate the allocation of other nodes. A further node will be idle during imaging steps if mpi is enabled since this node is only used to allocate additional nodes for wsclean-mp. Furthermore, reduced node availability during a run can result in delays in executing downstream workflows.

  • “Argument list too long” errors can occur when the environment is too large. If this happens, clearing variables that are not used at run time, such as CMAKE_PREFIX_PATH, may help avoid these errors.

    $ export ACLOCAL_PATH=
    $ export CMAKE_PREFIX_PATH=
    $ export MANPATH=
    $ export PKG_CONFIG_PATH=
    

    Rationale: Rapthor has many direct and indirect dependencies, and each dependency adds itself to environment variables like PATH and PYTHONPATH. The environment therefore grows considerably when Rapthor is loaded using either module load py-rapthor or spack load py-rapthor.
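To see how close your environment is to the limit, you can compare its size with the system's ARG_MAX value, which bounds the combined size of the arguments and environment passed to a new process. The snippet below is a diagnostic sketch using only standard tools; the variable names env_bytes and arg_max are illustrative.

```shell
# Total size of the current environment, in bytes.
env_bytes=$(( $(env | wc -c) ))

# System limit on arguments plus environment for a new process.
arg_max=$(getconf ARG_MAX)

echo "environment: ${env_bytes} bytes, ARG_MAX: ${arg_max} bytes"

# List the five largest variables by length.
# Approximate: a value containing newlines is counted once per line.
env | awk '{ print length($0), substr($0, 1, index($0, "=") - 1) }' \
    | sort -rn | head -n 5
```

Running this before and after module load py-rapthor shows which variables grow the most and are therefore the best candidates to clear.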

Notes on resource allocation

There are currently two resource allocation strategies available to distribute rapthor workflows across multiple compute nodes on the cluster. These are selected with the batch_system parameter in the parset file. If you are using the template parset file from the ICAL repository, this option is automatically set to either slurm_static or single_machine based on the number of allocated nodes. However, the slurm option is also available for some use cases. The behavior of each multi-node option is described below:

  • slurm_static: This option assigns the compute resources when submitting the SLURM job based on the parameters passed to the sbatch command or read from the script header. Rapthor will use these pre-allocated nodes for all downstream workflows during the run. This is the recommended way of running rapthor on multiple nodes.

    Warning

    When using this option, the parameters in the [cluster] section of the parset file that control node resource allocation (e.g., max_cores, mem_per_node_gb, cpus_per_task) are ignored, because the resources are allocated according to the sbatch parameters at the time the script is submitted.

  • slurm: This option uses Toil’s SLURM batch system to dynamically allocate and release compute nodes as needed during the rapthor run. The partition that Toil uses to allocate nodes can be specified by setting the SALLOC_PARTITION environment variable.

    Warning

    Ensure that max_cores and max_threads match the nodes on the partition(s) you specify in your SLURM script. If you request more cores than are available, rapthor will fail to run.