User Guide
The recommended way to run rapthor on the SKAO AWS development cluster is to use the rapthor spack module that is pre-installed (you can see details of the spack package here). Loading this module will also load all of rapthor’s dependencies, including wsclean and dp3.
$ module load ska-sdp-spack
$ module load py-rapthor
Rapthor is now ready to run.
Note
We recommend running rapthor as a SLURM job submitted from the headnode. Example SLURM scripts that set up the required environment variables, then run and benchmark rapthor using SKA tools, are available in the scripts/ directory.
Starting a Rapthor run interactively
If you want to run rapthor directly from the command line (for example for testing or debugging), log into a compute node first.
Important
Do not attempt to run rapthor from the headnode: it does not have sufficient resources for compute-intensive jobs. If you launch rapthor on the headnode, it will likely crash the cluster and disrupt other users.
To log into a compute node interactively, run the following command from the headnode:
$ srun --nodes=1 --partition=any-7i-24xl-spt --cpus-per-task=96 --ntasks-per-node=1 --time=8:00:00 --pty bash -i
Rapthor can then be run from the command line using:
$ rapthor rapthor.parset
where rapthor.parset is the parset file that configures the run (see the rapthor documentation for details).
Warning
Rapthor attempts to resume from a previous state if output files from a previous run are left in the working directory (see resuming a rapthor run). This means that changes to your parset may not be respected unless you remove or rename the previous output folder and delete the contents of your scratch/temporary directories.
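One way to avoid picking up stale state is to move the previous output folder aside before relaunching. The sketch below is a minimal illustration: the function name archive_previous_run and the timestamp suffix are assumptions made for this example and are not part of rapthor. Scratch and temporary directories still need to be cleaned separately.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of rapthor): rename an existing output
# directory so that the next run starts from a clean state.
archive_previous_run() {
    local out_dir="${1%/}"   # strip a trailing slash, if any
    if [ -d "$out_dir" ]; then
        local stamp
        stamp="$(date +%Y%m%d-%H%M%S)"
        mv -- "$out_dir" "${out_dir}.prev-${stamp}"
        # Print the new location so the caller can log it.
        echo "${out_dir}.prev-${stamp}"
    fi
}

# Example usage (the path is a placeholder):
# archive_previous_run /path/to/your/work_dir
```

Renaming rather than deleting keeps the earlier results available for comparison.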
Running the ska-sdp-ical pipeline scripts
The recommended method of running Rapthor on the SKAO cluster is to submit a SLURM job from the headnode. The ICAL pipeline repo contains an example SLURM script in scripts/run_ical.sbatch, as well as accompanying parset and strategy files in the config/ folder.
Changing the ICAL / rapthor version
In the run_ical.sbatch script, there are a few environment variables you may wish to adjust to control the version of rapthor and its dependencies.
| Variable | Description |
|---|---|
| | The ska-sdp-spack release to use, e.g., |
| | Path to rapthor repository for development installs. If this variable is set, rapthor will be installed from this path. If not set, rapthor will be loaded from the ska-sdp-spack module as described above. |
| | Branch of rapthor repo to install from if |
| | Path to virtual environment to use for rapthor. If |
Changing the input and output locations
One important way in which the ICAL pipeline scripts differ from running rapthor directly is how the
parset file is used. ICAL uses a template parset file, which contains placeholders for environment
variables that you can define in your SLURM script instead of hard-coding paths in the parset file.
Before rapthor is launched, the parset file is created by substituting the environment variables
into the template parset file. This enables easier testing against different configurations without
needing to maintain multiple parset files. The location of the template parset file can be
controlled by setting the PARSET_PATH environment variable in your SLURM script. The parset that
is used by the pipeline will be written to the working directory as defined by the WORK_PATH
variable with the name rapthor.parset.
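The substitution step can be pictured with the minimal sketch below. It assumes placeholders written as ${NAME} and uses sed to replace two of them. The placeholder syntax, the function name render_parset, and the mechanism itself are assumptions for illustration, and the pipeline script may work differently.

```shell
#!/usr/bin/env bash
# Illustrative only: render a parset from a template by replacing
# ${WORK_PATH} and ${STRATEGY_PATH} placeholders (assumed syntax).
# Values containing '|', '&' or '\' would need escaping for sed.
render_parset() {
    local template="$1" output="$2"
    sed -e "s|[\$]{WORK_PATH}|${WORK_PATH}|g" \
        -e "s|[\$]{STRATEGY_PATH}|${STRATEGY_PATH}|g" \
        "$template" > "$output"
}

# Example usage (paths are placeholders):
# export WORK_PATH=/path/to/your/work_dir
# export STRATEGY_PATH=/path/to/my/strategy.py
# render_parset template.parset "$WORK_PATH/rapthor.parset"
```

Because the rendered file is regenerated for every job, the same template can be reused across runs that differ only in their environment variables.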
The location of the input data set and strategy file, as well as the temporary directories for your run can all be configured through environment variables.
Note
If you point the PARSET_PATH variable to your own custom parset file that does not include
environment variable placeholders (i.e. is not a template file), the environment variables you
set in your SLURM script will not be substituted into it and are effectively ignored.
| Variable | Description | Default |
|---|---|---|
| | Path to the ska-sdp-ical repository containing the configuration files and scripts. | |
| | Root output folder for all pipeline runs. This should be a target on the shared FSx storage, rather than in /home which has limited space. | |
| | Directory where pipeline outputs will be written. Generated using job name and job ID if not provided. | |
| | Path to the input measurement set to process. | |
| | Path to the template configuration file (parset) containing all required paths and parameters. | |
| | Path to the strategy file defining the imaging and calibration strategy for the run. | |
| | Path to the input skymodel to use for calibration and/or subtraction. This should be a skymodel that is appropriate for the input data and science case you are testing. | |
| | Temporary directory used by toil for intermediate files created during the run. This will be used to set | |
Warning
Because the default /tmp directory on AWS has limited storage, it is best to create a new
temporary folder on the shared /shared/fsx1 directory. Rapthor uses toil/cwl to create
intermediate files in TMPDIR during the run, and for long-running jobs these files may exceed
the space available on /tmp.
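A per-run temporary folder can be created with mktemp, as in the minimal sketch below. The function name make_run_tmpdir, the ical-tmp prefix, and the sub-path under /shared/fsx1 in the usage comment are placeholders chosen for this example.

```shell
#!/usr/bin/env bash
# Sketch: create a unique temporary directory below a base directory and
# print its path. The base directory is supplied by the caller.
make_run_tmpdir() {
    local base="$1"
    mkdir -p "$base" || return 1
    mktemp -d "${base%/}/ical-tmp.XXXXXX"
}

# Example usage (the sub-path below /shared/fsx1 is a placeholder):
# export TMPDIR="$(make_run_tmpdir /shared/fsx1/user/tmp)"
```

Using a unique folder per run makes it easier to delete intermediate files afterwards without touching other jobs.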
Note
The filter_skymodel step will always set /tmp as the temporary directory. This is a
workaround for socket file paths having a length limit (107 bytes on unix systems), which causes
issues with long path names during multiprocessing (used by pybdsf). Toil creates path names for
temporary storage files using random hexadecimal strings, so when the base location of the
temporary storage paths global_scratch_dir and local_scratch_dir is already long, the full
path can exceed this limit, resulting in errors.
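To judge whether a given path is at risk, its length in bytes can be compared with the 107-byte limit quoted above. The sketch below is only a rough check: the function names are illustrative, and the length of the suffix that Toil or pybdsf appends to a scratch directory is not known here, so the full candidate path has to be supplied by the caller.

```shell
#!/usr/bin/env bash
# Sketch: count the bytes in a path and compare with the 107-byte limit.
socket_path_bytes() {
    # wc -c counts bytes, which is the unit the socket limit is expressed in.
    printf '%s' "$1" | wc -c | tr -d '[:space:]'
}

fits_socket_limit() {
    [ "$(socket_path_bytes "$1")" -le 107 ]
}

# Example usage (the path is a placeholder):
# fits_socket_limit "/path/to/scratch/some/long/socket/path" || echo "path too long"
```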
Once you are satisfied with the configuration, submit the script as a SLURM job using sbatch. You may specify the job name and partition, and any other SLURM parameters you wish. The command below will allocate a single compute node and run all workflows on this node.
$ sbatch --job-name=my-ical-run \
--partition=any-7i-48xl-spt --nodes=1 --cpus-per-task=192 \
run_ical.sbatch
You may also adjust any of the environment variables described above to control the run by
passing them to SLURM using the --export argument.
$ sbatch --job-name=my-ical-run \
--partition=any-7i-48xl-spt --nodes=1 --cpus-per-task=192 \
--export=PARSET_PATH=/path/to/your/rapthor.parset,WORK_PATH=/path/to/your/work_dir \
run_ical.sbatch
You may also keep your environment variables in a separate file and pass this file to
SLURM using the --export-file argument. For example, for a file ical_test_config.sh
containing:
CODE_PATH=/home/user/repos/ska-sdp-ical
WORK_PATH=/shared/fsx1/user/work/pipelines/ical/test_run
PARSET_PATH=/path/to/my/rapthor.parset
STRATEGY_PATH=/path/to/my/strategy.py
TMPDIR=/dev/shm
Note that this file is interpreted as plain text containing key-value pairs; it does not support variable expansion or other shell features. You may pass this configuration file to SLURM when you submit the job as follows:
$ sbatch --job-name=my-ical-run \
--partition=any-7i-48xl-spt --nodes=1 --cpus-per-task=192 \
--export-file=ical_test_config.sh \
run_ical.sbatch
Running rapthor on multiple nodes
To run rapthor on multiple nodes, pass the --nodes argument to the sbatch command. The
pipeline script picks up the number of nodes from SLURM and sets the environment variable
RAPTHOR_BATCH_SYSTEM to slurm_static for multi-node runs, or to single_machine for single-node
runs. With the slurm_static option, the compute resources are assigned when the SLURM job is
submitted, based on the parameters passed to the sbatch command or read from the script header.
Rapthor then uses these pre-allocated nodes for all downstream workflows during the run.
$ sbatch --job-name=ical-multi-node-run \
--partition=any-7i-48xl-spt --nodes=4 --cpus-per-task=192 \
run_ical.sbatch
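The selection rule described above can be sketched as follows. This is an illustration rather than the actual pipeline script: it assumes the node count is read from SLURM_JOB_NUM_NODES (which SLURM sets inside a job), and the function name select_batch_system is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the selection rule: more than one node selects slurm_static,
# otherwise single_machine. Defaults to 1 node outside of a SLURM job.
select_batch_system() {
    local nodes="${1:-${SLURM_JOB_NUM_NODES:-1}}"
    if [ "$nodes" -gt 1 ]; then
        echo "slurm_static"
    else
        echo "single_machine"
    fi
}

# Example usage inside a job script:
# export RAPTHOR_BATCH_SYSTEM="$(select_batch_system)"
```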
Benchmarking your runs
To activate resource monitoring via benchmon, set the REPORT_PATH environment variable in the
SLURM script to the directory where the benchmarking reports should be written. If this variable
is not set (the default), no benchmarking is performed.
| Variable | Description |
|---|---|
| | Specify directory for benchmarking monitoring output. If this variable is defined, benchmarking will be enabled. If not defined, no benchmarking will be done. |
| | Optionally specify parameters to pass to benchmon for monitoring. Defaults to "--sys --sys-freq 1 --call --call-prof-freq 1 --save-dir $REPORT_PATH" |
Note
When benchmarking rapthor, you must use the slurm_static batch system, since it is
currently not possible to monitor dynamically allocated nodes.
Note
It is recommended to use the “metal” partitions when benchmarking rapthor runs.
The following command will submit a multi-node rapthor run with benchmarking enabled:
$ sbatch --job-name=ical-multi-node-run \
--partition=any-7i-metal-48xl-spt --nodes=4 --cpus-per-task=192 \
--export=WORK_PATH=/path/to/work_dir,REPORT_PATH=/path/to/work_dir/benchmarks \
run_ical.sbatch
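The on/off behaviour controlled by REPORT_PATH can be sketched as below. This only illustrates the decision logic and does not invoke benchmon. BENCHMON_PARAMS is a hypothetical name for the optional parameters variable, and the default string is taken from the table above with its option dashes read as double hyphens.

```shell
#!/usr/bin/env bash
# Sketch: print the benchmon parameters if benchmarking is enabled,
# or return non-zero if REPORT_PATH is unset or empty.
# BENCHMON_PARAMS is a hypothetical variable name used for illustration.
benchmark_params() {
    if [ -z "${REPORT_PATH:-}" ]; then
        return 1   # benchmarking disabled
    fi
    local default="--sys --sys-freq 1 --call --call-prof-freq 1 --save-dir ${REPORT_PATH}"
    printf '%s\n' "${BENCHMON_PARAMS:-$default}"
}

# Example usage:
# if params="$(benchmark_params)"; then echo "benchmarking with: $params"; fi
```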
Known issues
When using batch_system = slurm, the “leader” node will be idle for most of the rapthor run. Toil uses this node to orchestrate the allocation of other nodes. A further node will be idle during imaging steps if mpi is enabled, since this node is only used to allocate additional nodes for wsclean-mp. Furthermore, reduced node availability during a run can result in delays in executing downstream workflows.
“Argument list too long” errors can occur when the environment is too large. If this happens, clearing variables such as CMAKE_PREFIX_PATH, which are not used at run time, may help avoid these errors:
$ export ACLOCAL_PATH=
$ export CMAKE_PREFIX_PATH=
$ export MANPATH=
$ export PKG_CONFIG_PATH=
Rationale: Rapthor has many direct and indirect dependencies, and each dependency adds itself to environment variables like PATH and PYTHONPATH, so the size of the environment grows considerably when Rapthor is loaded using either module load py-rapthor or spack load py-rapthor.
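To see how large the environment has become after loading the module, its size can be compared with the system limit reported by getconf ARG_MAX. The helper names below are illustrative, and the per-variable sizes are approximate for values that span several lines.

```shell
#!/usr/bin/env bash
# Sketch: measure the current environment and list its largest variables.
env_size_bytes() {
    env | wc -c | tr -d '[:space:]'
}

report_env_size() {
    echo "environment: $(env_size_bytes) bytes, ARG_MAX: $(getconf ARG_MAX) bytes"
}

# Print "<length> <name>" for the five largest variables, largest first.
largest_env_vars() {
    env | awk -F= '{ print length($0), $1 }' | sort -rn | head -n 5
}

# Example usage:
# report_env_size
# largest_env_vars
```

On Linux a single environment string is also subject to its own limit (128 KiB with 4 KiB pages), so one very long path-style variable can trigger the error even when the total is below ARG_MAX.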
Notes on resource allocation
There are currently two resource allocation strategies available to distribute rapthor workflows
across multiple compute nodes on the cluster. These are selected by the batch_system parameter in
the parset file. If you are using the template parset file from the ical repository, this
option will automatically be set to either slurm_static or single_machine based on the
number of allocated nodes. However, the slurm option is also available for some use cases. The
behavior of each option is described below:
slurm_static: This option assigns the compute resources when submitting the SLURM job based on the parameters passed to the sbatch command or read from the script header. Rapthor will use these pre-allocated nodes for all downstream workflows during the run. This is the recommended way of running rapthor on multiple nodes.
Warning
When using this option, the parameters in the [cluster] section of the parset file pertaining to node resource allocation (e.g., max_cores, mem_per_node_gb, cpus_per_task) are ignored, since the resources are allocated according to the sbatch parameters at the time the script is submitted.
slurm: This option uses Toil’s SLURM batch system to dynamically allocate and release compute nodes as needed during the rapthor run. The partition that Toil uses to allocate nodes can be specified by setting the SALLOC_PARTITION environment variable.
Warning
Ensure you match max_cores and max_threads to the nodes on the partition(s) you specify in your SLURM script. If you specify more cores than are available, rapthor will fail to run.