.. _running_skao:

User Guide
==========

The recommended way to run rapthor on the SKAO AWS development cluster is to use
the rapthor spack module that is pre-installed (you can see details of the spack
package `here `_). Loading this module will also load all of rapthor's
dependencies, including `wsclean` and `dp3`.

.. code-block:: console

   $ module load ska-sdp-spack
   $ module load py-rapthor

Rapthor is now ready to run.

.. note::

   We recommend running rapthor as a SLURM job submitted from the headnode.
   Example SLURM scripts that set up the required environment variables, then
   run and benchmark rapthor using SKA tools, are available in the `scripts/`
   directory.

.. _starting_rapthor_skao:

Starting a Rapthor run interactively
------------------------------------

If you want to run rapthor directly from the command line (for example for
testing or debugging), log into a compute node first.

.. important::

   Do not attempt to run rapthor from the headnode; it does not have sufficient
   resources for compute-intensive jobs. If you launch rapthor on the headnode,
   it will likely crash the cluster and disrupt other users.

To log into a compute node interactively, run the following command from the
headnode:

.. code-block:: console

   $ srun --nodes=1 --partition=any-7i-24xl-spt --cpus-per-task=96 --ntasks-per-node=1 --time=8:00:00 --pty bash -i

Rapthor can then be run from the command line using:

.. code-block:: console

   $ rapthor rapthor.parset

where ``rapthor.parset`` is the parset (see `rapthor documentation for details `_).

.. warning::

   Rapthor attempts to resume from a previous state if output files from a
   previous run are left in the working directory (see
   `resuming a rapthor run `_). This means that changes to your parset may not
   be respected unless you remove or rename the previous output folder and
   delete the contents of your scratch/temporary directories.

.. _running_ical_slurm:

Running the ska-sdp-ical pipeline scripts
-----------------------------------------

The recommended method of running Rapthor on the SKAO cluster is to submit a
SLURM job from the headnode. The ICAL pipeline repo contains an example SLURM
script in `scripts/run_ical.sbatch `_, as well as accompanying parset and
strategy files in the `config/ `_ folder.

Changing the ICAL / rapthor version
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the `run_ical.sbatch `_ script, there are a few environment variables you may
wish to adjust to control the version of rapthor and its dependencies.

.. list-table:: Environment Variables controlling pipeline installation
   :widths: 20 80
   :header-rows: 1

   - - Variable
     - Description
   - - ``SPACK_TAG``
     - The ska-sdp-spack release to use, e.g., ``2025.12.4``. If this is not
       set, and ``RAPTHOR_PATH`` is not set either, the latest ska-sdp-spack
       deployment will be used.
   - - ``RAPTHOR_PATH``
     - Path to rapthor repository for development installs. If this variable is
       set, rapthor will be installed from this path. If not set, rapthor will
       be loaded from the ska-sdp-spack module as described above.
   - - ``RAPTHOR_BRANCH``
     - Branch of rapthor repo to install from if ``RAPTHOR_PATH`` is set. If not
       provided, the ``master`` branch is used. This variable is ignored if
       ``RAPTHOR_PATH`` is not set.
   - - ``RAPTHOR_VENV``
     - Path to virtual environment to use for rapthor. If ``RAPTHOR_VENV`` is
       set and points to an existing virtual environment, this environment will
       be used and the installation will be skipped. If the path does not exist,
       a new virtual environment will be created at this location and used for
       the installation.

Changing the input and output locations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One important way in which the ICAL pipeline scripts differ from running rapthor
directly is how the parset file is used.
ICAL uses a template parset file, which contains placeholders for environment
variables that you can define in your SLURM script instead of hard-coding paths
in the parset file. Before rapthor is launched, the parset file is created by
substituting the environment variables into the template parset file. This
enables easier testing against different configurations without needing to
maintain multiple parset files.

The location of the template parset file can be controlled by setting the
``PARSET_PATH`` environment variable in your SLURM script. The parset that is
used by the pipeline will be written to the working directory as defined by the
``WORK_PATH`` variable with the name `rapthor.parset`.

The location of the input data set and strategy file, as well as the temporary
directories for your run, can all be configured through environment variables.

.. note::

   If you change the ``PARSET_PATH`` variable to point to your own custom parset
   file, which does not include environment variable placeholders (i.e. is not a
   template file), the environment variables you set in your SLURM script are
   ignored.

.. list-table:: Environment Variables controlling data input and output locations
   :widths: 20 60 20
   :header-rows: 1

   - - Variable
     - Description
     - Default
   - - ``CODE_PATH``
     - Path to the ska-sdp-ical repository containing the configuration files
       and scripts.
     - ``$HOME/repos/ska-sdp-ical``
   - - ``OUTPUT_ROOT``
     - Root output folder for all pipeline runs. This should be a target on the
       shared FSx storage, rather than in ``/home``, which has limited space.
     - ``/shared/fsx1/shared/ical-runs/$USER``
   - - ``WORK_PATH``
     - Directory where pipeline outputs will be written. Generated using job
       name and job ID if not provided.
     - ``$OUTPUT_ROOT/${SLURM_JOB_NAME}-${SLURM_JOB_ID}/work``
   - - ``INPUT_MS_FULL_PATH``
     - Path to the input measurement set to process.
     - ``/shared/fsx1/shared/ical_workshop/pi28/data/visibility.scan-400_applybeam.ms``
   - - ``PARSET_PATH``
     - Path to the template configuration file (parset) containing all required
       paths and parameters.
     - ``$CODE_PATH/config/example_template.parset``
   - - ``STRATEGY_PATH``
     - Path to the strategy file defining the imaging and calibration strategy
       for the run.
     - ``$CODE_PATH/config/example_strategy.py``
   - - ``SKYMODEL_PATH``
     - Path to the input skymodel to use for calibration and/or subtraction.
       This should be a skymodel that is appropriate for the input data and
       science case you are testing.
     - ``None``
   - - ``TMPDIR``
     - Temporary directory used by toil for intermediate files created during
       the run. This will be used to set ``local_scratch_dir`` and
       ``global_scratch_dir`` in the parset file.
     - ``$WORK_PATH/tmp``

.. warning::

   Due to storage limits on the default ``/tmp`` directory on AWS, it is best to
   create a new temporary folder on the shared ``/shared/fsx1`` directory. This
   is necessary because rapthor uses toil/cwl, which creates intermediate files
   in ``TMPDIR`` during the run, and these may exceed the available space on
   ``/tmp`` for long-running jobs.

.. note::

   The filter_skymodel step will always set ``/tmp`` as the temporary directory.
   This is a workaround for socket file paths having a character limit (107
   bytes on unix systems), causing issues with long path names during
   multiprocessing (used by pybdsf). Since Toil creates path names for temporary
   storage files using random hexadecimal strings, the base location of the
   temporary storage paths ``global_scratch_dir`` and ``local_scratch_dir`` can
   be too long, resulting in errors.

Once you are satisfied with the configuration, submit the script as a SLURM job
using `sbatch `_. You may specify the job name and partition, and any other
SLURM parameters you wish. The command below will allocate a single compute node
and run all workflows on this node.

.. code-block:: console

   $ sbatch --job-name=my-ical-run \
       --partition=any-7i-48xl-spt --nodes=1 --cpus-per-task=192 \
       run_ical.sbatch

You may also adjust any of the environment variables described above to control
the run, by passing them to SLURM using the ``--export`` argument.

.. code-block:: console

   $ sbatch --job-name=my-ical-run \
       --partition=any-7i-48xl-spt --nodes=1 --cpus-per-task=192 \
       --export=PARSET_PATH=/path/to/your/rapthor.parset,WORK_PATH=/path/to/your/work_dir \
       run_ical.sbatch

You may also choose to keep a separate file for your environment variables and
pass this file to SLURM using the ``--export-file`` argument. For example, for a
file `ical_test_config.sh` containing:

.. code-block:: bash

   CODE_PATH=/home/user/repos/ska-sdp-ical
   WORK_PATH=/shared/fsx1/user/work/pipelines/ical/test_run
   PARSET_PATH=/path/to/my/rapthor.parset
   STRATEGY_PATH=/path/to/my/strategy.py
   TMPDIR=/dev/shm

Note that this file is interpreted as text with key-value pairs and does not
support variable expansions and other shell features. You may pass this
configuration file to SLURM when you submit the job as follows:

.. code-block:: console

   $ sbatch --job-name=my-ical-run \
       --partition=any-7i-48xl-spt --nodes=1 --cpus-per-task=192 \
       --export-file=ical_test_config.sh \
       run_ical.sbatch

Running rapthor on multiple nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To run rapthor on multiple nodes, you can pass the ``--nodes`` argument to the
`sbatch` command. The pipeline script will pick up this number from SLURM and
set the environment variable ``RAPTHOR_BATCH_SYSTEM`` to ``slurm_static`` for
multi-node runs, or ``single_machine`` for single-node runs. The
``slurm_static`` option assigns the compute resources when submitting the SLURM
job based on the parameters passed to the sbatch command or read from the script
header. Rapthor will then use these pre-allocated nodes for all downstream
workflows during the run.

.. code-block:: console

   $ sbatch --job-name=ical-multi-node-run \
       --partition=any-7i-48xl-spt --nodes=4 --cpus-per-task=192 \
       run_ical.sbatch

Benchmarking your runs
~~~~~~~~~~~~~~~~~~~~~~

To activate resource monitoring via `benchmon `_, you can set the
``REPORT_PATH`` environment variable in the SLURM script to specify a directory
where the benchmarking reports will be written. If this variable is not set (the
default), no benchmarking will be performed.

.. list-table:: Environment Variables controlling benchmark monitoring
   :widths: 20 80
   :header-rows: 1

   - - Variable
     - Description
   - - ``REPORT_PATH``
     - Specify directory for benchmarking monitoring output. If this variable is
       defined, benchmarking will be enabled. If not defined, no benchmarking
       will be done.
   - - ``BENCHMON_PARAMS``
     - Optionally specify parameters to pass to benchmon for monitoring.
       Defaults to `"--sys --sys-freq 1 --call --call-prof-freq 1 --save-dir $REPORT_PATH"`

.. note::

   When benchmarking rapthor, it is required to use the ``slurm_static`` batch
   system, since it is currently not possible to monitor the dynamically
   allocated nodes.

.. note::

   It is recommended to use the "metal" partitions when benchmarking rapthor
   runs.

The following command will submit a multi-node rapthor run with benchmarking
enabled:

.. code-block:: console

   $ sbatch --job-name=ical-multi-node-run \
       --partition=any-7i-metal-48xl-spt --nodes=4 --cpus-per-task=192 \
       --export=WORK_PATH=/path/to/work_dir,REPORT_PATH=/path/to/work_dir/benchmarks \
       run_ical.sbatch

Known issues
------------

- When using ``batch_system = slurm``, the "leader" node will be idle for most
  of the rapthor run. Toil uses this node to orchestrate the allocation of other
  nodes. A further node will be idle during imaging steps if mpi is enabled
  since this node is only used to allocate additional nodes for ``wsclean-mp``.
  Furthermore, reduced node availability during a run can result in delays in
  executing downstream workflows.
- "Argument list too long" errors can occur when the environment is too large.
  If this happens, clearing variables like `CMAKE_PREFIX_PATH`, which are not
  used at run time, may help avoid these errors.

  .. code-block:: console

     $ export ACLOCAL_PATH=
     $ export CMAKE_PREFIX_PATH=
     $ export MANPATH=
     $ export PKG_CONFIG_PATH=

  Rationale: Since Rapthor has many (in)direct dependencies, and each dependency
  adds itself to environment variables like ``PATH`` and ``PYTHONPATH``, the
  size of the environment grows considerably when loading Rapthor using either
  ``module load py-rapthor`` or ``spack load py-rapthor``.

Notes on resource allocation
----------------------------

There are currently two resource allocation strategies available to distribute
rapthor workflows across multiple compute nodes on the cluster. These are
captured by the `batch_system` parameter in the parset file. If you are using
the template parset file from the ical repository, this option will
automatically be set to either ``slurm_static`` or ``single_machine`` based on
the number of allocated nodes. However, the ``slurm`` option is also available
for some use cases. The behavior of each option is described below:

- ``slurm_static``: This option assigns the compute resources when submitting
  the SLURM job based on the parameters passed to the sbatch command or read
  from the script header. Rapthor will use these pre-allocated nodes for all
  downstream workflows during the run. This is the recommended way of running
  rapthor on multiple nodes.

  .. warning::

     When using this option the parameters in the ``[cluster]`` section of the
     parset file pertaining to node resource allocation (e.g., ``max_cores``,
     ``mem_per_node_gb``, ``cpus_per_task``) are ignored since the resources are
     allocated according to the sbatch parameters at the time the script is
     submitted.


- ``slurm``: This option uses Toil's SLURM batch system to dynamically allocate
  and release compute nodes as needed during the rapthor run. The partition that
  Toil uses to allocate nodes can be specified by setting the
  ``SALLOC_PARTITION`` environmental variable.

  .. warning::

     Ensure you match the ``max_cores`` and ``max_threads`` to the nodes on the
     partition(s) you specify in your SLURM script -- if you specify more cores
     than are available rapthor will fail to run.
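
The mismatch described in the warning above can be caught before submitting a
job. The snippet below is a minimal sketch of such a pre-flight check, not part
of rapthor or ska-sdp-ical: the function names ``cpus_per_node`` and
``check_max_cores`` are hypothetical, and it assumes the standard SLURM
``sinfo`` command is available on the headnode.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical pre-flight helpers; not part of rapthor or ska-sdp-ical.

   # Print the CPU count per node that SLURM reports for a partition.
   # "%c" is sinfo's CPUs-per-node field; a trailing "+" (shown when nodes
   # in the partition differ) is stripped so only the minimum value remains.
   cpus_per_node() {
       sinfo --noheader --partition="$1" --format="%c" | head -n 1 | tr -d '+ '
   }

   # Return non-zero when the requested core count exceeds what a node offers.
   check_max_cores() {
       local requested="$1" available="$2"
       if [ "$requested" -gt "$available" ]; then
           echo "max_cores=${requested} exceeds ${available} CPUs per node" >&2
           return 1
       fi
       echo "max_cores=${requested} fits within ${available} CPUs per node"
   }

After sourcing the snippet, a call such as
``check_max_cores 192 "$(cpus_per_node "$SALLOC_PARTITION")"`` compares the
``max_cores`` value from your parset against the partition Toil will allocate
from. The same comparison can be repeated for ``max_threads``.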