.. _running_skao:

User Guide
==========

The recommended way to run rapthor on the SKAO AWS development cluster is to use the rapthor spack
module that is pre-installed (you can see details of the spack package `here
<https://gitlab.com/ska-telescope/sdp/ska-sdp-spack/-/blob/main/packages/py-rapthor/package.py>`_).
Loading this module also loads all of rapthor's dependencies, including ``wsclean`` and ``dp3``.

.. code-block:: console

    $ module load ska-sdp-spack
    $ module load py-rapthor

Rapthor is now ready to run.

.. note::

    We recommend running rapthor as a SLURM job submitted from the headnode. Example SLURM scripts
    that set up the required environment variables, then run and benchmark rapthor using SKA tools,
    are available in the ``scripts/`` directory.

.. _starting_rapthor_skao:

Starting a Rapthor run interactively
------------------------------------

If you want to run rapthor directly from the command line (for example for testing or debugging),
log into a compute node first.

.. important::

    Do not run rapthor on the headnode: it does not have sufficient resources for
    compute-intensive jobs. Launching rapthor on the headnode will likely crash the cluster
    and disrupt other users.

To log into a compute node interactively, run the following command from the headnode:

.. code-block:: console

    $ srun --nodes=1 --partition=any-7i-24xl-spt --cpus-per-task=96 --ntasks-per-node=1 --time=8:00:00 --pty bash -i

Rapthor can then be run from the command line using:

.. code-block:: console

    $ rapthor rapthor.parset

where ``rapthor.parset`` is the parset file that configures the run (see `rapthor documentation
for details <https://rapthor.readthedocs.io/en/latest/parset.html>`_).

.. warning::

    Rapthor attempts to resume from a previous state if output files from a previous run are left in
    the working directory (see `resuming a rapthor run
    <https://rapthor.readthedocs.io/en/latest/running.html#resuming-an-interrupted-run>`_). This
    means that changes to your parset may not be respected unless you remove or rename the previous
    output folder and delete the contents of your scratch/temporary directories.
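
One way to make sure a new run starts cleanly is to move the previous working directory aside before
relaunching. The sketch below is illustrative only: the helper name ``archive_previous_run`` and the
timestamp suffix are assumptions, not part of rapthor or the ICAL scripts, and scratch directories
still need to be cleared separately.

.. code-block:: bash

    #!/usr/bin/env bash
    # Hypothetical helper: rename an existing working directory so that the
    # next rapthor run starts from scratch instead of resuming.
    archive_previous_run() {
        local work_dir="$1"
        # Nothing to do if there is no previous output.
        [ -d "$work_dir" ] || return 0
        local backup="${work_dir%/}.$(date +%Y%m%d-%H%M%S).bak"
        mv -- "$work_dir" "$backup"
        # Print the new location so the caller can inspect or delete it later.
        echo "$backup"
    }

Calling ``archive_previous_run /path/to/work_dir`` before ``rapthor rapthor.parset`` keeps the old
outputs for inspection while leaving nothing behind for rapthor to resume from.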

.. _running_ical_slurm:

Running the ska-sdp-ical pipeline scripts
-----------------------------------------

The recommended method of running Rapthor on the SKAO cluster is to submit a SLURM job from the
headnode. The ICAL pipeline repo contains an example SLURM script in `scripts/run_ical.sbatch
<https://gitlab.com/ska-telescope/sdp/science-pipeline-workflows/ska-sdp-ical/-/blob/main/scripts/run_ical.sbatch>`_,
as well as accompanying parset and strategy files in the `config/
<https://gitlab.com/ska-telescope/sdp/science-pipeline-workflows/ska-sdp-ical/-/tree/main/config?ref_type=heads>`_
folder.

Changing the ICAL / rapthor version
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the `run_ical.sbatch
<https://gitlab.com/ska-telescope/sdp/science-pipeline-workflows/ska-sdp-ical/-/blob/main/scripts/run_ical.sbatch>`_
script, there are a few environment variables you may wish to adjust to control the version of
rapthor and its dependencies.

.. list-table:: Environment Variables controlling pipeline installation
    :widths: 20 80
    :header-rows: 1

    - - Variable
      - Description
    - - ``SPACK_TAG``
      - The ska-sdp-spack release to use, e.g., ``2025.12.4``. If this is not set, and
        ``RAPTHOR_PATH`` is not set either, the latest ska-sdp-spack deployment will be used.
    - - ``RAPTHOR_PATH``
      - Path to rapthor repository for development installs. If this variable is set, rapthor will
        be installed from this path. If not set, rapthor will be loaded from the ska-sdp-spack
        module as described above.
    - - ``RAPTHOR_BRANCH``
      - Branch of rapthor repo to install from if ``RAPTHOR_PATH`` is set. If not provided, the
        ``master`` branch is used. This variable is ignored if ``RAPTHOR_PATH`` is not set.
    - - ``RAPTHOR_VENV``
      - Path to virtual environment to use for rapthor. If ``RAPTHOR_VENV`` is set and points to an
        existing virtual environment, this environment will be used and the installation will be
        skipped. If the path does not exist, a new virtual environment will be created at this
        location and used for the installation.

Changing the input and output locations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One important way in which the ICAL pipeline scripts differ from running rapthor directly is how the
parset file is used. ICAL uses a template parset file, which contains placeholders for environment
variables that you can define in your SLURM script instead of hard-coding paths in the parset file.
Before rapthor is launched, the parset file is created by substituting the environment variables
into the template parset file. This enables easier testing against different configurations without
needing to maintain multiple parset files. The location of the template parset file can be
controlled by setting the ``PARSET_PATH`` environment variable in your SLURM script. The resulting
parset is written to the working directory defined by the ``WORK_PATH`` variable under the name
``rapthor.parset``.
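
The substitution step can be pictured with the following minimal sketch. It assumes ``${NAME}``-style
placeholders and a fixed list of variables; the helper ``render_parset``, the template contents and
the example values are hypothetical, and the actual ICAL script may implement this differently.

.. code-block:: bash

    #!/usr/bin/env bash
    # Hypothetical helper: fill ${NAME} placeholders in a template parset with
    # the values of the matching environment variables (empty if unset).
    render_parset() {
        local template="$1" output="$2" name placeholder content
        content=$(<"$template")
        for name in INPUT_MS_FULL_PATH WORK_PATH STRATEGY_PATH TMPDIR; do
            placeholder="\${$name}"
            content=${content//"$placeholder"/${!name}}
        done
        printf '%s\n' "$content" > "$output"
    }

    # Demonstration in a throwaway directory with illustrative values.
    demo_dir=$(mktemp -d)
    cat > "$demo_dir/template.parset" <<'EOF'
    [global]
    dir_working = ${WORK_PATH}
    input_ms = ${INPUT_MS_FULL_PATH}
    strategy = ${STRATEGY_PATH}

    [cluster]
    local_scratch_dir = ${TMPDIR}
    global_scratch_dir = ${TMPDIR}
    EOF

    export WORK_PATH="$demo_dir/work" INPUT_MS_FULL_PATH=/data/input.ms
    export STRATEGY_PATH=/config/strategy.py TMPDIR="$demo_dir/tmp"
    mkdir -p "$WORK_PATH" "$TMPDIR"
    render_parset "$demo_dir/template.parset" "$WORK_PATH/rapthor.parset"
    cat "$WORK_PATH/rapthor.parset"

Because the values come from the environment, the same template can serve many runs that differ only
in the variables exported to the job.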

The location of the input data set and strategy file, as well as the temporary directories for your
run can all be configured through environment variables.

.. note::

    If you change the ``PARSET_PATH`` variable to point to your own custom parset file that does
    not include environment variable placeholders (i.e. is not a template file), the environment
    variables you set in your SLURM script are ignored.

.. list-table:: Environment Variables controlling data input and output locations
    :widths: 20 60 20
    :header-rows: 1

    - - Variable
      - Description
      - Default
    - - ``CODE_PATH``
      - Path to the ska-sdp-ical repository containing the configuration files and scripts.
      - ``$HOME/repos/ska-sdp-ical``
    - - ``OUTPUT_ROOT``
      - Root output folder for all pipeline runs. This should be a target on the shared FSx storage,
        rather than in /home which has limited space.
      - ``/shared/fsx1/shared/ical-runs/$USER``
    - - ``WORK_PATH``
      - Directory where pipeline outputs will be written. Generated using job name and job ID if not
        provided.
      - ``$OUTPUT_ROOT/${SLURM_JOB_NAME}-${SLURM_JOB_ID}/work``
    - - ``INPUT_MS_FULL_PATH``
      - Path to the input measurement set to process.
      - ``/shared/fsx1/shared/ical_workshop/pi28/data/visibility.scan-400_applybeam.ms``
    - - ``PARSET_PATH``
      - Path to the template configuration file (parset) containing all required paths and
        parameters.
      - ``$CODE_PATH/config/example_template.parset``
    - - ``STRATEGY_PATH``
      - Path to the strategy file defining the imaging and calibration strategy for the run.
      - ``$CODE_PATH/config/example_strategy.py``
    - - ``SKYMODEL_PATH``
      - Path to the input skymodel to use for calibration and/or subtraction. This should be a
        skymodel that is appropriate for the input data and science case you are testing.
      - ``None``
    - - ``TMPDIR``
      - Temporary directory used by toil for intermediate files created during the run. This will be
        used to set ``local_scratch_dir`` and ``global_scratch_dir`` in the parset file.
      - ``$WORK_PATH/tmp``
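
The defaults in the table build on one another, which the following sketch illustrates with shell
parameter expansion. The function name ``resolve_paths`` is hypothetical and the real script may
compute these values differently; the stand-in job name and ID are only there so the example can run
outside a SLURM job.

.. code-block:: bash

    #!/usr/bin/env bash
    # Hypothetical sketch: apply the documented defaults, keeping any value
    # that the user has already exported.
    resolve_paths() {
        CODE_PATH="${CODE_PATH:-$HOME/repos/ska-sdp-ical}"
        OUTPUT_ROOT="${OUTPUT_ROOT:-/shared/fsx1/shared/ical-runs/$USER}"
        WORK_PATH="${WORK_PATH:-$OUTPUT_ROOT/${SLURM_JOB_NAME}-${SLURM_JOB_ID}/work}"
        PARSET_PATH="${PARSET_PATH:-$CODE_PATH/config/example_template.parset}"
        STRATEGY_PATH="${STRATEGY_PATH:-$CODE_PATH/config/example_strategy.py}"
        TMPDIR="${TMPDIR:-$WORK_PATH/tmp}"
    }

    # Run in a subshell so the calling shell's variables are left untouched.
    (
        SLURM_JOB_NAME="${SLURM_JOB_NAME:-my-ical-run}"
        SLURM_JOB_ID="${SLURM_JOB_ID:-12345}"
        resolve_paths
        echo "WORK_PATH=$WORK_PATH"
        echo "TMPDIR=$TMPDIR"
    )

Setting only ``OUTPUT_ROOT`` is therefore enough to relocate the working and temporary directories
together.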

.. warning::

    Due to storage limits on the default ``/tmp`` directory on AWS, it is best to create a new
    temporary folder on the shared ``/shared/fsx1`` directory. This is necessary because rapthor
    uses toil/cwl to create intermediate files in ``TMPDIR`` during the run, which may exceed the
    available space on ``/tmp`` for long-running jobs.

.. note::

    The filter_skymodel step will always set ``/tmp`` as the temporary directory. This is a
    workaround for socket file paths having a character limit (107 bytes on Unix systems), which
    causes issues with long path names during multiprocessing (used by pybdsf). Toil creates path
    names for temporary storage files using random hexadecimal strings, so when the base location
    given by ``global_scratch_dir`` and ``local_scratch_dir`` is already long, the full path can
    exceed this limit, resulting in errors.
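
To judge whether a scratch location is at risk, you can compare its length with the 107-byte limit.
The sketch below is a hypothetical check, not part of the pipeline: the helper ``scratch_path_ok``
and the 60-byte margin reserved for the sub-directories and socket name are assumed figures that you
should adapt to what you observe in your own runs.

.. code-block:: bash

    #!/usr/bin/env bash
    # Hypothetical check: succeed only if the base path plus an assumed margin
    # for generated sub-directories stays within the 107-byte socket limit.
    scratch_path_ok() {
        local path="$1" margin="${2:-60}" limit=107
        local LC_ALL=C  # make ${#path} count bytes rather than characters
        [ $(( ${#path} + margin )) -le "$limit" ]
    }

    for dir in /tmp "/shared/fsx1/shared/ical-runs/$USER/my-ical-run-123456/work/tmp"; do
        if scratch_path_ok "$dir"; then
            echo "ok:       $dir"
        else
            echo "too long: $dir"
        fi
    done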

Once you are satisfied with the configuration, submit the script as a SLURM job using `sbatch
<https://slurm.schedmd.com/sbatch.html>`_. You may specify the job name and partition, and any other
SLURM parameters you wish. The command below will allocate a single compute node and run all
workflows on this node.

.. code-block:: console

    $ sbatch --job-name=my-ical-run \
        --partition=any-7i-48xl-spt --nodes=1 --cpus-per-task=192 \
        run_ical.sbatch

You may also adjust any of the environment variables described above to control the run by
passing them to SLURM using the ``--export`` argument.

.. code-block:: console

    $ sbatch --job-name=my-ical-run \
        --partition=any-7i-48xl-spt --nodes=1 --cpus-per-task=192 \
        --export=PARSET_PATH=/path/to/your/rapthor.parset,WORK_PATH=/path/to/your/work_dir \
        run_ical.sbatch

You may also keep your environment variables in a separate file and pass this file to SLURM using
the ``--export-file`` argument. For example, consider a file ``ical_test_config.sh`` containing:

.. code-block:: bash

    CODE_PATH=/home/user/repos/ska-sdp-ical
    WORK_PATH=/shared/fsx1/user/work/pipelines/ical/test_run
    PARSET_PATH=/path/to/my/rapthor.parset
    STRATEGY_PATH=/path/to/my/strategy.py
    TMPDIR=/dev/shm

Note that this file is interpreted as plain text containing key-value pairs and does not support
variable expansion or other shell features. You may pass this configuration file to SLURM when you
submit the job as follows:

.. code-block:: console

    $ sbatch --job-name=my-ical-run \
        --partition=any-7i-48xl-spt --nodes=1 --cpus-per-task=192 \
        --export-file=ical_test_config.sh \
        run_ical.sbatch
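
Since the configuration file is not processed by a shell, it can be useful to check it before
submitting. The following is a hypothetical pre-flight check (the function ``check_export_file`` is
not part of SLURM or the ICAL scripts) that flags lines which are not plain ``NAME=value`` pairs or
which appear to rely on shell expansion.

.. code-block:: bash

    #!/usr/bin/env bash
    # Hypothetical pre-flight check for a configuration file: every non-empty
    # line must be NAME=value and must not rely on shell expansion.
    check_export_file() {
        local file="$1" line status=0
        while IFS= read -r line || [ -n "$line" ]; do
            [ -z "$line" ] && continue
            if ! [[ "$line" =~ ^[A-Za-z_][A-Za-z0-9_]*=.*$ ]]; then
                echo "not a NAME=value pair: $line" >&2
                status=1
            elif [[ "$line" == *'$'* || "$line" == *'`'* ]]; then
                echo "shell expansion will not be applied: $line" >&2
                status=1
            fi
        done < "$file"
        return "$status"
    }

For example, ``check_export_file ical_test_config.sh && sbatch ... run_ical.sbatch`` only submits the
job when the file passes the check.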

Running rapthor on multiple nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To run rapthor on multiple nodes, pass the ``--nodes`` argument to the ``sbatch`` command. The
pipeline script picks up this number from SLURM and sets the environment variable
``RAPTHOR_BATCH_SYSTEM`` to ``slurm_static`` for multi-node runs, or ``single_machine`` for
single-node runs. The ``slurm_static`` option assigns the compute resources when the SLURM job is
submitted, based on the parameters passed to the ``sbatch`` command or read from the script header.
Rapthor then uses these pre-allocated nodes for all downstream workflows during the run.

.. code-block:: console

    $ sbatch --job-name=ical-multi-node-run \
        --partition=any-7i-48xl-spt --nodes=4 --cpus-per-task=192 \
        run_ical.sbatch
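
The selection rule described above can be sketched as follows. This is an assumption about the
logic, not a copy of ``run_ical.sbatch``: the function ``select_batch_system`` is hypothetical, and
it reads the node count from ``SLURM_JOB_NUM_NODES``, which SLURM sets inside a job.

.. code-block:: bash

    #!/usr/bin/env bash
    # Hypothetical sketch: choose the batch system from the number of nodes
    # allocated to the job (defaults to 1 outside a SLURM job).
    select_batch_system() {
        local nodes="${1:-${SLURM_JOB_NUM_NODES:-1}}"
        if [ "$nodes" -gt 1 ]; then
            echo "slurm_static"
        else
            echo "single_machine"
        fi
    }

    RAPTHOR_BATCH_SYSTEM=$(select_batch_system)
    echo "RAPTHOR_BATCH_SYSTEM=$RAPTHOR_BATCH_SYSTEM"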

Benchmarking your runs
~~~~~~~~~~~~~~~~~~~~~~

To activate resource monitoring via `benchmon
<https://gitlab.com/ska-telescope/sdp/ska-sdp-benchmark-monitor>`_, set the ``REPORT_PATH``
environment variable in the SLURM script to the directory where the benchmarking reports should be
written. If this variable is not set (the default), no benchmarking is performed.

.. list-table:: Environment Variables controlling benchmark monitoring
    :widths: 20 80
    :header-rows: 1

    - - Variable
      - Description
    - - ``REPORT_PATH``
      - Specify directory for benchmarking monitoring output. If this variable is defined,
        benchmarking will be enabled. If not defined, no benchmarking will be done.
    - - ``BENCHMON_PARAMS``
      - Optionally specify parameters to pass to benchmon for monitoring. Defaults to
        ``"--sys --sys-freq 1 --call --call-prof-freq 1 --save-dir $REPORT_PATH"``.

.. note::

    When benchmarking rapthor, you must use the ``slurm_static`` batch system, since it is
    currently not possible to monitor dynamically allocated nodes.

.. note::

    It is recommended to use the "metal" partitions when benchmarking rapthor runs.

The following command will submit a multi-node rapthor run with benchmarking enabled:

.. code-block:: console

    $ sbatch --job-name=ical-multi-node-run \
        --partition=any-7i-metal-48xl-spt --nodes=4 --cpus-per-task=192 \
        --export=WORK_PATH=/path/to/work_dir,REPORT_PATH=/path/to/work_dir/benchmarks \
        run_ical.sbatch
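
How the two benchmarking variables interact can be sketched as below. The function
``benchmon_args`` is hypothetical and only mirrors the behaviour documented in the table; it does
not show how the pipeline script actually invokes benchmon.

.. code-block:: bash

    #!/usr/bin/env bash
    # Hypothetical sketch: print the benchmon parameters to use, or fail when
    # REPORT_PATH is unset and benchmarking is therefore disabled.
    benchmon_args() {
        [ -n "${REPORT_PATH:-}" ] || return 1
        local default="--sys --sys-freq 1 --call --call-prof-freq 1 --save-dir $REPORT_PATH"
        # A user-supplied BENCHMON_PARAMS replaces the default entirely.
        printf '%s\n' "${BENCHMON_PARAMS:-$default}"
    }

    if args=$(benchmon_args); then
        echo "benchmarking enabled with: $args"
    else
        echo "benchmarking disabled"
    fi

Note that in this sketch a custom ``BENCHMON_PARAMS`` replaces the default string, so it would need
to include its own ``--save-dir`` value.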

Known issues
------------

- When using ``batch_system = slurm``, the "leader" node will be idle for most of the rapthor run.
  Toil uses this node to orchestrate the allocation of other nodes. A further node will be idle
  during imaging steps if mpi is enabled since this node is only used to allocate additional nodes
  for ``wsclean-mp``. Furthermore, reduced node availability during a run can result in delays in
  executing downstream workflows.
- "Argument list too long" errors can occur when the environment is too large. If this situation
  happens, removing variables like ``CMAKE_PREFIX_PATH``, which are not used at run time, may help
  avoid these errors.

  .. code-block:: console

      $ export ACLOCAL_PATH=
      $ export CMAKE_PREFIX_PATH=
      $ export MANPATH=
      $ export PKG_CONFIG_PATH=

  Rationale: Since Rapthor has many (in)direct dependencies, and each dependency adds itself to
  environment variables like ``PATH`` and ``PYTHONPATH``, the size of the environment grows
  considerably when loading Rapthor using either ``module load py-rapthor`` or ``spack load
  py-rapthor``.
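
To see how close the environment is to the limit, you can compare its size with the value of
``ARG_MAX`` and list the largest variables. The helper names below are hypothetical; ``env``,
``getconf`` and ``awk`` are standard tools.

.. code-block:: bash

    #!/usr/bin/env bash
    # Total size of the current environment in bytes.
    env_size_bytes() {
        env | wc -c | tr -d ' '
    }

    # Largest variables as "length name" pairs (values are not printed).
    largest_env_vars() {
        env | awk -F= '{ print length($0), $1 }' | sort -rn | head -n "${1:-5}"
    }

    echo "environment: $(env_size_bytes) bytes, ARG_MAX: $(getconf ARG_MAX) bytes"
    largest_env_vars

Running this before and after ``module load py-rapthor`` shows which variables grow the most and are
therefore the best candidates to clear.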

Notes on resource allocation
----------------------------

There are currently two resource allocation strategies available to distribute rapthor workflows
across multiple compute nodes on the cluster. These are selected with the ``batch_system`` parameter
in the parset file. If you are using the template parset file from the ical repository, this
option is automatically set to either ``slurm_static`` or ``single_machine`` based on the
number of allocated nodes. However, the ``slurm`` option is also available for some use cases. The
behavior of each option is described below:

- ``slurm_static``: This option assigns the compute resources when submitting the SLURM job based on
  the parameters passed to the sbatch command or read from the script header. Rapthor will use these
  pre-allocated nodes for all downstream workflows during the run. This is the recommended way of
  running rapthor on multiple nodes.

  .. warning::

      When using this option the parameters in the ``[cluster]`` section of the parset file
      pertaining to node resource allocation (e.g., ``max_cores``, ``mem_per_node_gb``,
      ``cpus_per_task``) are ignored since the resources are allocated according to the sbatch
      parameters at the time the script is submitted.

- ``slurm``: This option uses Toil's SLURM batch system to dynamically allocate and release compute
  nodes as needed during the rapthor run. The partition that Toil uses to allocate nodes can be
  specified by setting the ``SALLOC_PARTITION`` environment variable.

  .. warning::

      Ensure that ``max_cores`` and ``max_threads`` match the nodes on the partition(s) you
      specify in your SLURM script. If you specify more cores than are available, rapthor will fail
      to run.