How to monitor a pipeline job on AWS
====================================

This guide describes how to monitor a pipeline job running on the AWS DP HPC cluster using the SLURM
CLI.

Related
-------

- :doc:`How to run a pipeline on AWS <aws-run-pipeline-slurm>`
- `CIMG pipeline can be monitored using the Prefect UI
  <https://developer.skao.int/projects/ska-sdp-cimg/en/latest/usage/aws_slurm.html#monitoring>`_

Prerequisites
-------------

- An account on the AWS DP HPC cluster
- A SLURM job currently running on the cluster

Steps
-----

Use standard SLURM tools to monitor the submitted job while it is running.

1. Identify the job ID from ``squeue`` or the output of ``sbatch``:

       .. code-block:: bash

           squeue -u "$USER"

       To continuously monitor the queue status of your own jobs, use ``watch`` (here refreshing
       every second):

       .. code-block:: bash

           watch -n 1 squeue -u "$USER"

2. Inspect detailed job state and allocated resources:

       .. code-block:: bash

           scontrol show job <job_id>

3. Check accounting information (state, start/end time, exit code):

       .. code-block:: bash

           sacct -j <job_id> --format=JobID,JobName,Partition,AllocCPUS,State,ExitCode,Elapsed,Start,End

4. Follow the SLURM output log in real time. The log file might not exist until the job status
   changes from configuring (CF) to running (R), so wait for that transition if ``tail`` reports a
   missing file:

       .. code-block:: bash

           tail -f slurm-<job_id>.out

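       By default, SLURM writes the log to ``slurm-<job_id>.out`` in the directory from which the
       job was submitted. The filename is configurable via ``#SBATCH --output`` (or
       ``sbatch --output``). Check the relevant pipeline's SLURM script for the exact pattern, for
       example ``#SBATCH --output=slurm-%x-%j.out``, where ``%x`` expands to the job name and
       ``%j`` to the job ID.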

5. If you need to cancel the job, use the ``scancel`` command with the job ID:

       .. code-block:: bash

           scancel <job_id>

       To cancel all of your own jobs instead, pending and running alike:

       .. code-block:: bash

           scancel -u "$USER"
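
The polling in the steps above can also be scripted. The following is a minimal sketch, assuming that
SLURM accounting is enabled so that ``sacct`` reports the job state. The function names
``is_terminal_state`` and ``wait_for_job`` are hypothetical helpers introduced here for illustration;
they are not part of SLURM or of any pipeline.

.. code-block:: bash

    # Hypothetical helpers for illustration; not part of SLURM.

    # Return 0 if the given SLURM job state is one the job will not leave again.
    is_terminal_state() {
        case "$1" in
            COMPLETED|FAILED|CANCELLED*|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|BOOT_FAIL|DEADLINE)
                return 0 ;;
            *)
                return 1 ;;
        esac
    }

    # Poll sacct until the job reaches a terminal state, then print that state.
    # Usage: wait_for_job <job_id> [poll_interval_seconds]
    wait_for_job() {
        local job_id="$1" interval="${2:-30}" state
        while true; do
            # -X: allocation only (no job steps); -n: no header; -P: parsable output
            state="$(sacct -X -n -P -j "$job_id" --format=State | head -n 1)"
            if is_terminal_state "$state"; then
                echo "$state"
                return 0
            fi
            sleep "$interval"
        done
    }

After sourcing these functions, ``wait_for_job <job_id>`` blocks until the job ends and prints its
final state. The ``CANCELLED*`` pattern is there because ``sacct`` can report the state as
``CANCELLED by <uid>``. The default interval of 30 seconds is an arbitrary choice that keeps the load
on the scheduler low.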

Finishing up
------------

Once the SLURM job has finished, the data products will be available in the output directory. Check
your pipeline's documentation for the expected output location.
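
Before using the data products, it can be worth confirming that the job actually succeeded. The
sketch below assumes that SLURM accounting is enabled; ``job_succeeded`` is a hypothetical helper name
introduced here for illustration.

.. code-block:: bash

    # Hypothetical helper for illustration; not part of SLURM.
    # Return 0 only if sacct reports the job as COMPLETED with exit code 0:0.
    # Usage: job_succeeded <job_id>
    job_succeeded() {
        local job_id="$1" record
        # -X: allocation only (no job steps); -n: no header; -P: '|'-separated fields
        record="$(sacct -X -n -P -j "$job_id" --format=State,ExitCode | head -n 1)"
        [ "$record" = "COMPLETED|0:0" ]
    }

For example, ``job_succeeded <job_id> && echo "job finished cleanly"`` prints the message only when
both the state and the exit code indicate success. Any other result suggests checking the SLURM
output log before relying on the output directory.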