.. _aws:

**************
Running on AWS
**************

Follow the instructions below to process a CASA Measurement Set with the batch pre-processing pipeline
(BPP) on the AWS DP cluster.

Prerequisites
=============

- Access to the AWS cluster and basic proficiency in its usage: see `SDP Pipelines Cookbook <https://developer.skao.int/projects/ska-sdp-docs/en/latest/index.html>`_. 
- BPP tutorial (:ref:`quickstart`).
- A CASA Measurement Set on the AWS cluster.

Estimated time
==============

About 10 minutes, excluding queue wait time and processing time.

Steps
=====

Log into the AWS DP cluster head node and follow the instructions below.

**1. Create a personal directory on the Lustre partition**

    Large-scale processing must be done on the shared Lustre partition at ``/shared/fsx1``.
    Create a personal directory there if you have not done so already:

    .. code-block:: bash

        mkdir -p /shared/fsx1/<MY_USER_NAME>
        cd /shared/fsx1/<MY_USER_NAME>

**2. Create a working directory for the pipeline run**

    Once inside your personal directory:

    .. code-block:: bash

        mkdir bpp_aws_tutorial
        cd bpp_aws_tutorial

**3. Write a configuration file**

    The pre-processing steps to apply are defined via a YAML configuration file.
    Copy the following into a file named ``config.yaml`` in the current directory.
    This example keeps things simple: it flags a range of observing frequencies and
    averages the data.

    .. note::

        Adjust the flagged frequency range to suit your dataset.

    .. code-block:: yaml

        steps:
            # Flag the 150.00 – 155.42 MHz band
            - step: preflagger
              frequency_ranges_mhz:
                - {start: 150.00, stop: 155.42}
            # Average visibilities in time and frequency by integer factors
            - step: averager
              timestep: 4
              freqstep: 4

**4. Get the BPP SLURM script**

    Download the latest `template SLURM script provided in the BPP repository <https://gitlab.com/ska-telescope/sdp/science-pipeline-workflows/ska-sdp-batch-preprocess/-/blob/main/scripts/user/aws_bpp.sh>`_ into the current directory:

    .. code-block:: bash

        wget https://gitlab.com/ska-telescope/sdp/science-pipeline-workflows/ska-sdp-batch-preprocess/-/raw/main/scripts/user/aws_bpp.sh

**5. Submit the SLURM job to run the pipeline**

    The SLURM script takes its arguments as environment variables;
    see the documentation inside the script for details.
    Assuming the input is at ``/path/to/my_dataset.ms``, you can submit the job as follows:

    .. code-block:: bash

        sbatch --nodes=1 --time=02:00:00 \
            --export=ALL,WORKERS_PER_NODE=1,DATASET=/path/to/my_dataset.ms \
            aws_bpp.sh
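
    A mistyped path otherwise only surfaces after the queue wait, so it can help to
    check the inputs before submitting. The following is a minimal sketch, not part of
    BPP: ``check_inputs`` is a hypothetical helper name, and the check assumes that the
    script picks up ``config.yaml`` from the submission directory. It relies on the fact
    that a Measurement Set is a directory. The ``--parsable`` option of ``sbatch`` is
    used to capture the Job ID for the later steps.

    .. code-block:: bash

        # Hypothetical helper (not part of BPP): verify inputs before calling sbatch.
        check_inputs() {
            local dataset="$1"
            local config="${2:-config.yaml}"
            # A Measurement Set is a directory, hence the -d test
            if [ ! -d "$dataset" ]; then
                echo "Measurement Set not found: $dataset" >&2
                return 1
            fi
            if [ ! -f "$config" ]; then
                echo "Configuration file not found: $config" >&2
                return 1
            fi
            echo "Inputs OK"
        }

        DATASET=/path/to/my_dataset.ms
        if check_inputs "$DATASET"; then
            # --parsable makes sbatch print only the Job ID
            # (followed by the cluster name on multi-cluster setups)
            JOB_ID=$(sbatch --parsable --nodes=1 --time=02:00:00 \
                --export=ALL,WORKERS_PER_NODE=1,DATASET="$DATASET" \
                aws_bpp.sh)
            echo "Submitted job $JOB_ID"
        fi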


**6. Watch the job progress in the queue**

    .. code-block:: bash

        watch -n5 squeue --me

    You should see a Job ID and an "ST" column that denotes the state of the job.

    - "CF" means "configuring", i.e. a node is being prepared for the run
    - "R" means "running"
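
    In a scripted workflow, polling ``squeue`` may be more convenient than ``watch``.
    The sketch below defines a hypothetical helper, ``wait_for_job``, that prints the
    job state at a fixed interval and returns once the job is no longer listed. It
    assumes that an empty or failed ``squeue`` query means the job has left the queue,
    and it does not tell you whether the job succeeded.

    .. code-block:: bash

        # Hypothetical helper: block until the given job leaves the queue.
        # Arguments: Job ID, polling interval in seconds (default 5).
        wait_for_job() {
            local job_id="$1"
            local interval="${2:-5}"
            local state
            while true; do
                # -h: no header, -j: select job, -o %T: print the full state name
                state=$(squeue -h -j "$job_id" -o %T 2>/dev/null || true)
                if [ -z "$state" ]; then
                    echo "Job $job_id is no longer in the queue"
                    return 0
                fi
                echo "Job $job_id: $state"
                sleep "$interval"
            done
        }

        # Usage: wait_for_job 4173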


**7. Check that the pipeline is running**

    Soon after the job starts running, three items should appear: a base output directory,
    a standard output file and a standard error file, each carrying the Job ID in its name.
    Assuming you got Job ID 4173:

    .. code-block:: bash

        $ ls
        bpp_4173
        bpp.4173.out
        bpp.4173.err

    You can monitor the pipeline run progress by watching the pipeline log file:

    .. code-block:: bash

        watch -n5 tail -n40 bpp_4173/output/pipeline.log


**8. Check that the pipeline ran successfully**

    Once the run completes, the output directory should contain the following:

    .. code-block:: bash

        $ ls bpp_4173/output
        config.yaml
        dask-report.html
        my_dataset_flagging_report.png
        my_dataset_flagging_report.zarr
        my_dataset.ms
        pipeline.log
        task-list.json

    That is:

    - The pre-processed visibilities as a CASA Measurement Set with the same name as the input
    - An RFI flagging report as an ``xarray`` Dataset object, with a ``.zarr`` extension
    - A summary plot of the RFI flagging report with a ``.png`` extension
    - A copy of the configuration file used for the run
    - Logs of the pipeline
    - Additional diagnostic outputs related to the execution engine Dask
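
    The same check can be scripted. The following sketch uses a hypothetical helper,
    ``check_outputs``, which takes the output directory and the dataset name without
    its ``.ms`` extension, and reports any item from the listing above that is missing.
    It only tests that the items exist; it does not inspect their contents.

    .. code-block:: bash

        # Hypothetical helper: report expected pipeline outputs that are missing.
        # Arguments: output directory, dataset name without the .ms extension.
        check_outputs() {
            local outdir="$1"
            local name="$2"
            local missing=0
            local item
            for item in \
                config.yaml \
                dask-report.html \
                "${name}_flagging_report.png" \
                "${name}_flagging_report.zarr" \
                "${name}.ms" \
                pipeline.log \
                task-list.json
            do
                if [ ! -e "$outdir/$item" ]; then
                    echo "Missing: $item"
                    missing=$((missing + 1))
                fi
            done
            # Succeed only if nothing is missing
            [ "$missing" -eq 0 ]
        }

        # Usage: check_outputs bpp_4173/output my_dataset && echo "All outputs present"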

Next steps
==========

To go further, you may want to:

- Read the :ref:`configuration` and learn how to use more advanced steps.
- Learn about the internals of the code: start from :ref:`pipeline_intro`.
- Use frequency chunking with a more compute-heavy configuration that includes the AOFlagger or Demixer steps.
  This will justify the use of more nodes and/or workers per node.