# SDP E2E Batch Pipeline Scripts

> This page describes the older way of running the end-to-end pipeline using slurm (bash) scripts.

The `ska_sdp_e2e_batch_continuum_imaging` package contains the batch pipeline (slurm/bash) scripts that run the individual SKA SDP batch pipelines and the SKA SDP batch end-to-end pipeline. This documentation gives general instructions for running the scripts on any infrastructure, HPC or non-HPC, with or without spack packages. For instructions specific to the **SKA DP HPC Cluster** (currently hosted on AWS), please visit the relevant [confluence page](https://confluence.skatelescope.org/x/gGmUEg).

## Structure of the package

1. **config**: Contains configuration files for all pipelines.
1. **data**: Contains some extra input data needed for the pipelines.
1. **env**: Contains scripts to set up the runtime environment for the pipeline scripts.
1. **packages**: Contains setup and config files used to install the pipelines in python virtual environments.
1. **scripts**: Contains all the pipeline scripts.

## Descriptions of the scripts

1. **cal_bpp.sh**: Runs the [Batch Preprocessing Pipeline](https://gitlab.com/ska-telescope/sdp/science-pipeline-workflows/ska-sdp-batch-preprocess) on the calibrator source.
1. **inst.sh**: Runs the [Instrumental Calibration Pipeline](https://gitlab.com/ska-telescope/sdp/science-pipeline-workflows/ska-sdp-instrumental-calibration) on the pre-processed calibrator source, i.e. on the output of `cal_bpp.sh`.
1. **target_bpp.sh**: Runs the [Batch Preprocessing Pipeline](https://gitlab.com/ska-telescope/sdp/science-pipeline-workflows/ska-sdp-batch-preprocess) on the target source.
1. **rapthor.sh**: Runs [Rapthor](https://git.astron.nl/RD/rapthor) on the pre-processed target source, i.e. on the output of `target_bpp.sh`.
1. **e2e.sh**: Runs the end-to-end batch continuum imaging pipeline. Internally, it executes the above scripts in sequential order.

The scripts use the [batchlet](https://developer.skao.int/projects/ska-sdp-exec-batchlet/en/latest/batchlet.html) tool to:
1. Set up the dask cluster for each sub-pipeline in its environment.
2. Optionally, run resource monitoring (CPU, memory) using [ska-sdp-benchmark-monitor](https://gitlab.com/ska-telescope/sdp/ska-sdp-benchmark-monitor/-/tree/main?ref_type=heads).

Even though the scripts inside the `./scripts` directory are primarily meant to run as slurm jobs, they can also be run as regular bash scripts. Therefore, *you can run all the pipelines locally, provided that*:
- You have set up the environment (spack or python virtual environment) needed to run the pipelines. For this, please read the [Running the scripts using spack](#running-the-scripts-using-spack-recommended) or [Local Virtual Environment Setup](#local-virtual-environment-setup) section.
- You have the necessary input data available locally (measurement sets, sky model files). For input data apart from measurement sets, please read the [Getting Input Data](#getting-input-data) section.
- (Optionally) You modify the `dask_params` section under the `batchlet` command in the scripts, according to your system specifications.

Of all the scripts, `e2e.sh` should run out of the box after sourcing one of the `.envrc` files, depending on the environment.
The `e2e.sh` script takes care of the dependencies between the outputs of the pipelines.
The other scripts may need extra environment variables, because some pipelines depend on the output of previous pipelines.
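
The dependency between the scripts can be summarised in a small helper. The sketch below is not part of the package: `upstream_input_var` is a hypothetical function name, and the mapping is taken from the variable tables in the [Configuring the scripts](#configuring-the-scripts) section.

```bash
# Hypothetical helper (not part of the package): print the name of the
# environment variable that carries the upstream output a script depends on.
upstream_input_var() {
    case "$1" in
        inst.sh)    echo "PRE_PROCESSED_CALIBRATOR" ;;  # output of cal_bpp.sh
        rapthor.sh) echo "PRE_PROCESSED_TARGET" ;;      # output of target_bpp.sh
        *)          echo "" ;;                          # no upstream output needed
    esac
}

# Example: report which variable must be exported before running inst.sh.
var_name="$(upstream_input_var inst.sh)"
if [ -n "${var_name}" ]; then
    echo "inst.sh expects ${var_name} to point at the upstream output"
fi
```

When running `e2e.sh`, neither variable needs to be set by hand.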

## Running the scripts using spack (recommended)

If you have `spack` on your machine, you can use it to install all the pipelines and the scripts, and then use the scripts directly. This ensures consistency across environments (i.e. you will have exactly the same packages as on SKA's staging / production infrastructure).

The spack packages for the end-to-end pipeline and all the sub-pipelines are part of the [ska-sdp-spack](https://gitlab.com/ska-telescope/sdp/ska-sdp-spack) repository. You can follow its [README](https://gitlab.com/ska-telescope/sdp/ska-sdp-spack/-/blob/main/README.md) to set up spack on your machine. Then install the e2e pipeline using this command:

```bash
spack install ska-sdp-e2e-batch-continuum-imaging
```

This will install all the scripts and default configuration files in the standard spack installation directories.
**This doesn't install any of the sub-pipelines.** To install them, please run

```bash
spack install py-ska-sdp-batch-preprocess \
    py-ska-sdp-instrumental-calibration \
    py-rapthor
```

Since the end-to-end scripts rely on [environment modules](https://modules.readthedocs.io/en/latest/) to update the runtime environment `*PATH` variables, you need to generate modulefiles from spack packages. For this, you need to run:

```bash
spack module tcl refresh -y
```

**NOTE:** You need to rerun the above command to regenerate the modulefiles whenever you reinstall any of the pipelines.

To ensure the modulefiles can be found via the `MODULEPATH` environment variable, please run:

```bash
source /path/to/spack/share/spack/setup-env.sh
```

Finally, to run any of the pipeline scripts present in this package, you have to run:

```bash
module load ska-sdp-e2e-batch-continuum-imaging
# to set up some of the default env vars
source ${E2E_SETUP_ROOT}/env/defaults.envrc

# script_name can be "cal_bpp.sh", "inst.sh" or "e2e.sh" etc.
bash <script_name>
```

You can use `sbatch` instead of `bash` to run the scripts as a slurm job on HPC infrastructure.

The only downside of running the scripts directly from `spack` is that it might be difficult to modify them (for example, due to permission issues in the spack installation directory). In such cases, we recommend that you [clone the scripts locally](#cloning-the-scripts-locally-and-running), customise them, and run them with either spack packages or local virtual environments. For spack, you can follow the above steps to install the pipelines, and then use them with your custom scripts.

## Cloning the scripts locally and running

We use git's `sparse-checkout` feature to clone only the necessary package from the `ska_sdp_e2e_batch_continuum_imaging` repository.
To download the package, run the following commands in a terminal:

```bash
FOLDER_NAME="e2e_scripts"
E2E_PACKAGE_PATH="src/ska_sdp_e2e_batch_continuum_imaging"
E2E_RELEASE_BRANCH="e2e-scripts-release"

mkdir $FOLDER_NAME
cd $FOLDER_NAME

git init -b $E2E_RELEASE_BRANCH
git remote add origin https://gitlab.com/ska-telescope/sdp/science-pipeline-workflows/ska-sdp-e2e-batch-continuum-imaging.git
git sparse-checkout init
git sparse-checkout set $E2E_PACKAGE_PATH
git pull --set-upstream origin $E2E_RELEASE_BRANCH

cd $E2E_PACKAGE_PATH

# IMPORTANT: This variable is used to locate the scripts and default config files
export E2E_SETUP_ROOT=$(pwd)
```

> If `sparse-checkout` doesn't work for any reason, you can simply ignore `git sparse-checkout *` commands in the above code snippet.

The above commands will:

* Create a new folder named `e2e_scripts` (the value of `FOLDER_NAME`) in your current working directory.
* Initialize a new git repository in the folder
* Add the `ska_sdp_e2e_batch_continuum_imaging` repository as a remote
* Configure the git repository to only download the `src/ska_sdp_e2e_batch_continuum_imaging` folder from the remote repository
* Download the necessary files from the remote repository
* Change the current directory to the `src/ska_sdp_e2e_batch_continuum_imaging` folder

The current folder is still tracked using git, so later you can run `git pull` to get the latest changes in the package.

Once you have cloned the repository using the above steps, you need to ensure that all the necessary pipelines are installed on your machine. You can either:

1. Use the spack-installed pipelines as explained in the [Running the scripts using spack](#running-the-scripts-using-spack-recommended) section.
2. Set up the pipelines using [local python virtual environments](#local-virtual-environment-setup).

Then, ensure that you have exported all the environment variables needed by the script that you want to run (please refer to the [configuration of scripts](#configuring-the-scripts) section).
Some of the environment variables can be set to default values by running the following command:

```bash
source ${E2E_SETUP_ROOT}/env/defaults.envrc
```

Finally, to run the scripts, use whichever of the following commands is relevant to your setup:

1. HPC (Slurm) infrastructure + Spack for pipelines

```bash
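# Update PATH if necessary, so that OpenMPI and Slurm binaries are available.
# Replace the placeholder directories with the ones on your system.
export PATH="/path/to/openmpi/bin:/path/to/slurm/bin:${PATH}"
# Replace `/path/to/spack` with your spack installation directory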
source /path/to/spack/share/spack/setup-env.sh

export ENVRC_FILE=${E2E_SETUP_ROOT}/env/spack.envrc

sbatch ${E2E_SETUP_ROOT}/scripts/<script_to_run>
```

2. Non-HPC infrastructure + Spack for pipelines

```bash
# Replace `/path/to/spack` with your spack installation directory
source /path/to/spack/share/spack/setup-env.sh

export ENVRC_FILE=${E2E_SETUP_ROOT}/env/spack.envrc

bash ${E2E_SETUP_ROOT}/scripts/<script_to_run>
```

3. Non-HPC infrastructure + Local virtual environments for packages

```bash
export ENVRC_FILE=${E2E_SETUP_ROOT}/env/local_venv.envrc

bash ${E2E_SETUP_ROOT}/scripts/<script_to_run>
```
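
The three setups above differ only in the launcher (`sbatch` or `bash`) and in the `ENVRC_FILE`. A minimal sketch of choosing both automatically is shown below. The helper names `pick_launcher` and `pick_envrc` are hypothetical and not part of the package, and detecting the setup from the presence of the `sbatch` and `spack` commands is an assumption that may not suit every system.

```bash
# Hypothetical helpers (not part of the package).

# Print "sbatch" when Slurm's sbatch command is available, otherwise "bash".
pick_launcher() {
    if command -v sbatch >/dev/null 2>&1; then
        echo "sbatch"
    else
        echo "bash"
    fi
}

# Print the envrc file matching the available tooling.
# $1: the package root, i.e. the value of E2E_SETUP_ROOT
pick_envrc() {
    local root="$1"
    if command -v spack >/dev/null 2>&1; then
        echo "${root}/env/spack.envrc"
    else
        echo "${root}/env/local_venv.envrc"
    fi
}

# Usage (with E2E_SETUP_ROOT exported as described above):
#   export ENVRC_FILE="$(pick_envrc "${E2E_SETUP_ROOT}")"
#   "$(pick_launcher)" "${E2E_SETUP_ROOT}/scripts/e2e.sh"
```

For the spack setups, you still need to source spack's `setup-env.sh` first, as in the commands above.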

By default all scripts will store output in the script's current working directory.
You can change that in multiple ways as shown below:

```bash
# 1. Set OUTPUT_DIR environment variable.
#    Works for both HPC and non-HPC environments
OUTPUT_DIR=/path/to/output/directory <sbatch/bash> ${E2E_SETUP_ROOT}/scripts/<script_to_run>

# 2. On HPC (Slurm), use `--chdir` option to `sbatch` command
#    to change runtime current working directory
sbatch --chdir=/path/to/output/directory ${E2E_SETUP_ROOT}/scripts/<script_to_run>
```
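
The documented behaviour of `OUTPUT_DIR` (default to the current working directory, create the directory if it is missing, prefer an absolute path) can be reproduced in your own wrapper as shown below. This is a sketch of the behaviour described on this page, not the scripts' actual code, and `resolve_output_dir` is a hypothetical helper name.

```bash
# Hypothetical helper: resolve an output directory as described on this page.
# Falls back to the current working directory when no value is given,
# creates the directory if needed, and prints its absolute path.
resolve_output_dir() {
    local dir="${1:-$(pwd)}"
    mkdir -p "${dir}" || return 1
    # cd inside a subshell so the caller's working directory is unchanged
    (cd "${dir}" && pwd)
}

# Usage:
#   export OUTPUT_DIR="$(resolve_output_dir "${OUTPUT_DIR:-}")"
```

Resolving the path before submitting a job avoids relative paths being interpreted differently on other nodes.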

## Configuring the scripts

The scripts are configured using environment variables, so in most cases you don't need to modify the scripts.

Each script first checks whether the necessary environment variables are set, and only then runs the actual pipeline.
These checks are done in the first few lines of each script.
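
A check of this kind can be written in a few lines of bash. The sketch below is illustrative only and is not the code used by the scripts. `require_env` is a hypothetical helper name, and the variable list in the usage comment is an assumption based on the tables in this section.

```bash
# Hypothetical helper: report every variable in the argument list that is
# unset or empty, and return non-zero if any is missing.
require_env() {
    local missing=0 name
    for name in "$@"; do
        # ${!name} is bash indirect expansion: the value of the variable
        # whose name is stored in $name
        if [ -z "${!name:-}" ]; then
            echo "Missing environment variable: ${name}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Usage, e.g. before running cal_bpp.sh:
#   require_env E2E_SETUP_ROOT ENVRC_FILE CALIBRATOR CAL_BPP_CONFIG || exit 1
```

Reporting all missing variables at once, rather than stopping at the first one, saves repeated attempts when several are unset.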

**You need to ensure that at least the variables needed by the script you wish to run are exported with correct values**.

To get started quickly, we have provided `defaults.envrc` inside the `./env` directory. Sourcing this file with `source ${E2E_SETUP_ROOT}/env/defaults.envrc` sets a few of the environment variables.
You still need to set the input data path variables (i.e. `CALIBRATOR` or `TARGET`).

The following tables describe the environment variables used by the scripts.

### Common variables

| Environment Variable | Description | Value in `defaults.envrc` |
| --- | --- | --- |
| E2E_SETUP_ROOT | The absolute path to `ska_sdp_e2e_batch_continuum_imaging` package which contains all the scripts and config files. | |
| ENVRC_FILE | Path to the .envrc file which sets up executables to run the pipelines. This file is specific to the infrastructure (AWS, Local, etc). By default it is present in the `${E2E_SETUP_ROOT}/env/` directory. | |
| OUTPUT_DIR | Path to the directory where script will write its output. Defaults to current working directory. Should be absolute path in order to avoid issues in distributed setup. If this value points to non-existent path, the directory will be created by the script before running the actual pipeline. | |

### Script specific variables

| Environment Variable | Description | Value in `defaults.envrc` |
| --- | --- | --- |
| **cal_bpp.sh** |
| CALIBRATOR | Path to the measurement set of the calibrator source. Should contain path to a single measurement set (MSv2). Multiple measurement sets are currently not supported. | |
| CAL_BPP_CONFIG | Path to the configuration yaml file for the Batch Preprocessing Pipeline, while processing the calibrator source. | `${E2E_SETUP_ROOT}/config/cal_bpp.yaml` |
| **inst.sh** |
| PRE_PROCESSED_CALIBRATOR  | Path to the output of the Batch Preprocessing Pipeline, run on the calibrator source. Should be a single MSv2 path. No need to set this variable for `e2e.sh`. | |
| CALIBRATOR_SKY_MODEL | Path to the sky model of the calibrator source in **OSKAR CSV** format. | |
| INST_CONFIG | Path to the configuration yaml file for the Instrumental Calibration pipeline. | `${E2E_SETUP_ROOT}/config/inst.yaml` |
| INST_CACHE_DIR | (optional) Path to cache directory to store intermediate zarr file when running INST pipeline | |
| **target_bpp.sh** |
| TARGET | Path to the measurement set of the target source. Should contain path to a single measurement set (MSv2). Multiple measurement sets are currently not supported. | |
| TARGET_BPP_CONFIG | Path to the configuration yaml file for the Batch Preprocessing Pipeline, while processing the target source. | `${E2E_SETUP_ROOT}/config/target_bpp.yaml` |
| **rapthor.sh** |
| PRE_PROCESSED_TARGET | Path to the output of the Batch Preprocessing Pipeline, run on the target source. Should be a single MSv2 path. No need to set this variable for `e2e.sh`. | |
| RAPTHOR_PARSET | Path to the parset file for rapthor pipeline | `${E2E_SETUP_ROOT}/config/rapthor_defaults_skalow.parset` |
| RAPTHOR_STRATEGY_FILE | Path to strategy file for rapthor pipeline. | `${E2E_SETUP_ROOT}/data/rapthor_custom_ska_low_strategy.py` |
| RAPTHOR_BATCH_SYSTEM | This will update the "batch_system" parameter in rapthor parset, if the corresponding placeholder is present in the parset. This has no effect if the input parset has no placeholders to replace. Can be either "single_machine" or "slurm". | "single_machine" |
| RAPTHOR_MAX_NODES | Similar to `RAPTHOR_BATCH_SYSTEM`, this will update the "max_nodes" param in rapthor parset, if such placeholder is present. | 1 |

### Other variables

| Environment Variable | Description | Value in `defaults.envrc` |
| --- | --- | --- |
| DP3_PREFIX | The prefix path of DP3 installation. This folder must contain "bin" folder, which contains "DP3" executable. This is needed only when you are running pipelines in a [local environment](#local-virtual-environment-setup). | |
| WSCLEAN_PREFIX | The prefix path of wsclean installation. This folder must contain "bin" folder, which contains "wsclean" executable. This is needed only when you are running pipelines in a [local environment](#local-virtual-environment-setup). | |

## Local Virtual Environment Setup

> NOTE: This setup will work for all pipelines **except Rapthor**. To install Rapthor, the `spack` package is preferred.

The `packages` directory contains python scripts which can be used to create a separate python virtual environment for each sub-pipeline (BPP, INST).
The directory contains 2 python files:

1. **config.py**: Defines the configuration of the environments (e.g. name of the environment, installed packages).

2. **setup.py**: Creates the python environments defined in `config.py`.

In order to run `setup.py`, you need to ensure that `python3.11` (or `python3.10`) is installed on your machine. This can be a system installed python, or from conda, or from any other source. This python will be used by `setup.py`, as the base python for newly created virtual environments.

To create the environments, simply run

```bash
python3.11 ./packages/setup.py # python3.10 should also work
```
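
If you are unsure which interpreter is available, the selection can be scripted. The following is a minimal sketch that assumes only the two versions named above. `pick_python` is a hypothetical helper, not part of the package.

```bash
# Hypothetical helper: print the first supported interpreter found on PATH.
pick_python() {
    local candidate
    for candidate in python3.11 python3.10; do
        if command -v "${candidate}" >/dev/null 2>&1; then
            echo "${candidate}"
            return 0
        fi
    done
    echo "Neither python3.11 nor python3.10 was found" >&2
    return 1
}

# Usage (from ${E2E_SETUP_ROOT}):
#   "$(pick_python)" ./packages/setup.py
```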

The environments will be created inside the `./packages/` directory. Logs of the last installation are also stored in the `./packages` directory for future reference.

Apart from the python environments, **there are a few other external dependencies:**

1. [DP3 @ 6.4](https://git.astron.nl/RD/DP3)

You have to install these dependencies yourself, for example using spack, a system package manager, or by building from source. Using [spack](https://spack.readthedocs.io/en/v0.23.1/) is the preferred way. Please refer to the README of the [ska-sdp-spack](https://gitlab.com/ska-telescope/sdp/ska-sdp-spack) repository for instructions on how to set up spack.

As also mentioned in the [configuring the scripts](#configuring-the-scripts) section, you need to export the following environment variables so that the pipelines can find the respective executables.

```bash
export DP3_PREFIX= # The prefix path of DP3 installation.
# Above folder must contain "bin" folder, which contains "DP3" executable.
```
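
Before launching a script, it can help to confirm that the prefix really contains the executable. The sketch below assumes nothing beyond the `<prefix>/bin/<executable>` layout described above. `check_prefix_tool` is a hypothetical helper name, not part of the package.

```bash
# Hypothetical helper: verify that <prefix>/bin/<tool> exists and is executable.
# $1: installation prefix, $2: executable name
check_prefix_tool() {
    local prefix="$1" tool="$2"
    if [ -x "${prefix}/bin/${tool}" ]; then
        return 0
    fi
    echo "${tool} not found or not executable at ${prefix}/bin/${tool}" >&2
    return 1
}

# Usage:
#   check_prefix_tool "${DP3_PREFIX}" DP3 || exit 1
```

The same check applies to `WSCLEAN_PREFIX` with the `wsclean` executable, as listed in the variables table.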

Once the local environment setup is done, follow the instructions in [running the scripts](#cloning-the-scripts-locally-and-running) section to run the scripts on your machine.

## Getting Input Data

This section explains how to get the input data, apart from the calibrator and target measurement sets.
This is needed if you wish to run the scripts on any infrastructure other than the SKA DP AWS HPC cluster.

Tip: You can use the `./data` folder to store the input data files.

### Calibrator source sky Model

The Instrumental Calibration Pipeline (INST) expects a sky model of the calibrator source as one of the inputs. This parameter is set in the configuration yaml file, in [predict_vis](https://developer.skao.int/projects/ska-sdp-instrumental-calibration/en/latest/stage_config.html#predict-vis) stage. This sky model must be in the **OSKAR CSV** format.

In the **inst.sh** script, this sky model path is passed as environment variable `CALIBRATOR_SKY_MODEL` (as specified in [configuring the scripts](#configuring-the-scripts)).

For simulated datasets, the sky models are provided along with the data. These can be downloaded from either:

1. The relevant [confluence page](https://confluence.skatelescope.org/display/SWSI/PI27+Low+G4+SDP+-+Dataset+and+Pipeline+parameters) for simulated datasets
2. The AWS S3 bucket where the simulated dataset is stored

You can set the `CALIBRATOR_SKY_MODEL` environment variable to the path of the downloaded sky model (or any other sky model that you have locally).