SDP E2E Batch Pipeline Scripts

This page describes the older way of running the end-to-end pipeline using slurm (bash) scripts.

The ska_sdp_e2e_batch_continuum_imaging package contains the batch pipeline (slurm/bash) scripts used to run the SKA SDP batch pipelines and the SKA SDP batch end-to-end pipeline. This documentation gives general instructions for running the scripts on any infrastructure, HPC or non-HPC, with or without spack packages. For instructions specific to the SKA DP HPC Cluster (currently hosted on AWS), please visit the relevant confluence page.

Structure of the package

config: Contains configuration files for all pipelines.
data: Contains some extra input data needed for the pipelines.
env: Contains scripts to setup runtime environment for pipeline scripts.
packages: Contains setup and config files used to install the pipelines in Python virtual environments.
scripts: Contains all the pipeline scripts.

Descriptions of the scripts

cal_bpp.sh: Runs the Batch Preprocessing Pipeline on the calibrator source.
inst.sh: Runs the Instrumental Calibration Pipeline on the pre-processed calibrator source, i.e. on the output of cal_bpp.sh.
target_bpp.sh: Runs the Batch Preprocessing Pipeline on the target source.
rapthor.sh: Runs Rapthor on the pre-processed target source, i.e. on the output of target_bpp.sh.
e2e.sh: Runs the end-to-end batch continuum imaging pipeline. Internally it executes the above scripts sequentially.

The scripts use the batchlet tool to:

Set up the dask cluster for each sub-pipeline in its environment.
Optionally, run resource monitoring (CPU, memory) using ska-sdp-benchmark-monitor.

Even though the scripts inside ./scripts directory are primarily made to run as slurm jobs, these can be run as regular bash scripts. Therefore, you can run all the pipelines locally, provided that:

You have set up the environment (spack or a Python virtual environment) needed to run the pipelines. For this, please read the Running the scripts using spack or Local Virtual Environment Setup sections.
You have necessary input data available locally (measurement sets, sky model files). For input data apart from measurement sets, please read Getting Input Data section.
(Optionally) You modify the dask_params section under batchlet command in the scripts, according to your system specifications.

Of all the scripts, e2e.sh should run out of the box after you source the appropriate .envrc file for your environment. It takes care of passing the outputs of one pipeline to the next. The other scripts depend on the outputs of earlier pipelines, so you may have to set some extra environment variables for them.

Running the scripts using spack (recommended)

If you have spack on your machine, you can use spack to install all the pipelines and the scripts, and use the scripts directly. This will ensure consistency across all the environments (i.e. you will have the exact same packages as it would be on SKA's staging / production infrastructure).

The spack packages for the end-to-end pipeline and all the sub-pipelines are part of the ska-sdp-spack repository. You can follow its README to set up spack on your machine. Then install the e2e pipeline using this command:

spack install ska-sdp-e2e-batch-continuum-imaging

This will install all the scripts and default configuration files in the standard spack installation directories. This doesn't install any of the sub-pipelines. To install them, please run

spack install py-ska-sdp-batch-preprocess \
    py-ska-sdp-instrumental-calibration \
    py-rapthor

Since the end-to-end scripts rely on environment modules to update the runtime environment's *PATH variables, you need to generate modulefiles from the spack packages. To do this, run:

spack module tcl refresh -y

NOTE: You need to run the above command again to regenerate the modulefiles whenever you reinstall any of the pipelines.

To ensure the modulefiles are present in the MODULEPATH environment variable, please run:

source /path/to/spack/share/spack/setup-env.sh

Finally, to run any of the pipeline scripts present in this package, you have to run:

module load ska-sdp-e2e-batch-continuum-imaging
# to set up some of the default env vars
source ${E2E_SETUP_ROOT}/env/defaults.envrc

# script_name can be "cal_bpp.sh", "inst.sh" or "e2e.sh" etc.
bash <script_name>

You can use sbatch instead of bash to run the scripts as a slurm job on HPC infrastructure.
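If the same command line has to work both on a Slurm cluster and on a workstation, the choice between sbatch and bash can be automated. The following is a minimal sketch and not part of the package: pick_launcher is a hypothetical helper name, and it assumes that finding sbatch on PATH means you want to submit a Slurm job.

```shell
# Hypothetical helper (not shipped with the package): prints "sbatch" when the
# Slurm client is found on PATH, and "bash" otherwise.
pick_launcher() {
    if command -v sbatch >/dev/null 2>&1; then
        echo "sbatch"
    else
        echo "bash"
    fi
}

# Show the command line that would be used for a given script name.
echo "$(pick_launcher) cal_bpp.sh"
```

Detecting sbatch is only a heuristic. On a login node where you want an interactive run, call bash explicitly instead.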

The only downside of running the scripts directly from spack is that they can be difficult to modify (for example, because of permissions on the spack installation directory). In such cases, we recommend that you clone the scripts locally, customise them, and run them with either spack packages or local virtual environments. For spack, follow the steps above to install the pipelines, then use them with your custom scripts.

Cloning the scripts locally and running

We use git's sparse-checkout feature to clone only the necessary package from the ska_sdp_e2e_batch_continuum_imaging repository. To download the package, run the following commands in a terminal:

FOLDER_NAME="e2e_scripts"
E2E_PACKAGE_PATH="src/ska_sdp_e2e_batch_continuum_imaging"
E2E_RELEASE_BRANCH="e2e-scripts-release"

mkdir $FOLDER_NAME
cd $FOLDER_NAME

git init -b $E2E_RELEASE_BRANCH
git remote add origin https://gitlab.com/ska-telescope/sdp/science-pipeline-workflows/ska-sdp-e2e-batch-continuum-imaging.git
git sparse-checkout init
git sparse-checkout set $E2E_PACKAGE_PATH
git pull --set-upstream origin $E2E_RELEASE_BRANCH

cd $E2E_PACKAGE_PATH

# IMPORTANT: This variable is used to locate the scripts and default config files
export E2E_SETUP_ROOT=$(pwd)

If sparse-checkout doesn't work for any reason, you can skip the two git sparse-checkout commands in the above snippet; git pull will then download the whole repository.

The above commands will:

Create a new folder named e2e_scripts in your current working directory.
Initialize a new git repository in the folder.
Add the ska_sdp_e2e_batch_continuum_imaging repository as a remote.
Configure the git repository to download only the src/ska_sdp_e2e_batch_continuum_imaging folder from the remote repository.
Download the necessary files from the remote repository.
Change the current directory to the src/ska_sdp_e2e_batch_continuum_imaging folder.

The current folder is still tracked using git, so later you can run git pull to get the latest changes in the package.
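A quick sanity check after cloning is to confirm that E2E_SETUP_ROOT points at a folder with the layout described in Structure of the package. The following is a minimal sketch that assumes only the five directory names listed there; check_setup_root is a hypothetical helper, not part of the package.

```shell
# Hypothetical helper: verifies that the given folder contains the five
# sub-directories listed in "Structure of the package".
# The body runs in a subshell, so its variables do not leak into the caller.
check_setup_root() (
    root="$1"
    status=0
    for dir in config data env packages scripts; do
        if [ ! -d "${root}/${dir}" ]; then
            echo "missing directory: ${root}/${dir}" >&2
            status=1
        fi
    done
    return "$status"
)

# Usage: report problems without aborting the current shell.
check_setup_root "${E2E_SETUP_ROOT:-.}" \
    || echo "E2E_SETUP_ROOT does not look like the package folder" >&2
```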

Once you have cloned the repository using the above steps, make sure that all the necessary pipelines are installed on your machine. You can either:

Use the spack-installed pipelines, as explained in the Running the scripts using spack section.
Set up the pipelines using local Python virtual environments, as explained in the Local Virtual Environment Setup section.

Then, make sure that you export all the environment variables required by the script you want to run (please refer to the Configuring the scripts section). Some of them can be set to default values by running the following command:

source ${E2E_SETUP_ROOT}/env/defaults.envrc

Finally, to run the scripts, use whichever of the following sets of commands is relevant to your setup:

HPC (Slurm) infrastructure + Spack for pipelines

# Update PATH if necessary, so that OpenMPI and Slurm binaries are available.
# Replace the placeholder with the directory that contains them.
export PATH="/path/to/bin:${PATH}"
# Replace `/path/to/spack` with your spack installation directory
source /path/to/spack/share/spack/setup-env.sh

export ENVRC_FILE=${E2E_SETUP_ROOT}/env/spack.envrc

sbatch ${E2E_SETUP_ROOT}/scripts/<script_to_run>

Non-HPC infrastructure + Spack for pipelines

# Replace `/path/to/spack` with your spack installation directory
source /path/to/spack/share/spack/setup-env.sh

export ENVRC_FILE=${E2E_SETUP_ROOT}/env/spack.envrc

bash ${E2E_SETUP_ROOT}/scripts/<script_to_run>

Non-HPC infrastructure + Local virtual environments for packages

export ENVRC_FILE=${E2E_SETUP_ROOT}/env/local_venv.envrc

bash ${E2E_SETUP_ROOT}/scripts/<script_to_run>

By default all scripts will store output in the script's current working directory. You can change that in multiple ways as shown below:

# 1. Set OUTPUT_DIR environment variable.
#    Works for both HPC and non-HPC environments
OUTPUT_DIR=/path/to/output/directory <sbatch/bash> ${E2E_SETUP_ROOT}/scripts/<script_to_run>

# 2. On HPC (Slurm), use `--chdir` option to `sbatch` command
#    to change runtime current working directory
sbatch --chdir=/path/to/output/directory ${E2E_SETUP_ROOT}/scripts/<script_to_run>
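When several runs share one base directory, a per-run sub-directory keeps their outputs apart. The following is a minimal sketch: new_run_dir and the run_<timestamp> naming scheme are assumptions made for illustration, not conventions of the package. The helper prints an absolute path, which is what OUTPUT_DIR should be on distributed setups.

```shell
# Hypothetical helper: prints an absolute, timestamped path below the given
# base directory. Only the base directory is created here; the pipeline
# script creates OUTPUT_DIR itself if it does not exist yet.
new_run_dir() (
    base="$1"
    mkdir -p "$base" || return 1
    # Resolve the base directory to an absolute path.
    abs_base=$(cd "$base" && pwd) || return 1
    echo "${abs_base}/run_$(date +%Y%m%d_%H%M%S)"
)

# Usage:
#   OUTPUT_DIR=$(new_run_dir ./outputs) bash ${E2E_SETUP_ROOT}/scripts/e2e.sh
```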

Configuring the scripts

The scripts are configured using environment variables, so in most cases you don't need to modify the scripts themselves.

Each script first checks whether the necessary environment variables are set, and only then runs the actual pipeline. These checks are done in the first few lines of each script.

Make sure that at least the variables needed by the script you wish to run are exported with correct values.
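A similar check can be run by hand before launching a script, which is cheaper than waiting in a queue only to fail on a missing variable. The following is a minimal sketch and not the code used inside the scripts: check_required_vars is a hypothetical helper, and the variable list in the usage line is taken from the common and cal_bpp.sh rows of the tables on this page.

```shell
# Hypothetical helper: succeeds only if every named variable is exported
# with a non-empty value; otherwise lists the offending names on stderr.
check_required_vars() (
    missing=""
    for name in "$@"; do
        # printenv only sees exported variables, which is also what a
        # child process started from this shell will see.
        if [ -z "$(printenv "$name")" ]; then
            missing="${missing} ${name}"
        fi
    done
    if [ -n "$missing" ]; then
        echo "unset or empty:${missing}" >&2
        return 1
    fi
)

# Usage: variables needed before running cal_bpp.sh.
check_required_vars E2E_SETUP_ROOT ENVRC_FILE CALIBRATOR CAL_BPP_CONFIG \
    || echo "export the variables listed above, then retry" >&2
```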

To get started quickly, we provide defaults.envrc inside the ./env directory. Sourcing this file with source ${E2E_SETUP_ROOT}/env/defaults.envrc sets a few of the environment variables. You still need to set the input data path variables (i.e. CALIBRATOR or TARGET).

The following tables describe the environment variables that the scripts may require.

Common variables

Environment Variable	Description	Value in `defaults.envrc`
E2E_SETUP_ROOT	The absolute path to `ska_sdp_e2e_batch_continuum_imaging` package which contains all the scripts and config files.
ENVRC_FILE	Path to the .envrc file which sets up executables to run the pipelines. This file is specific to the infrastructure (AWS, Local, etc). By default it is present in the `${E2E_SETUP_ROOT}/env/` directory.
OUTPUT_DIR	Path to the directory where script will write its output. Defaults to current working directory. Should be absolute path in order to avoid issues in distributed setup. If this value points to non-existent path, the directory will be created by the script before running the actual pipeline.

Script specific variables

Environment Variable	Description	Value in `defaults.envrc`
cal_bpp.sh
CALIBRATOR	Path to the measurement set of the calibrator source. Should contain path to a single measurement set (MSv2). Multiple measurement sets are currently not supported.
CAL_BPP_CONFIG	Path to the configuration yaml file for the Batch Preprocessing Pipeline, while processing the calibrator source.	`${E2E_SETUP_ROOT}/config/cal_bpp.yaml`
inst.sh
PRE_PROCESSED_CALIBRATOR	Path to the output of the Batch Preprocessing Pipeline, run on the calibrator source. Should be a single MSv2 path. No need to set this variable for `e2e.sh`.
CALIBRATOR_SKY_MODEL	Path to the sky model of the calibrator source in OSKAR CSV format.
INST_CONFIG	Path to the configuration yaml file for the Instrumental Calibration pipeline.	`${E2E_SETUP_ROOT}/config/inst.yaml`
INST_CACHE_DIR	(optional) Path to cache directory to store intermediate zarr file when running INST pipeline
target_bpp.sh
TARGET	Path to the measurement set of the target source. Should contain path to a single measurement set (MSv2). Multiple measurement sets are currently not supported.
TARGET_BPP_CONFIG	Path to the configuration yaml file for the Batch Preprocessing Pipeline, while processing the target source.	`${E2E_SETUP_ROOT}/config/target_bpp.yaml`
rapthor.sh
PRE_PROCESSED_TARGET	Path to the output of the Batch Preprocessing Pipeline, run on the target source. Should be a single MSv2 path. No need to set this variable for `e2e.sh`.
RAPTHOR_PARSET	Path to the parset file for rapthor pipeline	`${E2E_SETUP_ROOT}/config/rapthor_defaults_skalow.parset`
RAPTHOR_STRATEGY_FILE	Path to strategy file for rapthor pipeline.	`${E2E_SETUP_ROOT}/data/rapthor_custom_ska_low_strategy.py`
RAPTHOR_BATCH_SYSTEM	This will update the "batch_system" parameter in rapthor parset, if the corresponding placeholder is present in the parset. This has no effect if the input parset has no placeholders to replace. Can be either "single_machine" or "slurm".	"single_machine"
RAPTHOR_MAX_NODES	Similar to `RAPTHOR_BATCH_SYSTEM`, this will update the "max_nodes" param in rapthor parset, if such placeholder is present.	1
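Because RAPTHOR_BATCH_SYSTEM accepts only two values, a typo is easy to catch before the run starts. The following is a minimal sketch: valid_batch_system is a hypothetical helper, it assumes the values are matched exactly as written in the table, and the fallback to "single_machine" mirrors the default listed there.

```shell
# Hypothetical helper: accepts only the two values documented for
# RAPTHOR_BATCH_SYSTEM (exact, case-sensitive match is an assumption).
valid_batch_system() {
    case "$1" in
        single_machine|slurm) return 0 ;;
        *) return 1 ;;
    esac
}

# Fall back to the value from defaults.envrc when the variable is unset.
batch_system="${RAPTHOR_BATCH_SYSTEM:-single_machine}"
if valid_batch_system "$batch_system"; then
    echo "RAPTHOR_BATCH_SYSTEM=${batch_system}"
else
    echo "unsupported RAPTHOR_BATCH_SYSTEM: ${batch_system}" >&2
fi
```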

Other variables

Environment Variable	Description	Value in `defaults.envrc`
DP3_PREFIX	The prefix path of DP3 installation. This folder must contain "bin" folder, which contains "DP3" executable. This is needed only when you are running pipelines in a local environment.
WSCLEAN_PREFIX	The prefix path of wsclean installation. This folder must contain "bin" folder, which contains "wsclean" executable. This is needed only when you are running pipelines in a local environment.

Local Virtual Environment Setup

NOTE: This setup works for all pipelines except Rapthor. To install Rapthor, using the spack package is preferred.

The packages directory contains Python scripts which can be used to create a separate Python virtual environment for each sub-pipeline (BPP, INST). The directory contains 2 Python files:

config.py: Defines the configuration of the environments (e.g. name of the environment, installed packages).
setup.py: Creates the python environments defined in config.py.

In order to run setup.py, you need to ensure that python3.11 (or python3.10) is installed on your machine. This can be a system installed python, or from conda, or from any other source. This python will be used by setup.py, as the base python for newly created virtual environments.

To create the environments, simply run

python3.11 ./packages/setup.py # python3.10 should also work

The environments will be created inside the ./packages/ directory. Logs of the last installation are also stored in the ./packages directory for future reference.
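If you are unsure which interpreter is available on a machine, the choice can be scripted. The following is a minimal sketch: pick_python is a hypothetical helper that prefers python3.11 and falls back to python3.10, the two versions named above.

```shell
# Hypothetical helper: prints the first supported interpreter found on PATH.
# The body runs in a subshell, so the loop variable does not leak.
pick_python() (
    for candidate in python3.11 python3.10; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    echo "neither python3.11 nor python3.10 found on PATH" >&2
    return 1
)

# Usage, from the package folder:
#   "$(pick_python)" ./packages/setup.py
```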

Apart from the Python environments, there are a few other external dependencies:

DP3 @ 6.4

You have to install these dependencies by any means, for example using spack, system package manager or building from source. Using spack is the preferred way. Please refer to the README of ska-sdp-spack repository for instructions on how to setup spack.

As mentioned in the Configuring the scripts section, you need to export the following environment variables so that the pipelines can find the respective executables.

export DP3_PREFIX= # The prefix path of DP3 installation.
# Above folder must contain "bin" folder, which contains "DP3" executable.
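A wrong prefix usually surfaces only when a pipeline tries to call DP3. The following is a minimal sketch of an early check that assumes nothing beyond the bin/DP3 layout described above; check_prefix is a hypothetical helper, not part of the package.

```shell
# Hypothetical helper: succeeds if <prefix>/bin/<exe> exists and is executable.
check_prefix() (
    prefix="$1"
    exe="$2"
    if [ -x "${prefix}/bin/${exe}" ]; then
        return 0
    fi
    echo "no executable ${exe} under ${prefix}/bin" >&2
    return 1
)

# Usage:
check_prefix "${DP3_PREFIX:-}" DP3 \
    || echo "set DP3_PREFIX to the DP3 installation prefix" >&2
```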

Once the local environment setup is done, follow the instructions in running the scripts section to run the scripts on your machine.

Getting Input Data

This section explains how to get the input data other than the calibrator and target measurement sets. This is needed if you wish to run the scripts on any infrastructure other than the SKA DP AWS HPC cluster.

Tip: You can use the ./data folder to store the input data files.

Calibrator source sky model

The Instrumental Calibration Pipeline (INST) expects a sky model of the calibrator source as one of its inputs. This parameter is set in the configuration yaml file, in the predict_vis stage. The sky model must be in the OSKAR CSV format.

In the inst.sh script, this sky model path is passed as environment variable CALIBRATOR_SKY_MODEL (as specified in configuring the scripts).

For simulated datasets, the sky models are provided along with the data. They can be downloaded from either:

The relevant confluence page for simulated datasets
The AWS S3 bucket where the simulated dataset is stored

You can set the CALIBRATOR_SKY_MODEL environment variable to the path of the downloaded skymodel (or any different sky model that you have locally).