SDP E2E Batch Pipeline Scripts

This page describes the older way of running the end-to-end pipeline using slurm (bash) scripts.

The ska_sdp_e2e_batch_continuum_imaging package contains the batch pipeline (Slurm/bash) scripts used to run the SKA SDP batch pipelines and the SKA SDP batch end-to-end pipeline. This documentation gives general instructions for running the scripts on any infrastructure, HPC or non-HPC, with or without spack packages. For instructions specific to the SKA DP HPC Cluster (currently hosted on AWS), please visit the relevant Confluence page.

Structure of the package

  1. config: Contains configuration files for all pipelines.

  2. data: Contains some extra input data needed for the pipelines.

  3. env: Contains scripts to set up the runtime environment for the pipeline scripts.

  4. packages: Contains setup and config files used to install the pipelines in Python virtual environments.

  5. scripts: Contains all the pipeline scripts.

Descriptions of the scripts

  1. cal_bpp.sh: Runs the Batch Preprocessing Pipeline on the calibrator source.

  2. inst.sh: Runs the Instrumental Calibration Pipeline on the pre-processed calibrator source, i.e. on the output of cal_bpp.sh.

  3. target_bpp.sh: Runs the Batch Preprocessing Pipeline on the target source.

  4. rapthor.sh: Runs Rapthor on the pre-processed target source, i.e. on the output of target_bpp.sh.

  5. e2e.sh: Runs the end-to-end batch continuum imaging pipeline. Internally, it executes the above scripts sequentially.

The scripts use the batchlet tool to:

  1. Set up the Dask cluster for each sub-pipeline in its environment.

  2. Optionally run resource monitoring (CPU, memory) using ska-sdp-benchmark-monitor.

Although the scripts in the ./scripts directory are primarily written to run as Slurm jobs, they can also be run as regular bash scripts. You can therefore run all the pipelines locally, provided that:

  • You have set up the environment (spack / Python environment) needed to run the pipelines. For this, please read the Running the scripts using spack or Local Virtual Environment Setup section.

  • You have the necessary input data available locally (measurement sets, sky model files). For input data other than measurement sets, please read the Getting Input Data section.

  • (Optionally) You modify the dask_params section under the batchlet command in the scripts to match your system specifications.

Of all the scripts, e2e.sh should run out of the box after sourcing one of the .envrc files, depending on the environment. The e2e.sh script takes care of the dependencies between the outputs of the pipelines. For the other scripts, since some pipelines depend on the output of previous pipelines, you may have to set some extra environment variables.

Cloning the scripts locally and running

We use git's sparse-checkout feature to check out only the necessary package from the ska_sdp_e2e_batch_continuum_imaging repository. To download the package, run the following commands in a terminal:

FOLDER_NAME="e2e_scripts"
E2E_PACKAGE_PATH="src/ska_sdp_e2e_batch_continuum_imaging"
E2E_RELEASE_BRANCH="e2e-scripts-release"

mkdir $FOLDER_NAME
cd $FOLDER_NAME

git init -b $E2E_RELEASE_BRANCH
git remote add origin https://gitlab.com/ska-telescope/sdp/science-pipeline-workflows/ska-sdp-e2e-batch-continuum-imaging.git
git sparse-checkout init
git sparse-checkout set $E2E_PACKAGE_PATH
git pull --set-upstream origin $E2E_RELEASE_BRANCH

cd $E2E_PACKAGE_PATH

# IMPORTANT: This variable is used to locate the scripts and default config files
export E2E_SETUP_ROOT=$(pwd)

If sparse-checkout doesn't work for any reason, you can skip the two git sparse-checkout commands in the above snippet. The whole repository will then be checked out instead of only the package folder.

The above commands will:

  • Create a new folder named e2e_scripts (the value of FOLDER_NAME) in your current working directory.

  • Initialize a new git repository in the folder

  • Add the ska_sdp_e2e_batch_continuum_imaging repository as a remote

  • Configure the git repository to check out only the src/ska_sdp_e2e_batch_continuum_imaging folder from the remote repository

  • Download the necessary files from the remote repository

  • Change the current directory to the src/ska_sdp_e2e_batch_continuum_imaging folder

The current folder is still tracked using git, so later you can run git pull to get the latest changes in the package.
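As a quick sanity check after cloning, the sketch below verifies that the five directories listed under "Structure of the package" exist under E2E_SETUP_ROOT. The function name check_e2e_layout is hypothetical and not part of the package; it only illustrates one way to confirm the checkout is complete.

```shell
# Hypothetical helper (not part of the package): verify that the
# directories listed in "Structure of the package" exist under a root.
check_e2e_layout() {
    root="$1"
    missing=0
    for dir in config data env packages scripts; do
        if [ ! -d "${root}/${dir}" ]; then
            echo "missing directory: ${root}/${dir}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Usage: falls back to the current directory if E2E_SETUP_ROOT is unset.
if check_e2e_layout "${E2E_SETUP_ROOT:-$(pwd)}"; then
    echo "package layout looks complete"
else
    echo "package layout is incomplete; check the clone" >&2
fi
```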

Once you have cloned the repository using the steps above, ensure that all the necessary pipelines are installed on your machine. You can either:

  1. Use the spack installed pipelines as explained in Running the scripts using spack section.

  2. Setup the pipelines using local python virtual environments

Then, ensure that you have exported all the environment variables required by the script you want to run (please refer to the Configuring the scripts section). Some of the environment variables can be set to default values by running the following command:

source ${E2E_SETUP_ROOT}/env/defaults.envrc

Finally, to run the scripts, use whichever of the following sets of commands is relevant to your setup:

  1. HPC (Slurm) infrastructure + Spack for pipelines

# Update PATH if necessary, so that OpenMPI and Slurm binaries are available
export PATH="":${PATH}
# Replace `/path/to/spack` with your spack installation directory
source /path/to/spack/share/spack/setup-env.sh

export ENVRC_FILE=${E2E_SETUP_ROOT}/env/spack.envrc

sbatch ${E2E_SETUP_ROOT}/scripts/<script_to_run>

  2. Non-HPC infrastructure + Spack for pipelines

# Replace `/path/to/spack` with your spack installation directory
source /path/to/spack/share/spack/setup-env.sh

export ENVRC_FILE=${E2E_SETUP_ROOT}/env/spack.envrc

bash ${E2E_SETUP_ROOT}/scripts/<script_to_run>

  3. Non-HPC infrastructure + Local virtual environments for packages

export ENVRC_FILE=${E2E_SETUP_ROOT}/env/local_venv.envrc

bash ${E2E_SETUP_ROOT}/scripts/<script_to_run>
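
The setups above differ mainly in the launcher (sbatch or bash) and in the .envrc file. The sketch below picks the launcher automatically, under the assumption that finding sbatch on PATH means the script should be submitted as a Slurm job. The function name pick_launcher and the LAUNCHER variable are hypothetical; ENVRC_FILE still has to be exported as shown above.

```shell
# Hypothetical helper: choose `sbatch` when Slurm is available, else `bash`.
pick_launcher() {
    if command -v sbatch >/dev/null 2>&1; then
        echo "sbatch"
    else
        echo "bash"
    fi
}

LAUNCHER="$(pick_launcher)"
echo "scripts would be launched with: ${LAUNCHER}"

# Example invocation (commented out; replace <script_to_run> with e.g. e2e.sh):
# "${LAUNCHER}" "${E2E_SETUP_ROOT}/scripts/<script_to_run>"
```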

By default all scripts will store output in the script's current working directory. You can change that in multiple ways as shown below:

# 1. Set OUTPUT_DIR environment variable.
#    Works for both HPC and non-HPC environments
OUTPUT_DIR=/path/to/output/directory <sbatch/bash> ${E2E_SETUP_ROOT}/scripts/<script_to_run>

# 2. On HPC (Slurm), use `--chdir` option to `sbatch` command
#    to change runtime current working directory
sbatch --chdir=/path/to/output/directory ${E2E_SETUP_ROOT}/scripts/<script_to_run>
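
The OUTPUT_DIR behaviour documented on this page (fall back to the current working directory, create the directory if it does not exist) can be pictured roughly as follows. This is an illustration of the described behaviour, not the scripts' actual code; resolve_output_dir is a hypothetical name.

```shell
# Hypothetical sketch: fall back to the current directory when OUTPUT_DIR
# is unset or empty, and create the directory if it does not exist yet.
resolve_output_dir() {
    out="${OUTPUT_DIR:-$(pwd)}"
    mkdir -p "${out}" || return 1
    # Print the absolute path, which is safer in a distributed setup.
    (cd "${out}" && pwd)
}

echo "output will be written to: $(resolve_output_dir)"
```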

Configuring the scripts

The scripts are configured using environment variables, so in most cases you don't need to modify them.

Each script first checks whether the necessary environment variables are set, and only then runs the actual pipeline. These checks are done in the first few lines of each script.

You need to ensure that at least the values needed by the script you wish to run are exported as correct environment variables.
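
If you want to run a similar check yourself before submitting a job, a minimal sketch could look like the one below. It is not the check used inside the scripts; the function name require_vars is hypothetical, and the variable list in the example is taken from the tables in this section for cal_bpp.sh.

```shell
# Hypothetical helper: report every variable in the argument list that is
# unset or empty, and return non-zero if any of them is missing.
require_vars() {
    status=0
    for name in "$@"; do
        # Indirect lookup of the variable called ${name} (POSIX-compatible).
        eval "value=\${${name}:-}"
        if [ -z "${value}" ]; then
            echo "required environment variable not set: ${name}" >&2
            status=1
        fi
    done
    return "${status}"
}

# Example: variables used by cal_bpp.sh according to the tables.
if require_vars E2E_SETUP_ROOT ENVRC_FILE CALIBRATOR CAL_BPP_CONFIG; then
    echo "all variables for cal_bpp.sh are set"
fi
```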

To get started quickly, we provide defaults.envrc in the ./env directory. Sourcing this file with source ${E2E_SETUP_ROOT}/env/defaults.envrc sets a few of the environment variables. You still need to set the input data path variables (i.e. CALIBRATOR or TARGET).

The following tables describe the environment variables used by the scripts.

Common variables

| Environment Variable | Description | Value in defaults.envrc |
|---|---|---|
| E2E_SETUP_ROOT | The absolute path to the ska_sdp_e2e_batch_continuum_imaging package, which contains all the scripts and config files. | |
| ENVRC_FILE | Path to the .envrc file which sets up the executables needed to run the pipelines. This file is specific to the infrastructure (AWS, Local, etc). By default, it is present in the ${E2E_SETUP_ROOT}/env/ directory. | |
| OUTPUT_DIR | Path to the directory where the script will write its output. Defaults to the current working directory. Should be an absolute path in order to avoid issues in a distributed setup. If this value points to a non-existent path, the directory will be created by the script before running the actual pipeline. | |

Script specific variables

| Environment Variable | Description | Value in defaults.envrc |
|---|---|---|
| cal_bpp.sh | | |
| CALIBRATOR | Path to the measurement set of the calibrator source. Should contain the path to a single measurement set (MSv2). Multiple measurement sets are currently not supported. | |
| CAL_BPP_CONFIG | Path to the configuration yaml file for the Batch Preprocessing Pipeline, used while processing the calibrator source. | ${E2E_SETUP_ROOT}/config/cal_bpp.yaml |
| inst.sh | | |
| PRE_PROCESSED_CALIBRATOR | Path to the output of the Batch Preprocessing Pipeline run on the calibrator source. Should be a single MSv2 path. No need to set this variable for e2e.sh. | |
| CALIBRATOR_SKY_MODEL | Path to the sky model of the calibrator source in OSKAR CSV format. | |
| INST_CONFIG | Path to the configuration yaml file for the Instrumental Calibration pipeline. | ${E2E_SETUP_ROOT}/config/inst.yaml |
| INST_CACHE_DIR | (optional) Path to the cache directory used to store the intermediate zarr file when running the INST pipeline. | |
| target_bpp.sh | | |
| TARGET | Path to the measurement set of the target source. Should contain the path to a single measurement set (MSv2). Multiple measurement sets are currently not supported. | |
| TARGET_BPP_CONFIG | Path to the configuration yaml file for the Batch Preprocessing Pipeline, used while processing the target source. | ${E2E_SETUP_ROOT}/config/target_bpp.yaml |
| rapthor.sh | | |
| PRE_PROCESSED_TARGET | Path to the output of the Batch Preprocessing Pipeline run on the target source. Should be a single MSv2 path. No need to set this variable for e2e.sh. | |
| RAPTHOR_PARSET | Path to the parset file for the rapthor pipeline. | ${E2E_SETUP_ROOT}/config/rapthor_defaults_skalow.parset |
| RAPTHOR_STRATEGY_FILE | Path to the strategy file for the rapthor pipeline. | ${E2E_SETUP_ROOT}/data/rapthor_custom_ska_low_strategy.py |
| RAPTHOR_BATCH_SYSTEM | Updates the "batch_system" parameter in the rapthor parset, if the corresponding placeholder is present in the parset. Has no effect if the input parset has no placeholders to replace. Can be either "single_machine" or "slurm". | "single_machine" |
| RAPTHOR_MAX_NODES | Similar to RAPTHOR_BATCH_SYSTEM, this updates the "max_nodes" parameter in the rapthor parset, if such a placeholder is present. | 1 |
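
The placeholder substitution described for RAPTHOR_BATCH_SYSTEM can be pictured with sed. This is only a sketch: the placeholder token @BATCH_SYSTEM@, the function name fill_batch_system, and the one-line demo parset are assumptions made for illustration, since the actual placeholder syntax used by rapthor.sh is not documented on this page.

```shell
# ASSUMPTION: the placeholder token "@BATCH_SYSTEM@" is hypothetical; the
# real parset may use a different placeholder syntax.
fill_batch_system() {
    parset="$1"
    value="${RAPTHOR_BATCH_SYSTEM:-single_machine}"
    # sed prints the text unchanged when the placeholder is absent,
    # which mirrors the "no effect without placeholders" behaviour.
    sed "s/@BATCH_SYSTEM@/${value}/g" "${parset}"
}

# Demo on a throwaway file containing a single parset line.
demo_parset="$(mktemp)"
printf 'batch_system = @BATCH_SYSTEM@\n' > "${demo_parset}"
RAPTHOR_BATCH_SYSTEM="slurm"
fill_batch_system "${demo_parset}"
```

The function writes the substituted parset to standard output and leaves the input file untouched, so the default parset shipped in the config directory is never modified in place.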

Other variables

| Environment Variable | Description | Value in defaults.envrc |
|---|---|---|
| DP3_PREFIX | The prefix path of the DP3 installation. This folder must contain a "bin" folder, which contains the "DP3" executable. This is needed only when you are running pipelines in a local environment. | |
| WSCLEAN_PREFIX | The prefix path of the wsclean installation. This folder must contain a "bin" folder, which contains the "wsclean" executable. This is needed only when you are running pipelines in a local environment. | |

Local Virtual Environment Setup

NOTE: This setup will work for all pipelines except Rapthor. To install Rapthor, it is preferred to use spack package.

The packages directory contains Python scripts which can be used to create Python virtual environments separately for each sub-pipeline (BPP, INST). The directory contains two Python files:

  1. config.py: Defines the configuration of the environments (e.g. name of the environment, installed packages).

  2. setup.py: Creates the python environments defined in config.py.

In order to run setup.py, you need to ensure that python3.11 (or python3.10) is installed on your machine. This can be a system installed python, or from conda, or from any other source. This python will be used by setup.py, as the base python for newly created virtual environments.

To create the environments, simply run

python3.11 ./packages/setup.py # python3.10 should also work
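
To pick whichever of the two supported interpreters is installed, a small helper such as the one below can be used. The function name first_available and the BASE_PYTHON variable are hypothetical; the sketch only looks the commands up on PATH and does not verify the interpreter version itself.

```shell
# Hypothetical helper: print the first command from the argument list that
# is available on PATH; return non-zero if none is found.
first_available() {
    for candidate in "$@"; do
        if command -v "${candidate}" >/dev/null 2>&1; then
            echo "${candidate}"
            return 0
        fi
    done
    return 1
}

if BASE_PYTHON="$(first_available python3.11 python3.10)"; then
    echo "using ${BASE_PYTHON} as the base interpreter"
    # "${BASE_PYTHON}" ./packages/setup.py
else
    echo "neither python3.11 nor python3.10 found on PATH" >&2
fi
```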

The environments will be created inside the ./packages/ directory. Logs of the last installation are also stored in the ./packages directory for future reference.

Apart from the Python environments, there are a few other external dependencies:

  1. DP3 @ 6.4

You have to install these dependencies yourself, for example using spack, a system package manager, or by building from source. Using spack is the preferred way. Please refer to the README of the ska-sdp-spack repository for instructions on how to set up spack.

As mentioned in the Configuring the scripts section, you need to export the following environment variables so that the pipelines can find the respective executables.

export DP3_PREFIX= # The prefix path of DP3 installation.
# Above folder must contain "bin" folder, which contains "DP3" executable.
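
Before running the pipelines, you may want to confirm that the prefix really contains the executable. The sketch below does this for DP3; the function name has_executable is hypothetical, and the same call with WSCLEAN_PREFIX and wsclean would cover the other prefix variable listed under Other variables.

```shell
# Hypothetical helper: succeed only if <prefix>/bin/<name> is an
# executable file (and not a directory).
has_executable() {
    prefix="$1"
    name="$2"
    [ -n "${prefix}" ] && [ -x "${prefix}/bin/${name}" ] && [ ! -d "${prefix}/bin/${name}" ]
}

if has_executable "${DP3_PREFIX:-}" DP3; then
    echo "DP3 found at ${DP3_PREFIX}/bin/DP3"
else
    echo "DP3_PREFIX is unset or does not contain bin/DP3" >&2
fi
```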

Once the local environment setup is done, follow the instructions in running the scripts section to run the scripts on your machine.

Getting Input Data

This section explains how to get the input data, apart from the calibrator and target measurement sets. This is needed if you wish to run the scripts in any other infrastructure than the SKA DP AWS HPC cluster.

Tip: You can use the ./data folder to store the input data files.

Calibrator source sky model

The Instrumental Calibration Pipeline (INST) expects a sky model of the calibrator source as one of its inputs. This parameter is set in the configuration yaml file, in the predict_vis stage. The sky model must be in the OSKAR CSV format.

In the inst.sh script, this sky model path is passed as environment variable CALIBRATOR_SKY_MODEL (as specified in configuring the scripts).

For simulated datasets, the sky models are stored along with the data. They can be downloaded from either:

  1. The relevant Confluence page for simulated datasets

  2. The AWS S3 bucket where the simulated dataset is stored

You can set the CALIBRATOR_SKY_MODEL environment variable to the path of the downloaded sky model (or of any other sky model that you have locally).