Overview

The batch pre-processing pipeline processes one or more CASA Measurement Sets. It follows a single-configuration, multiple-data design: the sequence of steps defined in the configuration file is applied independently to each input, in parallel.

Command-line interface

A typical invocation is:

ska-sdp-batch-preprocess run \
    --config myConfig.yaml \
    --output-dir /path/to/base_output_dir \
    --dask-scheduler localhost:8786 \
    input1.ms \
    input2.ms \
    ...

Arguments

Positional arguments

  • One or more input MeasurementSets in MSv2 format.

Required named arguments

  • --config Path to a YAML configuration file defining the sequence of pre-processing steps and their parameters.

  • --output-dir Directory where output MeasurementSets are written; it must not contain existing MSes.

Optional named arguments

  • --sdm-path Path to a Science Data Model (SDM) directory. Enables SDM mode, where calibration tables and other inputs are resolved from the SDM, and logs and QA products are written into it. See SDM Mode Guide for details.

  • --dask-scheduler Network address of a Dask scheduler. If provided, processing tasks are distributed across the associated Dask cluster. See Dask Scaling for details on configuring a Dask cluster.

  • --frequency-chunk-hz Split the processing of each input MeasurementSet into independent frequency chunks with a maximum width (in Hz). If not specified, no frequency splitting is performed.
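
As an illustration of how the arguments listed above fit together, the following Python sketch reproduces the interface with argparse. It is not the pipeline's actual parser: the function name build_parser, the attribute name input_ms, and the choice to keep paths as plain strings are assumptions made for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(prog="ska-sdp-batch-preprocess")
    subparsers = parser.add_subparsers(dest="command", required=True)
    run = subparsers.add_parser("run")

    # Positional: one or more input MeasurementSets (MSv2)
    run.add_argument("input_ms", nargs="+")

    # Required named arguments
    run.add_argument("--config", required=True)
    run.add_argument("--output-dir", required=True)

    # Optional named arguments; None means "feature not requested"
    run.add_argument("--sdm-path", default=None)
    run.add_argument("--dask-scheduler", default=None)
    run.add_argument("--frequency-chunk-hz", type=float, default=None)
    return parser


# Parse the typical invocation shown earlier in this document
args = build_parser().parse_args(
    [
        "run",
        "--config", "myConfig.yaml",
        "--output-dir", "/path/to/base_output_dir",
        "--dask-scheduler", "localhost:8786",
        "input1.ms",
        "input2.ms",
    ]
)
print(args.input_ms, args.frequency_chunk_hz)
```

Leaving the optional arguments at None makes it easy for later code to distinguish "not requested" from any explicit value.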

Inputs and outputs

Inputs are CASA Measurement Sets. They need not be related, although in the SKA production environment they are expected to represent distinct time intervals of one continuous observation.

For an input named <BASE_INPUT_NAME>.ms, the corresponding output MSv2 path is <OUTPUT_DIR>/<BASE_INPUT_NAME>.ms, where <OUTPUT_DIR> is the value passed to --output-dir.
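
The naming rule above can be sketched in a few lines of Python. The helper output_path_for is hypothetical and only illustrates the mapping; it is not taken from the pipeline's code.

```python
from pathlib import Path


def output_path_for(input_ms: str, output_dir: str) -> Path:
    """Map an input MS path to <OUTPUT_DIR>/<BASE_INPUT_NAME>.ms."""
    # Path.name keeps only the last path component, e.g. "input1.ms";
    # pathlib also ignores a trailing slash on the input path.
    return Path(output_dir) / Path(input_ms).name


print(output_path_for("/data/obs/input1.ms", "/path/to/base_output_dir"))
```

In this sketch, two inputs that share a base name but live in different directories would map to the same output path; how the pipeline treats that case is not covered here.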

Frequency chunking

Frequency chunking exists primarily to work around a limitation of DP3 when performing demixing: the Demixer step cannot fit different gains in different frequency bands within a single run.

When demixing is enabled, --frequency-chunk-hz should be set to the maximum bandwidth over which the corrupting gains affecting the bright sources can be considered uniform. These gains include, in particular, the primary beam response and ionospheric effects.
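
To make the effect of --frequency-chunk-hz concrete, here is a minimal Python sketch of one way to split a band into chunks no wider than a given maximum. It assumes uniformly spaced channels and chunks made of whole channels. The function split_channels and its parameters are hypothetical, and the pipeline's actual chunking rule may differ.

```python
import math


def split_channels(
    num_channels: int,
    channel_width_hz: float,
    max_chunk_width_hz: float,
) -> list[tuple[int, int]]:
    """Return half-open channel ranges [start, stop), one per chunk."""
    if num_channels <= 0 or channel_width_hz <= 0 or max_chunk_width_hz <= 0:
        raise ValueError("all arguments must be positive")

    # Largest whole number of channels that fits in the maximum width.
    # If a single channel is already wider than the limit, fall back to
    # one channel per chunk (an assumption of this sketch).
    channels_per_chunk = max(1, math.floor(max_chunk_width_hz / channel_width_hz))

    return [
        (start, min(start + channels_per_chunk, num_channels))
        for start in range(0, num_channels, channels_per_chunk)
    ]


# 25 channels of 100 kHz, chunks of at most 1 MHz
print(split_channels(25, 100e3, 1e6))
```

With these example values, each full chunk holds 10 channels and the last chunk holds the remaining 5.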

Parallelism

The input Measurement Sets are processed in parallel and independently of one another. If frequency chunking is used, the work for each input is further split into independent frequency chunks, and the resulting output chunks are concatenated along the frequency axis at the end.

Frequency chunking can help the pipeline scale better when compute-intensive steps, demixing in particular, are included.
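
The fan-out and concatenation pattern described in this section can be sketched as follows. This is an illustration only: it uses a standard-library thread pool in place of a Dask cluster, and process_chunk, concatenate and run_all are hypothetical placeholders that manipulate strings rather than real Measurement Sets.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(ms_name: str, chunk: tuple[int, int]) -> str:
    """Placeholder for running the configured steps on one frequency chunk."""
    start, stop = chunk
    return f"{ms_name}[{start}:{stop}]"


def concatenate(chunk_outputs: list[str]) -> str:
    """Placeholder for frequency-concatenating the chunks of one input."""
    return "+".join(chunk_outputs)


def run_all(
    inputs: list[str],
    chunks: list[tuple[int, int]],
    max_workers: int = 4,
) -> dict[str, str]:
    """Process every (input, chunk) pair independently, then merge per input."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per (input, chunk) pair; tasks do not depend on each other.
        futures = {
            ms: [pool.submit(process_chunk, ms, chunk) for chunk in chunks]
            for ms in inputs
        }
        # Results are collected in submission order, so chunks stay sorted
        # by frequency regardless of which task finishes first.
        return {
            ms: concatenate([f.result() for f in futs])
            for ms, futs in futures.items()
        }


print(run_all(["input1.ms", "input2.ms"], [(0, 10), (10, 20)]))
```

Because each task depends only on its own input and chunk, the number of independent tasks grows with both the number of inputs and the number of chunks per input.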