DD Pipeline Usage

Introduction

The general idea remains the same as for the DI pipeline, with a couple of key differences. Firstly, we begin with a calibration stage, which requires a user-provided initial sky model. Secondly, no model or calibrated visibilities are ever written: calibration solutions are written out by DP3’s DDECal, and applied on the fly by WSClean during imaging. Rather than writing out model visibilities, WSClean returns a list of clean components with their coordinates and fluxes, which are used internally in the next calibration stage to generate model visibilities via a direct predict.

A single self-calibration cycle consists of the following stages:

  • Create the requested number of calibration patches / facets, making sure each of them contains enough apparent source flux. A few different strategies for doing this are available.

  • Save the vertex coordinates of the calibration patches as a DS9 region file.

  • Determine in which calibration patch each source in the latest sky model lies. The resulting information is written to a sky model file in sourcedb format, to be fed to DP3’s DDECal.

  • Launch DP3’s DDECal; it writes out a solution table as an HDF5 file that follows a prescribed layout.

  • Launch WSClean, which takes as inputs the measurement set, the calibration solution table, and the facet region file written previously. WSClean writes out a set of FITS files, along with the list of clean components it identified, in sourcedb format.

  • This list of sources is post-processed (filtered and clustered) before being fed to the calibration stage of the next cycle.

Configuration

Overview

The pipeline needs a configuration file in YAML format to specify various options, including options to be provided to DDECal and WSClean. While the pipeline does forcibly set some of these options, many are freely adjustable by the user, and independently so for each selfcal cycle.

The idea here is to provide an “expert interface” where many parameters are adjustable, as we expect that a lot of experimentation is going to be necessary to find a good calibration strategy for SKA Mid.

Note

The configuration file schema will likely be subject to backwards-incompatible changes in the future, as we are still early in the development process.

Config validation

JSON Schema is used to enforce the validity of configuration files. On startup, the pipeline immediately raises an error if parameter names are misspelled, or if invalid parameter values are provided. Individual DP3 and WSClean parameters are covered by these checks.

Note

The schema validation should catch most syntax mistakes, but it does not prevent the user from specifying “scientifically bad” combinations of parameters to DP3 or WSClean, for example.

Additionally, a command-line app is provided to manually check a configuration file; see the Additional Apps page for details. We highly recommend using it before submitting jobs to an HPC cluster.

Examples

We maintain valid and documented configuration file examples in the config/ directory of the repository. A model configuration file is reproduced below.

For full details on the tweakable DDECal and WSClean parameters, please refer to the official DP3 and WSClean documentation.

###############################################################################
# Model DD-selfcal pipeline configuration file
# Please adjust parameters to your own data and wishes
#
# Adjusted for the AA2 Mid simulated datasets (0.76s sampling time and
# independent corrupting gains in each channel)
###############################################################################

# Maximum fractional bandwidth that can be imaged as a single sub-band.
# If the fractional bandwidth of the input data exceeds that value,
# WSClean's wideband deconvolution options will be enabled, and -channels-out
# will be set to an appropriately chosen value.
# NOTE: Setting this to 2.0 will disable wideband deconvolution on any data.
max_fractional_bandwidth: 0.05

# Custom parameters for the initial imaging stage used to infer an initial sky
# model for self-calibration.
# Only runs if no sky model is provided to the pipeline.
initial_imaging:
  # Override some default WSClean parameters
  wsclean: {}

# List of dictionaries carrying the parameters for each of the desired
# self-calibration cycles.
selfcal_cycles:
  # Cycle 1
  - tesselation:
      # Method to use to create the calibration patch / facet boundaries
      # on the sky.
      # Either "square_grid", "voronoi_brightest" or "kmeans" (recommended)
      method: kmeans

      # Number of calibration patches / facets
      num_patches: 8

    # Override default DDECal parameters
    # This section is a dictionary where the keys can be any official
    # DDECal option; the values must be acceptable values for said option.
    # Provide values in their "natural" data type (e.g. nchan is an int, mode
    # is a string).
    ddecal:
      solve.mode: scalarphase
      # That's almost exactly one solution every minute for AA2 Mid datasets
      solve.solint: 80
      solve.nchan: 1
      solve.propagatesolutions: true
      solve.propagateconvergedonly: true
      solve.solveralgorithm: hybrid

    # Override default WSClean parameters
    # Must be a dictionary of accepted WSClean options; options have
    # an attached value, or list of values, to be provided in their "natural"
    # data type.
    wsclean:
      # Command-line options without an associated value, like -multiscale,
      # are enabled like this
      multiscale: true
      niter: 1_000_000
      mgain: 0.8
      weight: ["briggs", +0.5]
      auto-mask: 30.0
      auto-threshold: 1.0

  # Cycle 2
  - tesselation:
      method: kmeans
      num_patches: 8

    ddecal:
      solve.mode: scalar
      solve.solint: 80
      solve.nchan: 1
      solve.propagatesolutions: true
      solve.propagateconvergedonly: true
      solve.solveralgorithm: hybrid

    wsclean:
      multiscale: true
      niter: 1_000_000
      mgain: 0.8
      weight: ["briggs", +0.5]
      auto-mask: 5.0
      auto-threshold: 1.0

  # Add more cycles below, or remove cycles as desired.

Wideband deconvolution

Wideband deconvolution is the method employed by WSClean to properly take into account the fact that the flux of sources may vary significantly as a function of frequency. When enabled, WSClean internally separates the data into several sub-bands which are imaged independently but then jointly deconvolved.

The method is explained in the WSClean documentation.

The wideband deconvolution parameters are automatically set by the pipeline based on the configuration parameter max_fractional_bandwidth. Fractional bandwidth is the ratio between total bandwidth and centre frequency.

The behaviour is:

  • If the fractional bandwidth of the data exceeds max_fractional_bandwidth, choose the WSClean parameter -channels-out so that every imaging sub-band has a fractional bandwidth below that threshold.

  • Otherwise, do not enable wideband deconvolution.
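One possible way to choose the number of output channels under this rule can be sketched as follows. This is a minimal sketch only: it assumes equal-width sub-bands and defines each sub-band's fractional bandwidth with respect to its own centre frequency; the function name is illustrative, not the pipeline's actual code.

```python
def num_output_channels(freq_min_hz: float, freq_max_hz: float,
                        max_frac_bw: float) -> int:
    """Smallest number of equal-width imaging sub-bands such that every
    sub-band's fractional bandwidth (width / sub-band centre frequency)
    falls below max_frac_bw. Returns 1 when no splitting is needed, in
    which case wideband deconvolution stays disabled."""
    total_bw = freq_max_hz - freq_min_hz
    nsub = 1
    while True:
        width = total_bw / nsub
        # The lowest sub-band has the smallest centre frequency and thus
        # the largest fractional bandwidth; checking it alone suffices.
        if width / (freq_min_hz + width / 2) < max_frac_bw:
            return nsub
        nsub += 1
```

With this sketch, 100 MHz of bandwidth centred on 1 GHz (fractional bandwidth 0.1) combined with max_fractional_bandwidth: 0.05 would be imaged with three output channels.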

Description of parameters

The help text for the app can be obtained by running mid-selfcal-dd --help, and is reproduced below:

usage: mid-selfcal-dd [-h] [--version] [--singularity-image SINGULARITY_IMAGE] [--dask-scheduler DASK_SCHEDULER]
                      [--mpi-hosts MPI_HOSTS [MPI_HOSTS ...]] [--outdir OUTDIR] --config CONFIG [--sky-model SKY_MODEL] --num-pixels
                      NUM_PIXELS --pixel-scale PIXEL_SCALE
                      input_ms

Launch the SKA Mid direction-dependent self-calibration pipeline

positional arguments:
  input_ms              Input measurement set.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --singularity-image SINGULARITY_IMAGE
                        Optional path to a singularity image file with both WSClean and DP3 installed. If specified, run WSClean and
                        DP3 inside singularity containers; otherwise, run them on bare metal. (default: None)
  --dask-scheduler DASK_SCHEDULER
                        Optional dask scheduler address to which to submit jobs. If specified, any eligible pipeline step will be
                        distributed on the associated Dask cluster. (default: None)
  --mpi-hosts MPI_HOSTS [MPI_HOSTS ...]
                        List of hostnames on which to run MPI-eligible pipeline step. If the list is of length 1 or less, MPI is not
                        used. (default: None)
  --outdir OUTDIR       Directory path in which to write data products and temporary files; it will be created if necessary, with
                        all its parents. If not specified, create a uniquely named subdir in the current working directory, named
                        selfcal_YYYYMMDD_HHMMSS_<microseconds> (default: None)
  --config CONFIG       Path to the pipeline configuration file. (default: None)
  --sky-model SKY_MODEL
                        Optional path to bootstrap sky model file in sourcedb format. If provided, use this sky model to bootstrap
                        calibration. Otherwise, the pipeline will run an initial imaging stage to make such a bootstrap sky model.
                        NOTE: the source fluxes must be **apparent** fluxes, that is with the primary beam attenuation already
                        applied. (default: None)
  --num-pixels NUM_PIXELS
                        Output image size in pixels. (default: None)
  --pixel-scale PIXEL_SCALE
                        Scale of a pixel in arcseconds. (default: None)

Input and output paths

The input measurement set file is specified as the only positional argument.

The user may specify an output directory in which the pipeline stores its products, as well as any temporary files created by DP3 and WSClean. If --outdir is not provided, a uniquely-named sub-directory of the current working directory is created. Currently, the naming pattern is selfcal_YYYYMMDD_HHMMSS_<microseconds>, i.e. based on the date and time at which the pipeline started processing. If --outdir points to a directory that does not exist yet, it is created along with its parent directories.
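The timestamped default name can be reproduced with a short sketch (the function name is hypothetical, not the pipeline's actual code; strftime's %f field renders microseconds zero-padded to six digits):

```python
from datetime import datetime

def default_outdir_name(now: datetime) -> str:
    # Documented pattern: selfcal_YYYYMMDD_HHMMSS_<microseconds>
    return now.strftime("selfcal_%Y%m%d_%H%M%S_%f")

print(default_outdir_name(datetime(2024, 3, 1, 12, 0, 5, 42)))
# → selfcal_20240301_120005_000042
```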

Optional: running WSClean and DP3 within singularity containers

By default, the pipeline assumes that WSClean and DP3 are installed on the host and attempts to run them on bare metal. Alternatively, it is possible to specify a singularity image file via --singularity-image inside which both WSClean and DP3 are expected to be installed. In that case, the pipeline will run WSClean and DP3 inside singularity containers spun up from that image file, i.e. automatically generate and execute the right singularity commands with the appropriate bind mount points.

Optional: dask distribution

The pipeline runs on a single node by default, but may use an existing multi-node dask cluster to distribute some of the computation. To enable this, pass the address of the dask scheduler via --dask-scheduler as a string HOSTNAME:PORT. The only processing stage currently eligible for dask distribution is the calibration stage, where every dask worker processes distinct time chunks of the data. This scales very well with the number of nodes involved.

Note

Launching a dask cluster on your machine or an HPC facility is done separately; for the latter, please follow the instructions on the Running on SLURM page. It is implicitly assumed that all the nodes / dask workers involved have access to a common filesystem, as is customary on most HPC facilities.

Optional: MPI distribution

Imaging is eligible for MPI distribution, where every node processes distinct frequency bands of the data. To enable this, pass the list of host names to use as a space-separated list via --mpi-hosts.

Configuration file

A custom YAML configuration file for the pipeline must be provided (see the section above). All the parameters of DP3’s DDECal and WSClean can be freely tweaked. This is where, for example, one might select:

  • What constraints are made on the Jones matrices (e.g. scalar vs. diagonal vs. full Jones), or the solution intervals in time and frequency.

  • The weighting mode used for imaging (e.g. uniform vs. briggs), whether to use multiscale deconvolution, etc.
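As an illustration, a cycle entry following the schema of the configuration example above might select diagonal solutions and uniform weighting (the values below are hypothetical choices, not recommendations):

```yaml
selfcal_cycles:
  - tesselation:
      method: kmeans
      num_patches: 8
    ddecal:
      # Diagonal Jones solutions instead of scalar phases
      solve.mode: diagonal
      solve.solint: 80
      solve.nchan: 1
    wsclean:
      # Uniform weighting instead of Briggs
      weight: uniform
      multiscale: true
```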

Optional: Initial sky model file

By default, the pipeline starts by imaging the field without any direction-dependent corrections, in order to make an initial sky model to bootstrap direction-dependent selfcal. However, it is possible to skip that stage and instead supply a custom sky model file in sourcedb format via the --sky-model option.

Note

The source fluxes provided in this file must be the apparent fluxes, that is after applying the primary beam attenuation factor.

Image size and scale

The parameters of the output image must be specified via:

  • --num-pixels, the width and height of the image in pixels. Only square images can be made at this time.

  • --pixel-scale, the angular scale of a pixel in arcseconds.

Data Products

Temporary and superfluous files are automatically cleaned up at the end of the run, even in case of a crash. Intermediate outputs for each self-calibration cycle are stored in their own subdirectories; most of the files they contain are preserved for inspection and troubleshooting.

Final data products are written in the base output directory of the pipeline. The following files are always produced in a successful run:

Description of output files

  • config.yml: Copy of the input pipeline configuration file.

  • logfile.txt: Logs of the pipeline and of the subprocesses it executed (DP3 and WSClean in particular).

  • logfile.jsonl: The same logs in JSON Lines format, which is easily machine-parsed.

  • initial_skymodel.txt: Initial sky model used to bootstrap self-calibration, either provided by the user or automatically generated by the initial imaging stage.

  • initial_skymodel.reg: Initial sky model in DS9 region file format.

  • final_image.fits: Output clean image.

  • final_residual.fits: Output residual image.

  • final_skymodel.txt: Output sky model.

  • final_skymodel.reg: Output sky model in DS9 region file format.