DD Pipeline Usage
Introduction
The general idea remains the same as for the DI pipeline, with two main differences. Firstly, we start with a calibration stage, which requires an initial sky model (either provided by the user or made by an initial imaging stage). Secondly, no model visibilities or calibrated visibilities are ever written here; calibration solutions are written out by DP3’s DDECal, and applied on-the-fly by WSClean during imaging. Rather than writing out model visibilities, WSClean returns a list of clean components with their coordinates and fluxes, which are used internally in the next calibration stage to generate model visibilities via a direct predict.
A single self-calibration cycle consists of the following stages:
Create the requested number of calibration patches / facets, making sure each of them contains enough apparent source flux. A few different strategies for doing this are available.
Save the vertex coordinates of the calibration patches as a DS9 region file.
Find in which calibration patch each source in the latest sky model lies. The resulting information is written to a sky model file in sourcedb format to be fed to DP3’s DDECal.
Launch DP3’s DDECal; it writes out a solution table as an HDF5 file that follows a prescribed layout.
Launch WSClean, which takes as inputs the measurement set, the calibration solution table and the facet region file previously written. WSClean writes out a set of FITS files, along with the list of clean components it identified, in sourcedb format.
The list of sources in question is post-processed (filtering and clustering) before being given as an input to the calibration stage in the next cycle.
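The third stage above (finding the patch in which each source lies) can be illustrated with a minimal sketch. It is an illustration only: it works in flat 2D coordinates with a ray-casting point-in-polygon test, whereas the pipeline deals with sky coordinates and may well use a different method. The names `point_in_polygon` and `assign_sources`, and the example patches and sources, are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, vertices: List[Point]) -> bool:
    """Ray-casting test: cast a horizontal ray from `point` towards +x
    and count how many polygon edges it crosses (odd means inside)."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_sources(
    sources: Dict[str, Point], patches: Dict[str, List[Point]]
) -> Dict[str, Optional[str]]:
    """Map each source name to the first patch containing it, or None."""
    return {
        name: next(
            (p for p, verts in patches.items() if point_in_polygon(pos, verts)),
            None,
        )
        for name, pos in sources.items()
    }


# Hypothetical example: two rectangular patches side by side
patches = {
    "patch_0": [(0, 0), (5, 0), (5, 10), (0, 10)],
    "patch_1": [(5, 0), (10, 0), (10, 10), (5, 10)],
}
sources = {"src_a": (2.0, 3.0), "src_b": (7.5, 9.0), "src_c": (20.0, 20.0)}
assignment = assign_sources(sources, patches)
```

Here `src_c` lies outside both patches and is mapped to `None`; how the pipeline treats such sources is not covered by this sketch.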
Configuration
Overview
The pipeline needs a configuration file in YAML format to specify various options, including options to be provided to DDECal and WSClean. While the pipeline does forcibly set some of these options, many are freely adjustable by the user, and independently so for each selfcal cycle.
The idea here is to provide an “expert interface” where many parameters are adjustable, as we expect that a lot of experimentation is going to be necessary to find a good calibration strategy for SKA Mid.
Note
The configuration file schema will likely be subject to backwards-incompatible changes in the future, as we are still early in the development process.
Config validation
JSON Schema is used to enforce validity of configuration files. On startup, the pipeline will immediately throw an error if parameter names are misspelled, or if incorrect choices of parameter values are provided. Individual parameters to DP3 and WSClean are covered by those checks.
Note
The schema validation should cover most possible syntax mistakes, but it does not prevent you from specifying “scientifically bad” combinations of parameters to DP3 or WSClean, for example.
Additionally, a command-line app is provided to manually check a configuration file. See the Additional Apps page for details. We highly recommend using it before submitting jobs on an HPC cluster.
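To give a feel for the kind of mistake that is caught at startup, here is a hand-rolled check in plain Python. It is a stand-in for illustration, not the pipeline's JSON Schema nor its validation app: `check_cycle` is a hypothetical helper, the requirement that `num_patches` be a positive integer is our assumption, and only the key names and tessellation methods are taken from the model configuration file on this page.

```python
# Key names and method choices as they appear in the model configuration file
ALLOWED_CYCLE_KEYS = {"tesselation", "ddecal", "wsclean"}
ALLOWED_METHODS = {"square_grid", "voronoi_brightest", "kmeans"}


def check_cycle(cycle: dict) -> list:
    """Return a list of human-readable problems found in one selfcal cycle."""
    problems = []
    # Misspelled or unknown section names
    for key in cycle:
        if key not in ALLOWED_CYCLE_KEYS:
            problems.append(f"unknown key: {key!r}")
    # Incorrect choice of parameter value
    tess = cycle.get("tesselation", {})
    method = tess.get("method")
    if method not in ALLOWED_METHODS:
        problems.append(f"invalid tesselation method: {method!r}")
    # Assumption: the number of patches must be a positive integer
    num_patches = tess.get("num_patches")
    if isinstance(num_patches, bool) or not isinstance(num_patches, int) or num_patches < 1:
        problems.append(f"num_patches must be a positive integer, got {num_patches!r}")
    return problems


good = {"tesselation": {"method": "kmeans", "num_patches": 8}, "ddecal": {}, "wsclean": {}}
bad = {"tesselation": {"method": "k-means", "num_patches": 8}, "wsclnea": {}}

for problem in check_cycle(bad):
    print(problem)
```

The real validation goes further than this, since individual DP3 and WSClean parameters are also covered by the schema.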
Examples
We maintain valid and documented configuration file examples in the config/ directory of the repository. A model configuration file is reproduced below.
For full details on tweakable DDECal and WSClean parameters, please refer to:
The WSClean documentation, and the command-line help of the wsclean app.
###############################################################################
# Model DD-selfcal pipeline configuration file
# Please adjust parameters to your own data and wishes
#
# Adjusted for the AA2 Mid simulated datasets (0.76s sampling time and
# independent corrupting gains in each channel)
###############################################################################

# Maximum fractional bandwidth that can be imaged as a single sub-band.
# If the fractional bandwidth of the input data exceeds that value,
# WSClean's wideband deconvolution options will be enabled, and -channels-out
# will be set to an appropriately chosen value.
# NOTE: Setting this to 2.0 will disable wideband deconvolution on any data.
max_fractional_bandwidth: 0.05

# Custom parameters for the initial imaging stage used to infer an initial sky
# model for self-calibration.
# Only runs if no sky model is provided to the pipeline.
initial_imaging:
  # Override some default WSClean parameters
  wsclean: {}

# List of dictionaries carrying the parameters for each of the desired
# self-calibration cycles.
selfcal_cycles:
  # Cycle 1
  - tesselation:
      # Method to use to create the calibration patch / facet boundaries
      # on the sky.
      # Either "square_grid", "voronoi_brightest" or "kmeans" (recommended)
      method: kmeans
      # Number of calibration patches / facets
      num_patches: 8

    # Override default DDECal parameters
    # This section is a dictionary where the keys can be any official
    # DDECal option; the values must be acceptable value for said option.
    # Provide values in their "natural" data type (e.g. nchan is an int, mode
    # is a string).
    ddecal:
      solve.mode: scalarphase
      # That's almost exactly one solution every minute for AA2 Mid datasets
      solve.solint: 80
      solve.nchan: 1
      solve.propagatesolutions: true
      solve.propagateconvergedonly: true
      solve.solveralgorithm: hybrid

    # Override default WSClean parameters
    # Must be a dictionary of accepted WSClean options; options have
    # an attached value, or list of values, to be provided in their "natural"
    # data type.
    wsclean:
      # Command-line options without an associated value, like -multiscale,
      # are enabled like this
      multiscale: true
      niter: 1_000_000
      mgain: 0.8
      weight: ["briggs", +0.5]
      auto-mask: 30.0
      auto-threshold: 1.0

  # Cycle 2
  - tesselation:
      method: kmeans
      num_patches: 8
    ddecal:
      solve.mode: scalar
      solve.solint: 80
      solve.nchan: 1
      solve.propagatesolutions: true
      solve.propagateconvergedonly: true
      solve.solveralgorithm: hybrid
    wsclean:
      multiscale: true
      niter: 1_000_000
      mgain: 0.8
      weight: ["briggs", +0.5]
      auto-mask: 5.0
      auto-threshold: 1.0

  # Add more cycles below, or remove cycles as desired.
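To make the wsclean section of the file above more concrete, the sketch below shows one plausible way such a dictionary could be turned into WSClean command-line arguments. This is our illustration of the convention described in the comments (a `true` value enables a flag, a list provides several values), not the pipeline's actual code; the function name `wsclean_args` is hypothetical, and dropping options set to `false` is an assumption.

```python
def wsclean_args(options: dict) -> list:
    """Translate a dictionary of WSClean options into command-line arguments."""
    args = []
    for name, value in options.items():
        # bool must be tested first, since bool is a subclass of int in Python
        if isinstance(value, bool):
            # Options without an associated value, e.g. -multiscale.
            # Assumption: an option set to false is simply left out.
            if value:
                args.append(f"-{name}")
        elif isinstance(value, (list, tuple)):
            # Options taking several values, e.g. -weight briggs 0.5
            args.append(f"-{name}")
            args.extend(str(v) for v in value)
        else:
            # Options taking a single value, e.g. -niter 1000000
            args.extend([f"-{name}", str(value)])
    return args


options = {
    "multiscale": True,
    "niter": 1_000_000,
    "weight": ["briggs", 0.5],
    "auto-mask": 30.0,
}
print(" ".join(wsclean_args(options)))
```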
Wideband deconvolution
Wideband deconvolution is the method employed by WSClean to properly take into account the fact that the flux of sources may vary significantly as a function of frequency. When enabled, WSClean internally separates the data into several sub-bands which are imaged independently but then jointly deconvolved.
The method is explained in the WSClean documentation.
The wideband deconvolution parameters are automatically set by the pipeline based on the configuration parameter max_fractional_bandwidth.
Fractional bandwidth is the ratio between total bandwidth and centre frequency.
The behaviour is:
If the fractional bandwidth of the data exceeds max_fractional_bandwidth, choose the WSClean parameter -channels-out so that every imaging sub-band has a fractional bandwidth below that threshold.
Otherwise, do not enable wideband deconvolution.
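The behaviour described above can be sketched as follows. The exact rule used by the pipeline to pick -channels-out is not stated here, so this is only one way of satisfying the stated condition: we assume equal-width sub-bands, look for the smallest number of them that works, and treat a sub-band sitting exactly at the threshold as acceptable. The function names are hypothetical.

```python
from typing import Optional


def fractional_bandwidth(f_min_hz: float, f_max_hz: float) -> float:
    """Ratio between total bandwidth and centre frequency."""
    return (f_max_hz - f_min_hz) / (0.5 * (f_min_hz + f_max_hz))


def choose_channels_out(
    f_min_hz: float, f_max_hz: float, max_fractional_bandwidth: float
) -> Optional[int]:
    """Return a number of imaging sub-bands, or None when wideband
    deconvolution does not need to be enabled."""
    if fractional_bandwidth(f_min_hz, f_max_hz) <= max_fractional_bandwidth:
        return None
    num_subbands = 2
    while True:
        width = (f_max_hz - f_min_hz) / num_subbands
        # With equal-width sub-bands, the lowest one has the smallest centre
        # frequency and hence the largest fractional bandwidth.
        if fractional_bandwidth(f_min_hz, f_min_hz + width) <= max_fractional_bandwidth:
            return num_subbands
        num_subbands += 1


# Hypothetical 100 MHz wide band centred on 1 GHz: fractional bandwidth 0.1
print(choose_channels_out(950e6, 1050e6, 0.05))
```

Note that dividing the total fractional bandwidth by the threshold is not quite enough: in the example above, two sub-bands would leave the lower one at about 0.051, just over the limit.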
Description of parameters
The help text for the app can be obtained by running mid-selfcal-dd --help, and is reproduced below:
usage: mid-selfcal-dd [-h] [--version] [--singularity-image SINGULARITY_IMAGE] [--dask-scheduler DASK_SCHEDULER]
                      [--mpi-hosts MPI_HOSTS [MPI_HOSTS ...]] [--outdir OUTDIR] --config CONFIG [--sky-model SKY_MODEL] --num-pixels
                      NUM_PIXELS --pixel-scale PIXEL_SCALE
                      input_ms

Launch the SKA Mid direction-dependent self-calibration pipeline

positional arguments:
  input_ms              Input measurement set.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --singularity-image SINGULARITY_IMAGE
                        Optional path to a singularity image file with both WSClean and DP3 installed. If specified, run WSClean and
                        DP3 inside singularity containers; otherwise, run them on bare metal. (default: None)
  --dask-scheduler DASK_SCHEDULER
                        Optional dask scheduler address to which to submit jobs. If specified, any eligible pipeline step will be
                        distributed on the associated Dask cluster. (default: None)
  --mpi-hosts MPI_HOSTS [MPI_HOSTS ...]
                        List of hostnames on which to run MPI-eligible pipeline step. If the list is of length 1 or less, MPI is not
                        used. (default: None)
  --outdir OUTDIR       Directory path in which to write data products and temporary files; it will be created if necessary, with
                        all its parents. If not specified, create a uniquely named subdir in the current working directory, named
                        selfcal_YYYYMMDD_HHMMSS_<microseconds> (default: None)
  --config CONFIG       Path to the pipeline configuration file. (default: None)
  --sky-model SKY_MODEL
                        Optional path to bootstrap sky model file in sourcedb format. If provided, use this sky model to bootstrap
                        calibration. Otherwise, the pipeline will run an initial imaging stage to make such a bootstrap sky model.
                        NOTE: the source fluxes must be **apparent** fluxes, that is with the primary beam attenuation already
                        applied. (default: None)
  --num-pixels NUM_PIXELS
                        Output image size in pixels. (default: None)
  --pixel-scale PIXEL_SCALE
                        Scale of a pixel in arcseconds. (default: None)
Input and output paths
The input measurement set file is specified as the only positional argument.
The user has the choice of specifying an output directory for the pipeline to store its products, as well as any temporary files created by DP3 and WSClean.
If --outdir is not provided, a uniquely-named sub-directory of the current working directory will be created. Currently, the naming pattern is selfcal_YYYYMMDD_HHMMSS_<MICROSECONDS>, i.e. based on the date and time the pipeline started processing.
If --outdir is a directory that does not exist yet, it is created with its parent directories.
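The output directory rules above can be mimicked with the Python standard library, as in the sketch below. The function names are hypothetical, and whether the pipeline uses local time or UTC for the directory name is not specified here; we assume local time.

```python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def default_outdir_name(now: datetime) -> str:
    """Directory name following selfcal_YYYYMMDD_HHMMSS_<microseconds>."""
    # %f is the microsecond field, zero-padded to 6 digits
    return now.strftime("selfcal_%Y%m%d_%H%M%S_%f")


def resolve_outdir(outdir: Optional[str]) -> Path:
    """Return the output directory, creating it and its parents if needed."""
    if outdir is None:
        # Assumption: name based on local time at startup
        path = Path.cwd() / default_outdir_name(datetime.now())
    else:
        path = Path(outdir)
    path.mkdir(parents=True, exist_ok=True)
    return path


# Example in a throwaway location: "runs" and "test_01" do not exist yet
with tempfile.TemporaryDirectory() as tmp:
    created = resolve_outdir(str(Path(tmp) / "runs" / "test_01"))
    created_ok = created.is_dir()

print(default_outdir_name(datetime(2024, 1, 2, 3, 4, 5, 678)))
```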
Optional: running WSClean and DP3 within singularity containers
By default, the pipeline assumes that WSClean and DP3 are installed on the host and attempts to run them on bare metal.
Alternatively, it is possible to specify a singularity image file via --singularity-image, inside which both WSClean and DP3 are expected to be installed.
In that case, the pipeline runs WSClean and DP3 inside singularity containers spun up from that image file, i.e. it automatically generates and executes the right singularity commands with the appropriate bind mount points.
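As an illustration of what such a generated command may look like, the sketch below wraps an arbitrary command in `singularity exec` with `--bind` mount points. It is not the pipeline's code: the helper name `singularity_command`, the image file name and the choice of directories to bind are all hypothetical.

```python
from typing import List, Sequence


def singularity_command(
    image: str, command: Sequence[str], bind_dirs: Sequence[str]
) -> List[str]:
    """Wrap `command` so that it runs inside a container made from `image`,
    with each directory in `bind_dirs` mounted at the same path inside."""
    args = ["singularity", "exec"]
    for directory in bind_dirs:
        # --bind takes a SOURCE:DESTINATION pair
        args += ["--bind", f"{directory}:{directory}"]
    args.append(image)
    args.extend(command)
    return args


# Hypothetical example: make the data and output directories visible
cmd = singularity_command(
    "tools.sif", ["wsclean", "-version"], ["/data", "/scratch/out"]
)
print(" ".join(cmd))
```

The resulting list can be passed as-is to `subprocess.run`.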
Optional: dask distribution
The pipeline runs on a single node by default, but may use an existing multi-node dask cluster to distribute some of the computation.
To enable this, pass the address of the dask scheduler via --dask-scheduler as a string HOSTNAME:PORT.
The only processing stage currently eligible for dask distribution is the calibration stage, where every dask worker processes distinct time chunks of the data.
This scales very well with the number of nodes involved.
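To picture what "distinct time chunks" may mean, the sketch below splits a range of time slots into contiguous chunks, one per worker. How the pipeline actually chooses chunk boundaries is not documented here; aligning them on multiples of the solution interval is our assumption, and `split_time_chunks` is a hypothetical name.

```python
from typing import List, Tuple


def split_time_chunks(
    num_timeslots: int, num_workers: int, solint: int
) -> List[Tuple[int, int]]:
    """Split [0, num_timeslots) into at most `num_workers` contiguous
    (start, stop) ranges whose boundaries are multiples of `solint`."""
    # Ceiling divisions, written with integer arithmetic only
    num_intervals = -(-num_timeslots // solint)
    intervals_per_worker = -(-num_intervals // num_workers)
    chunk_length = intervals_per_worker * solint
    chunks = []
    start = 0
    while start < num_timeslots:
        stop = min(start + chunk_length, num_timeslots)
        chunks.append((start, stop))
        start = stop
    return chunks


# Hypothetical example: 1000 time slots, 4 workers, solution interval of 80
print(split_time_chunks(1000, 4, 80))
```

Each chunk can then be calibrated independently of the others, which is what makes this stage easy to distribute.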
Note
Launching a dask cluster on your machine or an HPC facility is done separately; for the latter, please follow the instructions on the Running on SLURM page. It is implicitly assumed that all the nodes / dask workers involved have access to a common filesystem, as is customary on most HPC facilities.
Optional: MPI distribution
Imaging is eligible for MPI distribution, where every node processes distinct frequency bands of the data.
To enable this, pass the list of host names to use as a space-separated list via --mpi-hosts.
Configuration file
A custom YAML configuration file for the pipeline must be provided (see the Configuration section above). All the parameters of DP3’s DDECal and WSClean can be freely tweaked. This is where one might select, for example:
Which constraints are imposed on the Jones matrices (e.g. scalar vs. diagonal vs. full Jones), and the solution intervals in time and frequency.
The weighting mode used for imaging (e.g. uniform vs. briggs), whether to use multiscale deconvolution, etc.
Optional: Initial sky model file
By default, the pipeline starts by imaging the field without any direction-dependent corrections, in order to make an initial sky model to bootstrap direction-dependent selfcal. However, it is possible to skip that stage and instead supply a custom sky model file in sourcedb format via the --sky-model option.
Note
The source fluxes provided in this file must be the apparent fluxes, that is after applying the primary beam attenuation factor.
Image size and scale
The parameters of the output image must be specified via:
--num-pixels, the width and height of the image in pixels. Only square images can be made at this time.
--pixel-scale, the angular scale of a pixel in arcseconds.
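The two parameters together fix the angular width of the image. The short sketch below computes it and assembles a matching command line using only options listed in the help text above; the file names and parameter values are placeholders, and `field_of_view_deg` is a hypothetical helper.

```python
import shlex


def field_of_view_deg(num_pixels: int, pixel_scale_asec: float) -> float:
    """Width of a square image in degrees (3600 arcseconds per degree)."""
    return num_pixels * pixel_scale_asec / 3600.0


num_pixels = 7200
pixel_scale = 1.0  # arcseconds

# Placeholder file names; replace with your own configuration and data
cmd = [
    "mid-selfcal-dd",
    "--config", "config.yml",
    "--num-pixels", str(num_pixels),
    "--pixel-scale", str(pixel_scale),
    "input.ms",
]
command_line = shlex.join(cmd)

print(f"Field of view: {field_of_view_deg(num_pixels, pixel_scale)} deg")
print(command_line)
```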
Data Products
Temporary and superfluous files are automatically cleaned up at the end of the run, even in case of a crash. Intermediate outputs for each self-calibration cycle are stored in their own subdirectories; most of the files they contain are preserved for inspection and troubleshooting.
Final data products are written in the base output directory of the pipeline. The following files are always produced in a successful run:
File Name | Description |
---|---|
config.yml | Copy of the input pipeline configuration file |
logfile.txt | Logs of the pipeline and of the subprocesses it executed (DP3 and WSClean in particular) |
logfile.jsonl | Same log file in JSON lines format that is easily machine-parsed |
initial_skymodel.txt | Initial sky model used to bootstrap self-cal, either provided by the user or automatically generated in the initial imaging stage |
initial_skymodel.reg | Initial sky model in DS9 region file format |
final_image.fits | Output clean image |
final_residual.fits | Output residual image |
final_skymodel.txt | Output sky model |
final_skymodel.reg | Output sky model in DS9 region file format |