Quickstart

Follow the instructions below to process a CASA Measurement Set with the Batch Preprocessing Pipeline (BPP) on your local machine.

Prerequisites

An installation of the SKA Batch Preprocessing Pipeline, set up by following Installation.
A CASA Measurement Set, preferably no larger than a few GB.

Note

LOFAR, MeerKAT and OSKAR-simulated Measurement Sets should work. Avoid VLA datasets, as they are typically not regular enough to be compatible with the pipeline: they may contain multiple spectral windows and observed fields.

Estimated time

10 minutes.

Steps

Follow these steps to run the Batch Preprocessing Pipeline.

1. Activate the environment

Activate the environment so that the pipeline commands are globally available:
cd <BPP_REPOSITORY>   # where you previously cloned the repository
source .venv/bin/activate
Verify that this is the case by running:
ska-sdp-batch-preprocess --help
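
The same check can be scripted, for instance at the top of a notebook or driver script. The following is a minimal sketch that only looks the command up on PATH using the Python standard library; the helper name `find_command` is hypothetical and not part of the pipeline.

```python
import shutil
from typing import Optional

# Command name taken from the quickstart above.
PIPELINE_COMMAND = "ska-sdp-batch-preprocess"


def find_command(name: str = PIPELINE_COMMAND) -> Optional[str]:
    """Return the full path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


path = find_command()
if path is None:
    print(f"{PIPELINE_COMMAND} not found; is the environment activated?")
else:
    print(f"{PIPELINE_COMMAND} found at {path}")
```

If the command is not found, re-run the activation step in the same shell session.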

2. Create a working directory structure for the pipeline run

This is where you will store the configuration file and the pipeline’s outputs.
cd <BASE_DIRECTORY>  # wherever you like
mkdir bpp_tutorial
cd bpp_tutorial
The pipeline also needs an empty directory to store its outputs. Create it now:
mkdir output

3. Write a configuration file

The pre-processing steps to apply are defined via a YAML configuration file. Copy and paste the following into a file named config.yaml in the current directory. Here we keep it simple, just flagging a range of observing frequencies and averaging the data.

Note

Feel free to adjust the flagged frequency range to suit your particular dataset.
steps:
    # Flag the 150.00 – 155.42 MHz band
    - step: preflagger
      frequency_ranges_mhz:
        - {start: 150.00, stop: 155.42}
    # Average visibilities in time and frequency by integer factors
    - step: averager
      timestep: 4
      freqstep: 4
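
If you want to try several frequency ranges, the configuration above can be generated rather than edited by hand. The sketch below reproduces exactly the structure and keys shown above using plain string formatting; the function `render_config` is a hypothetical helper, and the input checks are precautions of this sketch rather than documented pipeline rules.

```python
def render_config(start_mhz: float, stop_mhz: float,
                  timestep: int = 4, freqstep: int = 4) -> str:
    """Render the quickstart configuration for a given flagged band.

    Only the keys shown in the quickstart are used.
    """
    if not start_mhz < stop_mhz:
        raise ValueError("start_mhz must be smaller than stop_mhz")
    if timestep < 1 or freqstep < 1:
        raise ValueError("timestep and freqstep must be positive integers")
    return (
        "steps:\n"
        f"    # Flag the {start_mhz:.2f} - {stop_mhz:.2f} MHz band\n"
        "    - step: preflagger\n"
        "      frequency_ranges_mhz:\n"
        f"        - {{start: {start_mhz:.2f}, stop: {stop_mhz:.2f}}}\n"
        "    # Average visibilities in time and frequency by integer factors\n"
        "    - step: averager\n"
        f"      timestep: {timestep}\n"
        f"      freqstep: {freqstep}\n"
    )


# Show the default quickstart configuration.
print(render_config(150.00, 155.42))
```

To save the result, write the returned string to config.yaml, for example with `pathlib.Path("config.yaml").write_text(...)`.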

4. Run the pipeline

Execute the pipeline by providing the configuration file, the empty output directory and the input CASA Measurement Set:
ska-sdp-batch-preprocess run -c config.yaml -o output/ /path/to/my_dataset.ms
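
When processing several datasets, it can be convenient to launch the same command from Python. This is a minimal sketch using only the command-line form shown above (`run`, `-c`, `-o` and the input path); `build_command` and `run_pipeline` are hypothetical helpers, and the empty-directory check is a precaution of this sketch that mirrors the instructions, not a documented behaviour of the pipeline.

```python
import subprocess
from pathlib import Path


def build_command(config, output_dir, input_ms):
    """Assemble the command line used in the quickstart as a list of strings."""
    return [
        "ska-sdp-batch-preprocess", "run",
        "-c", str(config),
        "-o", str(output_dir),
        str(input_ms),
    ]


def run_pipeline(config, output_dir, input_ms):
    """Run the pipeline, refusing to start if the output directory is not empty."""
    out = Path(output_dir)
    # The quickstart asks for an existing, empty output directory.
    if not out.is_dir() or any(out.iterdir()):
        raise ValueError(f"{out} must be an existing, empty directory")
    # check=True raises CalledProcessError if the pipeline exits with an error.
    subprocess.run(build_command(config, out, input_ms), check=True)
```

Calling `run_pipeline("config.yaml", "output", "/path/to/my_dataset.ms")` is then equivalent to the shell command above.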

5. Check the pipeline ran successfully

Once the run completes, the output directory should contain the following:
$ ls output/

config.yaml
dask-report.html
my_dataset_flagging_report.png
my_dataset_flagging_report.zarr
my_dataset.ms
pipeline.log
task-list.json
That is:

The pre-processed visibilities as a CASA Measurement Set with the exact same name as the input

An RFI flagging report as an xarray Dataset object, with a .zarr extension

A summary plot of the RFI flagging report with a .png extension

A copy of the configuration file used for the run

Logs of the pipeline

Additional diagnostic outputs related to the execution engine Dask
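
The check above can also be automated. The sketch below derives the expected file names from the input Measurement Set name and reports any that are missing. The naming pattern is inferred from the single example listing above, so treat it as an assumption; the exact set of outputs may differ with the pipeline version or the configured steps. `expected_outputs` and `missing_outputs` are hypothetical helpers.

```python
from pathlib import Path


def expected_outputs(input_ms):
    """List the output names from the quickstart listing for a given input MS."""
    ms = Path(input_ms)
    return [
        "config.yaml",
        "dask-report.html",
        # Assumed pattern: <input name without .ms>_flagging_report.<ext>
        f"{ms.stem}_flagging_report.png",
        f"{ms.stem}_flagging_report.zarr",
        ms.name,  # pre-processed visibilities, same name as the input
        "pipeline.log",
        "task-list.json",
    ]


def missing_outputs(output_dir, input_ms):
    """Return the expected output names that are absent from output_dir."""
    out = Path(output_dir)
    return [name for name in expected_outputs(input_ms)
            if not (out / name).exists()]
```

For the run above, `missing_outputs("output", "/path/to/my_dataset.ms")` returns an empty list when every file in the listing is present.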

Next steps

To go further, you may want to:

Read the Configuration Guide and learn how to use more advanced steps.
Learn about the internals of the code: start from Introduction.
Process larger datasets on the AWS DP cluster; see Running on AWS.