Quickstart
Follow the instructions below to process a CASA Measurement Set with the Batch Preprocessing Pipeline (BPP) on your local machine.
Prerequisites
The SKA Batch Preprocessing Pipeline, installed by following Installation.
A CASA Measurement Set, preferably no larger than a few GB.
Note
LOFAR, MeerKAT and OSKAR-simulated Measurement Sets should work. Avoid VLA datasets: they are typically not regular enough to be compatible with the pipeline, as they may contain multiple spectral windows and observed fields.
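To check a dataset against the size guideline above, recall that a CASA Measurement Set is a directory, so its size is the total of its contents. The following is a minimal sketch using du; ms_size_mib is a hypothetical helper name and not part of the pipeline.

```shell
# Hypothetical helper: print the on-disk size of a Measurement Set
# (a directory) in whole MiB.
ms_size_mib() {
    # du -sk reports the total in units of 1024 bytes
    kib=$(du -sk "$1" | awk '{print $1}')
    echo $(( kib / 1024 ))
}

# Usage:
#   ms_size_mib /path/to/my_dataset.ms
```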
Estimated time
10 minutes.
Steps
Follow these steps to run the Batch Preprocessing Pipeline.
1. Activate the environment
Activate the environment so that the pipeline commands are globally available:
cd <BPP_REPOSITORY>  # where you previously cloned the repository
source .venv/bin/activate
Verify that this is the case by running:
ska-sdp-batch-preprocess --help
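If the help command is not found, the environment is most likely not active in the current shell. The sketch below shows one way to make that check explicit, for example at the top of a script; require_cmd is a hypothetical helper name and not something the pipeline provides.

```shell
# Hypothetical helper: report whether a command is reachable on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1 (is the virtual environment activated?)" >&2
        return 1
    fi
}

# Usage, after activating the environment:
#   require_cmd ska-sdp-batch-preprocess
```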
2. Create a working directory structure for the pipeline run
This is where you will store the configuration file and the pipeline’s outputs.
cd <BASE_DIRECTORY>  # wherever you like
mkdir bpp_tutorial
cd bpp_tutorial
The pipeline also needs an empty directory to store its outputs; let’s create it now:
mkdir output
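When re-running the tutorial, the output directory may already contain files from a previous attempt. The sketch below creates the directory if needed and then checks that it is empty. ensure_empty_dir is a hypothetical helper name; how the pipeline itself reacts to a non-empty directory is not covered here.

```shell
# Hypothetical helper: create a directory if needed, then fail if it
# already contains anything (including hidden files).
ensure_empty_dir() {
    mkdir -p "$1" || return 1
    if [ -n "$(ls -A "$1")" ]; then
        echo "error: $1 is not empty" >&2
        return 1
    fi
}

# Usage:
#   ensure_empty_dir output
```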
3. Write a configuration file
The pre-processing steps to apply are defined via a YAML configuration file. Copy and paste the following into a file named config.yaml in the current directory. Here we keep it simple, just flagging a range of observing frequencies and averaging the data.
Note
Feel free to tweak the flagged frequency range to your particular dataset.
steps:
  # Flag the 150.00 – 155.42 MHz band
  - step: preflagger
    frequency_ranges_mhz:
      - {start: 150.00, stop: 155.42}
  # Average visibilities in time and frequency by integer factors
  - step: averager
    timestep: 4
    freqstep: 4
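To get a feel for what the averaging factors do to the data volume, the arithmetic can be worked through directly. The input dimensions below are hypothetical and chosen to divide evenly by the factors. How leftover samples are handled when they do not divide evenly is not addressed here.

```shell
# Averaging factors taken from the example configuration
timestep=4
freqstep=4

# Hypothetical input dimensions, chosen to divide evenly by the factors
n_times=400
n_chans=64

out_times=$(( n_times / timestep ))
out_chans=$(( n_chans / freqstep ))
reduction=$(( timestep * freqstep ))

echo "time samples: $n_times -> $out_times"
echo "channels:     $n_chans -> $out_chans"
echo "visibility count reduced by a factor of $reduction"
```

With these numbers, 400 time samples become 100 and 64 channels become 16, so the output holds 16 times fewer visibilities than the input.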
4. Run the pipeline
Execute the pipeline by providing the configuration file, the empty output directory and the input CASA Measurement Set:
ska-sdp-batch-preprocess run -c config.yaml -o output/ /path/to/my_dataset.ms
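A mistyped path is an easy mistake to make at this step. The sketch below checks the two inputs before starting a run; preflight is a hypothetical helper name and not part of the pipeline's command-line interface.

```shell
# Hypothetical helper: sanity-check the inputs before starting a run.
# Arguments: <config file> <input Measurement Set>
preflight() {
    if [ ! -r "$1" ]; then
        echo "error: cannot read configuration file: $1" >&2
        return 1
    fi
    # A CASA Measurement Set is stored on disk as a directory of tables.
    if [ ! -d "$2" ]; then
        echo "error: not a Measurement Set directory: $2" >&2
        return 1
    fi
    echo "inputs look fine"
}

# Usage, chained with the run command from this step:
#   preflight config.yaml /path/to/my_dataset.ms &&
#       ska-sdp-batch-preprocess run -c config.yaml -o output/ /path/to/my_dataset.ms
```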
5. Check the pipeline ran successfully
Once the run completes, the output directory should contain the following:
$ ls output/
config.yaml  dask-report.html  my_dataset_flagging_report.png  my_dataset_flagging_report.zarr  my_dataset.ms  pipeline.log  task-list.json
That is:
The pre-processed visibilities as a CASA Measurement Set with the exact same name as the input
An RFI flagging report as an xarray Dataset object, with a .zarr extension
A summary plot of the RFI flagging report with a .png extension
A copy of the configuration file used for the run
Logs of the pipeline
Additional diagnostic outputs related to the execution engine Dask
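The same check can be scripted instead of done by eye. The sketch below looks for each file name from the listing above, given the output directory and the input dataset's base name; check_outputs is a hypothetical helper name.

```shell
# Hypothetical helper: verify that the expected outputs exist.
# Arguments: <output directory> <dataset name without the .ms extension>
check_outputs() {
    outdir=$1
    stem=$2
    missing=0
    for name in config.yaml dask-report.html \
        "${stem}_flagging_report.png" "${stem}_flagging_report.zarr" \
        "${stem}.ms" pipeline.log task-list.json
    do
        if [ ! -e "$outdir/$name" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    # Returns 0 if everything was found, 1 otherwise
    return "$missing"
}

# Usage:
#   check_outputs output my_dataset && echo "all outputs present"
```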
Next steps
To go further, you may want to:
Read the Configuration Guide and learn how to use more advanced steps.
Learn about the internals of the code: start from Introduction.
Process larger datasets on the AWS DP cluster; see Running on AWS.