Running on AWS

Follow the instructions below to process a CASA Measurement Set with the batch pre-processing pipeline (BPP) on the AWS DP cluster.

Prerequisites

  • Access to the AWS cluster and basic proficiency in its usage: see SDP Pipelines Cookbook.

  • BPP tutorial (Quickstart).

  • A CASA Measurement Set on the AWS cluster.

Estimated time

10 minutes – excluding queue wait time and processing time.

Steps

Log into the AWS DP cluster head node and follow the instructions below.

1. Create a personal directory on the Lustre partition

Large-scale processing must be done on the shared Lustre partition at /shared/fsx1. Create a personal directory there if you haven’t already:

mkdir -p /shared/fsx1/<MY_USER_NAME>
cd /shared/fsx1/<MY_USER_NAME>

2. Create a working directory for the pipeline run

Once inside your personal directory:

mkdir bpp_aws_tutorial
cd bpp_aws_tutorial

3. Write a configuration file

The pre-processing steps to apply are defined in a YAML configuration file. Copy the following into a file named config.yaml in the current directory. This example keeps things simple: it flags a range of observing frequencies and averages the data.

Note

Adjust the flagged frequency range to suit your dataset.

steps:
    # Flag the 150.00 – 155.42 MHz band
    - step: preflagger
      frequency_ranges_mhz:
        - {start: 150.00, stop: 155.42}
    # Average visibilities in time and frequency by integer factors
    - step: averager
      timestep: 4
      freqstep: 4
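
If you would rather create the file from the command line than paste it into an editor, a here-document is one option. The sketch below writes the same settings as the listing above; the quoted 'EOF' delimiter keeps the shell from expanding anything inside the block. The final grep is only a convenience check and is not part of the pipeline.

```shell
# Write the configuration to config.yaml in the current directory.
# Quoting the delimiter ('EOF') disables variable and command expansion.
cat > config.yaml <<'EOF'
steps:
    # Flag the 150.00 to 155.42 MHz band
    - step: preflagger
      frequency_ranges_mhz:
        - {start: 150.00, stop: 155.42}
    # Average visibilities in time and frequency by integer factors
    - step: averager
      timestep: 4
      freqstep: 4
EOF

# Convenience check: list the step entries that were written.
grep -n -e '- step:' config.yaml
```

YAML is indentation-sensitive, so keep the leading spaces exactly as shown.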

4. Get the BPP SLURM script

Download the latest template SLURM script from the BPP repository:

wget https://gitlab.com/ska-telescope/sdp/science-pipeline-workflows/ska-sdp-batch-preprocess/-/raw/main/scripts/user/aws_bpp.sh

5. Submit the SLURM job to run the pipeline

The SLURM script takes its arguments as environment variables; see the documentation inside the script for details. Assuming the input is at /path/to/my_dataset.ms, you can submit the job as follows:

sbatch --nodes=1 --time=02:00:00 \
    --export=ALL,WORKERS_PER_NODE=1,DATASET=/path/to/my_dataset.ms \
    aws_bpp.sh
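
A mistyped dataset path only surfaces once the job has left the queue, so it can be worth checking the input before submitting. The sketch below is a hypothetical wrapper, not part of the BPP: is_measurement_set and submit_bpp are names invented here. It relies on a CASA Measurement Set being a directory that contains a table.dat file, and on the sbatch --parsable option, which prints just the job ID (plus the cluster name on multi-cluster setups).

```shell
# Return success if the path looks like a CASA Measurement Set:
# a directory that contains a main table file named table.dat.
is_measurement_set() {
    [ -d "$1" ] && [ -f "$1/table.dat" ]
}

# Hypothetical wrapper: validate the input, then submit with the same
# options as the command above. --parsable makes sbatch print the job ID
# on its own, which helps locate the output files later.
submit_bpp() {
    dataset=$1
    if ! is_measurement_set "$dataset"; then
        echo "Not a Measurement Set: $dataset" >&2
        return 1
    fi
    sbatch --parsable --nodes=1 --time=02:00:00 \
        --export=ALL,WORKERS_PER_NODE=1,DATASET="$dataset" \
        aws_bpp.sh
}

# Usage: job_id=$(submit_bpp /path/to/my_dataset.ms)
```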

6. Watch the job progress in the queue

watch -n5 squeue --me

You should see a Job ID and an “ST” column that denotes the state of the job.

  • “CF” means “configuring”, i.e. a node is being prepared for the run

  • “R” means “running”
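
Other state codes can appear as well. As a rough aid, the sketch below maps a few common compact codes to short descriptions and prints one line per job you own. The function names describe_state and my_jobs are invented for illustration; the full list of codes is in the JOB STATE CODES section of the squeue manual.

```shell
# Translate a compact squeue state code into a short description.
describe_state() {
    case "$1" in
        PD) echo "pending: waiting for resources" ;;
        CF) echo "configuring: a node is being prepared" ;;
        R)  echo "running" ;;
        CG) echo "completing" ;;
        *)  echo "other: see JOB STATE CODES in the squeue manual" ;;
    esac
}

# Print the state of every job you own, one line per job.
# %i is the job ID and %t the compact state code; -h drops the header.
my_jobs() {
    squeue --me -h -o '%i %t' | while read -r job_id state; do
        echo "$job_id: $(describe_state "$state")"
    done
}

# Usage: my_jobs
```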

7. Check that the pipeline is running

Soon after the job starts running, three items should appear in the working directory: a base output directory, a standard output file and a standard error file, each named after the Job ID. Assuming you got Job ID 4173:

$ ls
bpp_4173
bpp.4173.out
bpp.4173.err

You can monitor the pipeline run progress by watching the pipeline log file:

$ watch -n5 tail -n40 bpp_4173/output/pipeline.log
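
If the log file never appears or the job ends early, the standard error file is a reasonable place to look next. The helper below is a small sketch: show_stderr is a name invented here, and it assumes the bpp.<JOB_ID>.err naming shown above and that you are in the directory the job was submitted from. A non-empty file does not necessarily indicate a failure.

```shell
# Show the last lines of the job's standard error file, if it has content.
# $1: the Job ID, e.g. 4173
show_stderr() {
    err_file="bpp.$1.err"
    if [ -s "$err_file" ]; then
        tail -n 20 "$err_file"
    else
        echo "$err_file is empty or missing"
    fi
}

# Usage: show_stderr 4173
```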

8. Check that the pipeline ran successfully

Once the run completes, the output directory should contain the following:

$ ls bpp_4173/output
config.yaml
dask-report.html
my_dataset_flagging_report.png
my_dataset_flagging_report.zarr
my_dataset.ms
pipeline.log
task-list.json

That is:

  • The pre-processed visibilities as a CASA Measurement Set with the exact same name as the input

  • An RFI flagging report, saved as an xarray Dataset with a .zarr extension

  • A summary plot of the RFI flagging report with a .png extension

  • A copy of the configuration file used for the run

  • Logs of the pipeline

  • Additional diagnostic outputs related to the execution engine Dask
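
When checking several runs, the list above can be turned into a quick presence test. This is a minimal sketch rather than an official BPP tool: check_bpp_outputs is a hypothetical helper, and the file names are taken from the example listing above, so adjust them if your version of the pipeline produces different outputs. It only confirms that the files exist, not that their contents are valid.

```shell
# Check that every expected output exists in a finished run's output directory.
# $1: output directory, e.g. bpp_4173/output
# $2: dataset name without the .ms extension, e.g. my_dataset
# Prints one "missing: <name>" line per absent item; returns 1 if any is absent.
check_bpp_outputs() {
    out_dir=$1
    name=$2
    status=0
    for item in \
        config.yaml \
        dask-report.html \
        "${name}_flagging_report.png" \
        "${name}_flagging_report.zarr" \
        "${name}.ms" \
        pipeline.log \
        task-list.json
    do
        if [ ! -e "$out_dir/$item" ]; then
            echo "missing: $item"
            status=1
        fi
    done
    return "$status"
}

# Usage: check_bpp_outputs bpp_4173/output my_dataset && echo "all outputs present"
```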

Next steps

To go further, you may want to:

  • Read the Configuration Guide and learn how to use more advanced steps.

  • Learn about the internals of the code: start from Introduction.

  • Use frequency chunking with a more compute-heavy configuration that includes the AOFlagger or Demixer steps. Such workloads justify the use of more nodes and/or workers per node.