Running on AWS
Follow the instructions below to process a CASA Measurement Set with the batch pre-processing pipeline (BPP) on the AWS DP cluster.
Prerequisites
Access to the AWS cluster and basic proficiency in its usage: see SDP Pipelines Cookbook.
BPP tutorial (Quickstart).
A CASA Measurement Set on the AWS cluster.
Estimated time
10 minutes – excluding queue wait time and processing time.
Steps
Log into the AWS DP cluster head node and follow the instructions below.
1. Create a personal directory on the lustre partition
Large-scale processing must be done on the shared lustre partition at /shared/fsx1. Create yourself a personal directory there if you haven’t already:
mkdir -p /shared/fsx1/<MY_USER_NAME>
cd /shared/fsx1/<MY_USER_NAME>
2. Create a working directory for the pipeline run
Once inside your personal working directory:
mkdir bpp_aws_tutorial
cd bpp_aws_tutorial
3. Write a configuration file
The pre-processing steps to apply are defined via a YAML configuration file. Copy-paste the following to a file named config.yaml in the current directory. Here we keep it simple, just flagging a range of observing frequencies and averaging the data.
Note: Feel free to tweak the flagged frequency range to your particular dataset.
steps:
  # Flag the 150.00 – 155.42 MHz band
  - step: preflagger
    frequency_ranges_mhz:
      - {start: 150.00, stop: 155.42}
  # Average visibilities in time and frequency by integer factors
  - step: averager
    timestep: 4
    freqstep: 4
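As a rough guide to what the averager settings above do to the data volume, the sketch below computes output dimensions for integer averaging factors. The input sizes (600 time samples, 3456 channels) are made-up illustration values, and rounding up a trailing partial bin is an assumption about how leftover samples are handled, not something this tutorial states.

```shell
#!/usr/bin/env bash
# Sketch: estimate output dimensions after averaging by integer factors.
# Assumption: a trailing partial bin is still written out (ceiling division).

# ceil_div N K -> smallest integer >= N / K
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

n_times=600      # hypothetical number of time samples in the input
n_chans=3456     # hypothetical number of frequency channels in the input
timestep=4       # as in config.yaml
freqstep=4       # as in config.yaml

out_times=$(ceil_div "$n_times" "$timestep")
out_chans=$(ceil_div "$n_chans" "$freqstep")
echo "time samples: $n_times -> $out_times"
echo "channels:     $n_chans -> $out_chans"
```

With both factors set to 4, the number of visibilities drops by roughly a factor of 16.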
4. Get the BPP SLURM script
Copy over the latest template SLURM script provided in the BPP repository.
wget https://gitlab.com/ska-telescope/sdp/science-pipeline-workflows/ska-sdp-batch-preprocess/-/raw/main/scripts/user/aws_bpp.sh
5. Submit the SLURM job to run the pipeline
The above SLURM script takes arguments in the form of environment variables. You can read the documentation inside the script for details. Assuming the input is at /path/to/my_dataset.ms, you can submit the job as follows:
sbatch --nodes=1 --time=02:00:00 \
    --export=ALL,WORKERS_PER_NODE=1,DATASET=/path/to/my_dataset.ms \
    aws_bpp.sh
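If you want to reuse the Job ID in later steps, one option is to wrap the submission in a small function. This is a minimal sketch: the function name submit_bpp and its optional arguments are hypothetical, and it relies on the standard sbatch --parsable flag, which prints only the Job ID (optionally followed by ";cluster").

```shell
#!/usr/bin/env bash
# Sketch: submit the BPP job and capture its Job ID for later steps.
# submit_bpp is a hypothetical helper, not part of the BPP repository.

# submit_bpp DATASET [NODES] [WALLTIME] -> prints the Job ID
submit_bpp() {
    local dataset="$1" nodes="${2:-1}" walltime="${3:-02:00:00}"
    local reply
    # --parsable makes sbatch print "jobid" or "jobid;cluster" only
    reply=$(sbatch --parsable --nodes="$nodes" --time="$walltime" \
        --export=ALL,WORKERS_PER_NODE=1,DATASET="$dataset" \
        aws_bpp.sh) || return 1
    echo "${reply%%;*}"   # drop the optional ";cluster" suffix
}

# Usage (on the head node, from the working directory):
#   job_id=$(submit_bpp /path/to/my_dataset.ms)
#   echo "Submitted job $job_id"
```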
6. Watch the job progress in the queue
watch -n5 squeue --me
You should see a Job ID and a “ST” column that denotes the state of the job.
“CF” means “configuring”, i.e. a node is being prepared for the run
“R” means “running”
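For scripted checks, the state codes above can be translated into words. The sketch below is illustrative only: describe_state and job_state are hypothetical helper names, and the extra codes PD (pending) and CG (completing) are standard Slurm states that this tutorial does not mention.

```shell
#!/usr/bin/env bash
# Sketch: translate compact Slurm state codes into plain words.
# Both function names are hypothetical helpers for illustration.

describe_state() {
    case "$1" in
        "") echo "not in queue" ;;
        PD) echo "pending" ;;
        CF) echo "configuring" ;;
        R)  echo "running" ;;
        CG) echo "completing" ;;
        *)  echo "other ($1)" ;;
    esac
}

# job_state JOB_ID -> compact state code, empty once the job has left the queue
job_state() {
    squeue --noheader --jobs="$1" --format=%t 2>/dev/null
}

# Usage:
#   describe_state "$(job_state 4173)"
```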
7. Check the pipeline runs
Soon after the job starts running, you should see three things created: a base output directory, a standard output file and a standard error file, each named after the Job ID. Assuming you got Job ID 4173:
$ ls
bpp_4173 bpp.4173.out bpp.4173.err
You can monitor the pipeline run progress by watching the pipeline log file:
$ watch -n5 tail -n40 bpp_4173/output/pipeline.log
8. Check the pipeline ran successfully
Once the run completes, the output directory should contain the following:
$ ls bpp_4173/output
config.yaml dask-report.html my_dataset_flagging_report.png my_dataset_flagging_report.zarr my_dataset.ms pipeline.log task-list.json
That is:
The pre-processed visibilities as a CASA Measurement Set with the exact same name as the input
An RFI flagging report as an xarray Dataset object, with a .zarr extension
A summary plot of the RFI flagging report with a .png extension
A copy of the configuration file used for the run
Logs of the pipeline
Additional diagnostic outputs related to the execution engine Dask
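The list above can be turned into a quick presence check. This is a sketch under the assumption that file names follow the pattern shown in the example listing; check_bpp_outputs is a hypothetical helper, and it only tests that the files exist, not that their contents are valid.

```shell
#!/usr/bin/env bash
# Sketch: check that a finished run left the expected files behind.
# check_bpp_outputs is a hypothetical helper, not part of the BPP repository.

# check_bpp_outputs JOB_ID DATASET_STEM -> returns 0 if everything is present
check_bpp_outputs() {
    local out_dir="bpp_$1/output" stem="$2" missing=0 item
    for item in \
        config.yaml dask-report.html pipeline.log task-list.json \
        "${stem}.ms" "${stem}_flagging_report.png" "${stem}_flagging_report.zarr"
    do
        if [ ! -e "$out_dir/$item" ]; then
            echo "missing: $out_dir/$item" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (from the working directory, for Job ID 4173 and my_dataset.ms):
#   check_bpp_outputs 4173 my_dataset && echo "all outputs present"
```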
Next steps
To go further, you may want to:
Read the Configuration Guide and learn how to use more advanced steps.
Learn about the internals of the code: start from Introduction.
Use frequency chunking with a more compute-heavy configuration that includes the AOFlagger or Demixer steps. This will justify the use of more nodes and/or workers per node.