.. _aws:

**************
Running on AWS
**************

Follow the instructions below to process a CASA Measurement Set with the batch
pre-processing pipeline (BPP) on the AWS DP cluster.

Prerequisites
=============

- Access to the AWS cluster and basic proficiency in its usage: see `SDP Pipelines Cookbook `_.
- BPP tutorial (:ref:`quickstart`).
- A CASA Measurement Set on the AWS cluster.

Estimated time
==============

10 minutes -- excluding queue wait time and processing time.

Steps
=====

Log into the AWS DP cluster head node and follow the instructions below.

**1. Create a personal directory on the lustre partition**

Large-scale processing must be done on the shared lustre partition at ``/shared/fsx1``.
Create yourself a personal directory there if you haven't already:

.. code-block:: bash

    mkdir -p /shared/fsx1/
    cd /shared/fsx1/

**2. Create a working directory for the pipeline run**

Once inside your personal directory:

.. code-block:: bash

    mkdir bpp_aws_tutorial
    cd bpp_aws_tutorial

**3. Write a configuration file**

The pre-processing steps to apply are defined via a YAML configuration file.
Copy-paste the following to a file named ``config.yaml`` in the current directory.
Here we keep it simple, just flagging a range of observing frequencies and averaging the data.

.. note::

    Feel free to tweak the flagged frequency range to your particular dataset.

.. code-block:: yaml

    steps:
      # Flag the 150.00 – 155.42 MHz band
      - step: preflagger
        frequency_ranges_mhz:
          - {start: 150.00, stop: 155.42}

      # Average visibilities in time and frequency by integer factors
      - step: averager
        timestep: 4
        freqstep: 4

**4. Get the BPP SLURM script**

Copy over the latest `template SLURM script provided in the BPP repository `_.

.. code-block:: bash

    wget https://gitlab.com/ska-telescope/sdp/science-pipeline-workflows/ska-sdp-batch-preprocess/-/raw/main/scripts/user/aws_bpp.sh

**5. Submit the SLURM job to run the pipeline**

The above SLURM script takes arguments in the form of environment variables.
You can read the documentation inside the script for details.
Assuming the input is at ``/path/to/my_dataset.ms``, you can submit the job as follows:

.. code-block:: bash

    sbatch --nodes=1 --time=02:00:00 \
        --export=ALL,WORKERS_PER_NODE=1,DATASET=/path/to/my_dataset.ms \
        aws_bpp.sh

**6. Watch the job progress in the queue**

.. code-block:: bash

    watch -n5 squeue --me

You should see a Job ID and a "ST" column that denotes the state of the job.

- "CF" means "configuring", i.e. a node is being prepared for the run
- "R" means "running"

**7. Check the pipeline runs**

Soon after the job starts running, you should see three things created: a base output
directory, a standard output file and a standard error file, each carrying the Job ID.
Assuming you got Job ID 4173:

.. code-block:: bash

    $ ls
    bpp_4173  bpp.4173.out  bpp.4173.err

You can monitor the pipeline run progress by watching the pipeline log file:

.. code-block:: bash

    $ watch -n5 tail -n40 bpp_4173/output/pipeline.log

**8. Check the pipeline ran successfully**

Once the run completes, the output directory should contain the following:

.. code-block:: bash

    $ ls bpp_4173/output
    config.yaml
    dask-report.html
    my_dataset_flagging_report.png
    my_dataset_flagging_report.zarr
    my_dataset.ms
    pipeline.log
    task-list.json

That is:

- The pre-processed visibilities as a CASA Measurement Set with the exact same name as the input
- An RFI flagging report as an ``xarray`` Dataset object, with a ``.zarr`` extension
- A summary plot of the RFI flagging report with a ``.png`` extension
- A copy of the configuration file used for the run
- Logs of the pipeline
- Additional diagnostic outputs related to the execution engine Dask

Next steps
==========

To go further, you may want to:

- Read the :ref:`configuration` and learn how to use more advanced steps.
- Learn about the internals of the code: start from :ref:`pipeline_intro`.
- Use frequency chunking with a more compute-heavy configuration that includes the AOFlagger or Demixer steps. This will justify the use of more nodes and/or workers per node.
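
As a complement to step 8, the check of the output directory can be scripted. The sketch below is not part of the BPP: ``check_bpp_outputs`` is a hypothetical helper name, and the list of expected files is simply copied from the listing in step 8, so adjust it if your run produces a different set.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the BPP): verify that a finished run
# produced the files listed in step 8 of this tutorial.
# Usage: check_bpp_outputs <output_dir> <dataset_stem>
check_bpp_outputs() {
    local outdir="$1" stem="$2" missing=0 name
    for name in \
        config.yaml dask-report.html pipeline.log task-list.json \
        "${stem}.ms" \
        "${stem}_flagging_report.png" \
        "${stem}_flagging_report.zarr"
    do
        # -e accepts both plain files and directories (.ms and .zarr are directories)
        if [ ! -e "${outdir}/${name}" ]; then
            echo "missing: ${name}"
            missing=1
        fi
    done
    # Exit status 0 if everything is present, 1 otherwise
    return "${missing}"
}
```

With the Job ID and dataset name used in this tutorial, the call would be ``check_bpp_outputs bpp_4173/output my_dataset``. It prints one line per missing entry and returns a non-zero status if anything is absent.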