How to move and store data on AWS

This guide explains how to transfer data between the cluster and S3.

Prerequisites

  • An account on the AWS DP HPC cluster

Steps

  1. Create a project directory on shared storage

To keep your data organized, create a project-specific directory on the shared storage:

mkdir -p /shared/fsx1/<your-project>/
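
As a minimal sketch, the directory creation can be wrapped in a small helper that also checks the result is writable. The function name make_project_dir is illustrative and not part of the cluster setup.

```shell
# Hypothetical helper: create a project directory under a base path
# and confirm that it is writable before any data is transferred.
make_project_dir() {
    local base="$1" project="$2"
    local dir="${base%/}/${project}"   # strip one trailing slash from base
    mkdir -p "$dir" || return 1
    [ -w "$dir" ] || { echo "not writable: $dir" >&2; return 1; }
    echo "$dir"
}

# Example (on the cluster): make_project_dir /shared/fsx1 "<your-project>"
```

The helper prints the resulting path, so it can be captured in a variable and reused in later transfer commands.
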
  2. Transfer data between S3 and shared storage

The aws s3 commands below can be run on the head node or inside a Slurm job script. Use aws s3 sync to transfer whole directories and aws s3 cp for individual files.
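
For the job-script case, the following sketch writes a small Slurm batch script that syncs a dataset from S3 to shared storage. The file name stage_in_job.sh, the job name, the time limit, and the log file name are placeholder assumptions; this guide does not specify partitions or resource limits, so none are set here.

```shell
# Write an example Slurm job script that stages a dataset in from S3.
# The job name, time limit and log file name are placeholder values.
cat > stage_in_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=s3-stage-in
#SBATCH --time=01:00:00
#SBATCH --output=s3-stage-in-%j.log

# Stop on the first error so a failed transfer fails the job.
set -euo pipefail

SRC="s3://skao-sdp-testdata/path/to/dataset/"
DST="/shared/fsx1/<your-project>/dataset/"

mkdir -p "$DST"
aws s3 sync "$SRC" "$DST"
EOF
```

After replacing the placeholder paths, the script would be submitted with sbatch stage_in_job.sh.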

Before transferring data, run the command with --dryrun to verify the source and destination paths. With this flag, the AWS CLI prints the operations it would perform without copying anything.

aws s3 cp --dryrun s3://skao-sdp-testdata/path/to/file.ms /shared/fsx1/<your-project>/
aws s3 sync --dryrun /shared/fsx1/<your-project>/dataset/ s3://skao-sdp-testdata/path/to/dataset/
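
One way to make the dry run a habit is to wrap it in a helper that always previews first. This is a sketch only: the function name preview_then_sync and the CONFIRM environment variable are hypothetical conventions, not AWS CLI features.

```shell
# Hypothetical helper: always show the dry run first, and perform the
# real transfer only when CONFIRM=yes is set in the environment.
preview_then_sync() {
    local src="$1" dst="$2"
    echo "Planned operations:"
    aws s3 sync --dryrun "$src" "$dst" || return 1
    if [ "${CONFIRM:-no}" = "yes" ]; then
        aws s3 sync "$src" "$dst"
    else
        echo "Dry run only; set CONFIRM=yes to transfer."
    fi
}

# Example: CONFIRM=yes preview_then_sync <source> <destination>
```

Without CONFIRM=yes the helper only prints the planned operations, so a mistyped path shows up before any data moves.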

     1. Copy a single file from S3 to shared storage

aws s3 cp s3://skao-sdp-testdata/path/to/file.ms /shared/fsx1/<your-project>/
     2. Copy a single file from shared storage to S3

aws s3 cp /shared/fsx1/<your-project>/output.fits s3://skao-sdp-testdata/path/to/output.fits
     3. Sync a directory from S3 to shared storage

This downloads any files in the S3 prefix that are missing from the local directory (or that have changed):

aws s3 sync s3://skao-sdp-testdata/path/to/dataset/ /shared/fsx1/<your-project>/dataset/
     4. Sync a directory from shared storage to S3

This uploads any files in the local directory that are missing from the S3 prefix (or that have changed):

aws s3 sync /shared/fsx1/<your-project>/dataset/ s3://skao-sdp-testdata/path/to/dataset/

Note

aws s3 sync only copies files that are new or modified. It does not delete files in the destination that have been removed from the source unless you add the --delete flag. Use --delete with care.
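
Because --delete removes data at the destination, it can help to list the planned deletions before running it for real. The sketch below combines --delete with --dryrun and keeps only the deletion lines. The function name preview_deletions is hypothetical, and the filter assumes the CLI reports each planned deletion on a line containing "delete:".

```shell
# Hypothetical helper: list what `aws s3 sync --delete` would remove
# from the destination, without deleting or copying anything.
# Assumes dry-run output marks deletions with the text "delete:".
preview_deletions() {
    local src="$1" dst="$2"
    # `|| true` keeps the exit status zero when nothing would be deleted.
    aws s3 sync --dryrun --delete "$src" "$dst" | grep 'delete:' || true
}

# Example: preview_deletions <source> <destination>
```

An empty result suggests that adding --delete would not remove anything for that source and destination pair.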

Verification

To check that the expected files arrived, list the contents of the destination:

  • For S3:

    aws s3 ls s3://skao-sdp-testdata/path/to/dataset/
    
  • For shared storage:

    ls -lh /shared/fsx1/<your-project>/dataset/
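
For larger datasets, reading the listings by eye becomes impractical. As a rough sketch, the object count under the S3 prefix can be compared with the local file count. The function names below are hypothetical, and the parsing assumes the "Total Objects:" summary line that aws s3 ls prints when --summarize is given.

```shell
# Hypothetical helpers: compare the number of objects under an S3
# prefix with the number of regular files in a local directory.
s3_object_count() {
    aws s3 ls --recursive --summarize "$1" | awk '/Total Objects:/ {print $3}'
}

local_file_count() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Succeeds (exit status 0) only when the two counts are equal.
counts_match() {
    local remote local_n
    remote=$(s3_object_count "$1")
    local_n=$(local_file_count "$2")
    echo "S3 objects: $remote, local files: $local_n"
    [ "$remote" = "$local_n" ]
}

# Example: counts_match <s3-prefix> <local-directory>
```

Matching counts are a quick sanity check rather than proof of integrity: they say nothing about file contents, and zero-byte folder marker objects in the bucket, if present, would be counted on the S3 side only.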