How to start an interactive compute node on AWS

The headnode should not be used for running compute-intensive tasks. Instead, you should start an interactive compute node on AWS and run your tasks there.

Prerequisites

  • An account on AWS and access to the headnode.

Steps

  1. Log into the headnode.

  2. Decide what resources you need for your interactive job (partition, number of CPUs, amount of memory, maximum walltime).

    Note

    You can check the available nodes from the headnode with the command:

    sinfo -N -l
    

    Note

    Use the default spot partition (any-7i-24xl-spt) where possible. See the partition documentation for details of the other available partitions.

    Warning

    Always set a maximum walltime to avoid your job running indefinitely.

  3. Start an interactive node with the srun command. For example, to request one node with 8 CPUs and 64 GB of memory for one hour:

    srun --partition=any-7i-24xl-spt --nodes=1 --ntasks=1 --cpus-per-task=8 --mem=64G --time=01:00:00 --pty bash -i
    

    Note

    Run srun from inside a tmux or screen session on the headnode, so that your interactive job is not lost if your connection drops.
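
The steps above can be wrapped in a small shell function so that the walltime is never forgotten. This is only a sketch: the name start_interactive and its argument order are hypothetical, and the defaults simply repeat the example values from step 3.

```shell
# Hypothetical helper: start_interactive [partition] [cpus] [mem] [walltime]
# Defaults mirror the example in step 3; adjust them to your needs.
start_interactive() {
    # --time is always passed, so the job cannot run indefinitely.
    srun --partition="${1:-any-7i-24xl-spt}" --nodes=1 --ntasks=1 \
         --cpus-per-task="${2:-8}" --mem="${3:-64G}" \
         --time="${4:-01:00:00}" --pty bash -i
}
```

For example, start_interactive any-7i-24xl-spt 4 16G 00:30:00 would request 4 CPUs and 16 GB of memory for 30 minutes, while start_interactive with no arguments repeats the example from step 3.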

Verification

After running the srun command, you should see a prompt indicating that you are now on the interactive compute node (e.g. user.name@any-7i-24xl-spt-dy-compute-1). You can verify this by checking the hostname:

hostname
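
The hostname check can be combined with a look at the SLURM_JOB_ID environment variable, which SLURM sets for processes running inside a job. The function name describe_session below is hypothetical and the sketch is illustrative only.

```shell
# Hypothetical helper: report whether this shell is running inside a SLURM job.
# SLURM_JOB_ID is set by SLURM inside a job and is normally unset on the headnode.
describe_session() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        printf 'Inside SLURM job %s on %s\n' "$SLURM_JOB_ID" "$(hostname)"
    else
        printf 'Not inside a SLURM job; current host is %s\n' "$(hostname)"
    fi
}
```

Running describe_session before and after the srun command should give different messages if the interactive node started correctly.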

FAQ

Q: Can I start multiple interactive compute nodes at the same time?

A: Yes, you can start multiple interactive compute nodes by running multiple srun commands in separate terminal sessions or tmux windows.
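
When several interactive jobs are running, it can help to list them from the headnode so that none is left running by accident. A minimal sketch follows; the function name my_jobs is hypothetical.

```shell
# Hypothetical helper: list the current user's jobs in the SLURM queue.
my_jobs() {
    squeue --user="$(id -un)"
}

# To release a node you no longer need, cancel its job by ID:
#   scancel <job_id>
```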

Q: What should I do if my interactive compute node is terminated unexpectedly?

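A: If your interactive compute node is terminated unexpectedly, check the job record for its final state and the reason it ended, using the command below. Nodes in a spot partition can be reclaimed by AWS at short notice, which is one possible cause. You may need to restart your interactive session and re-run your tasks.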

scontrol show job <job_id>
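
scontrol usually keeps a job's record for only a short time after the job ends. If job accounting is enabled on the cluster (an assumption; this guide does not say whether it is), sacct can report on older jobs. The function name job_summary is hypothetical.

```shell
# Hypothetical helper: summarise a finished job from the accounting records.
# Only works if SLURM job accounting is configured on the cluster.
job_summary() {
    sacct --jobs="$1" --format=JobID,JobName,Partition,State,ExitCode,Elapsed
}

# Usage: job_summary <job_id>
```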

Troubleshooting

Check the following if you encounter issues starting an interactive compute node:

  • Ensure that you have specified the correct partition and resources in your srun command.

  • Verify that there are available nodes in the specified partition using sinfo -N -l.

  • Check for any error messages in the SLURM job logs that may indicate resource constraints or other issues.

  • Refer to the SLURM documentation for more detailed information on using SLURM commands and options.
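
The first checks above can be sketched as two small functions, run from a second terminal on the headnode while the srun command is still waiting. The function names are hypothetical, and the default partition is the one used in this guide.

```shell
# Hypothetical helper: list the nodes in one partition and their states.
partition_nodes() {
    sinfo --partition="${1:-any-7i-24xl-spt}" -N -l
}

# Hypothetical helper: show your pending jobs with the reason they are waiting.
# Columns: job ID, partition, state, reason.
pending_reasons() {
    squeue --user="$(id -un)" --states=PENDING --format="%i %P %T %r"
}
```

A reason such as Resources or Priority in the last column generally means the request is valid but is waiting for nodes to become available.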