How to start an interactive compute node on AWS
The headnode should not be used for running compute-intensive tasks. Instead, you should start an interactive compute node on AWS and run your tasks there.
Prerequisites
An account on AWS and access to the headnode.
Steps
Log into the headnode.
Decide what resources you need for your interactive job (partition, number of CPUs, amount of memory, maximum walltime).
Note
You can check the available nodes from the headnode with the command:
sinfo -N -l
Note
Use the default spot partition (any-7i-24xl-spt) where possible. More information on the available partitions can be found here.
Warning
Always set a maximum walltime to avoid your job running indefinitely.
Start an interactive node with the srun command. For example, to start one node for one hour:
srun --partition=any-7i-24xl-spt --nodes=1 --ntasks=1 --cpus-per-task=8 --mem=64G --time=01:00:00 --pty bash -i
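If you start interactive nodes often, the command above can be wrapped in a small shell function. The following is a minimal sketch, not part of SLURM or of this cluster's tooling: the name interactive_node and its argument order are hypothetical, and the defaults simply mirror the example above. The walltime is the only required argument, so a session cannot be started without one.

```shell
# Hypothetical helper (not a SLURM command): start an interactive shell on a
# compute node. Defaults mirror the srun example in this guide.
interactive_node() {
    if [ "$#" -lt 1 ]; then
        echo "usage: interactive_node WALLTIME [CPUS] [MEM] [PARTITION]" >&2
        return 1
    fi
    local walltime=$1
    local cpus=${2:-8}
    local mem=${3:-64G}
    local partition=${4:-any-7i-24xl-spt}

    # Same options as the example command, with the values filled in.
    srun --partition="$partition" --nodes=1 --ntasks=1 \
         --cpus-per-task="$cpus" --mem="$mem" --time="$walltime" \
         --pty bash -i
}
```

For example, interactive_node 02:00:00 16 128G would request 16 CPUs and 128G of memory for two hours on the default spot partition.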
Note
Use tmux or screen within your interactive session to avoid losing work if your connection drops.
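As an illustration of the note above, the sketch below assumes tmux is installed and uses a hypothetical helper name, work_session. It relies on the -A option of tmux new-session, which attaches to the named session if it already exists and creates it otherwise, so the same command works both for starting and for resuming work.

```shell
# Hypothetical helper: create a named tmux session, or reattach to it if it
# already exists. The default session name "interactive" is an arbitrary choice.
work_session() {
    tmux new-session -A -s "${1:-interactive}"
}
```

Detach with Ctrl-b d (the default tmux key binding) and call work_session again after reconnecting. A tmux session started on the headnode before running srun should also keep the srun command itself alive if your SSH connection drops.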
Verification
After running the srun command, you should see a prompt indicating that you are now on the interactive compute node (e.g. user.name@any-7i-24xl-spt-dy-compute-1). You can verify this by checking the hostname:
hostname
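Beyond the hostname, SLURM sets environment variables inside a job that show what was actually allocated. The sketch below is one possible check; the function name show_allocation is hypothetical, while SLURM_JOB_ID, SLURM_JOB_PARTITION and SLURM_CPUS_PER_TASK are standard SLURM variables (the last one is only set when --cpus-per-task was given).

```shell
# Hypothetical helper: print a one-line summary of the current allocation,
# or fail if the shell is not running inside a SLURM job.
show_allocation() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "not inside a SLURM job (SLURM_JOB_ID is unset)" >&2
        return 1
    fi
    echo "host=$(hostname) job=${SLURM_JOB_ID} partition=${SLURM_JOB_PARTITION:-unknown} cpus=${SLURM_CPUS_PER_TASK:-unknown}"
}
```

Run on the headnode outside a job, the function reports an error instead of a summary, which is a quick way to tell the two environments apart.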
FAQ
Q: Can I start multiple interactive compute nodes at the same time?
A: Yes. Run a separate srun command for each node, each in its own terminal session or tmux window.
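When several interactive sessions are open, it helps to see them in one place. The sketch below wraps squeue in a hypothetical helper, my_jobs; the format string is one possible choice (job ID, partition, state, elapsed time, time limit, and node list or pending reason).

```shell
# Hypothetical helper: list the current user's jobs with a compact set of columns.
my_jobs() {
    squeue --user="${USER:-$(id -un)}" --format="%i %P %T %M %l %R"
}
```

A session that is no longer needed can be ended by typing exit in its shell, or with scancel <job_id> from the headnode.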
Q: What should I do if my interactive compute node is terminated unexpectedly?
A: Check the SLURM job record for the job state and any error messages, using the command below with the ID of the terminated job. You will usually need to start a new interactive session and re-run your tasks.
scontrol show job <job_id>
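scontrol generally reports only on jobs that are still running or that finished recently. For an older job, sacct may help, assuming job accounting is enabled on the cluster (an assumption, not something this guide confirms). The helper name job_summary is hypothetical; the columns are standard sacct format fields.

```shell
# Hypothetical helper: summarise a finished job from the accounting database.
# Assumes SLURM accounting is enabled; otherwise sacct returns nothing useful.
job_summary() {
    if [ "$#" -ne 1 ]; then
        echo "usage: job_summary JOB_ID" >&2
        return 1
    fi
    sacct --jobs="$1" --format=JobID,JobName,Partition,State,ExitCode,Elapsed
}
```

The State column is usually the quickest hint as to why a session ended, for example whether it hit its time limit or lost its node.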
Troubleshooting
Check the following if you encounter issues starting an interactive compute node:
Ensure that you have specified the correct partition and resources in your srun command.
Verify that there are available nodes in the specified partition using sinfo -N -l.
Check for any error messages in the SLURM job logs that may indicate resource constraints or other issues.
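The first two checks above can be narrowed to a single partition. The sketch below is one way to do so; check_partition is a hypothetical helper name, and the default partition is the one used throughout this guide.

```shell
# Hypothetical helper: show node states and configured limits for one partition.
check_partition() {
    local partition=${1:-any-7i-24xl-spt}

    # Node-by-node view, restricted to the chosen partition.
    sinfo --partition="$partition" -N -l

    # Partition settings, such as the maximum allowed walltime.
    scontrol show partition "$partition"
}
```

Comparing the limits printed by scontrol with the values passed to srun can reveal a request that the partition cannot satisfy.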
Refer to the SLURM documentation for more detailed information on using SLURM commands and options.