How to start an interactive compute node on AWS
===============================================

The headnode should not be used for running compute-intensive tasks. Instead, you should start an interactive compute node on AWS and run your tasks there.

Related
-------

- :doc:`How to load Spack modules on AWS `
- :doc:`How to use CARTA on AWS `
- `Accessing the clusters (Confluence) `_

Prerequisites
-------------

- An account on AWS and access to the headnode.

Steps
-----

1. Log into the headnode.

2. Decide what resources you need for your interactive job (partition, number of CPUs, amount of memory, maximum walltime).

   .. note::

      You can check the available nodes from the headnode with the command:

      .. code-block:: bash

         sinfo -N -l

   .. note::

      Use the default spot partition (``any-7i-24xl-spt``) where possible. More information on the available partitions can be found `here `_.

   .. warning::

      Always set a maximum walltime to avoid your job running indefinitely.

3. Start an interactive node with the `srun command `_. For example, to start one node for one hour:

   .. code-block:: bash

      srun --partition=any-7i-24xl-spt --nodes=1 --ntasks=1 --cpus-per-task=8 --mem=64G --time=01:00:00 --pty bash -i

   .. note::

      Use `tmux `_ or screen within your interactive session to avoid losing work if your connection drops.

Verification
------------

After running the ``srun`` command, you should see a prompt indicating that you are now on the interactive compute node (e.g. ``user.name@any-7i-24xl-spt-dy-compute-1``). You can verify this by checking the hostname:

.. code-block:: bash

   hostname

FAQ
---

**Q: Can I start multiple interactive compute nodes at the same time?**

A: Yes, you can start multiple interactive compute nodes by running multiple ``srun`` commands in separate terminal sessions or tmux windows.

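
As a rough sketch of the tmux approach, the function below opens one tmux window per interactive node, each running the example ``srun`` command from the steps above. The function name ``open_interactive_windows``, the tmux session name ``interactive`` and the window names ``node1``, ``node2``, ... are hypothetical choices for illustration, not part of the cluster setup. It assumes no tmux session with that name exists yet.

.. code-block:: bash

   # Hypothetical helper: open COUNT tmux windows, each requesting its own node.
   open_interactive_windows() {
       local count="${1:-2}"
       local session="interactive"   # arbitrary tmux session name
       local srun_cmd="srun --partition=any-7i-24xl-spt --nodes=1 --ntasks=1 --cpus-per-task=8 --mem=64G --time=01:00:00 --pty bash -i"
       local i=2

       # The first window is created together with the detached session.
       tmux new-session -d -s "$session" -n node1 "$srun_cmd"

       # Each remaining window starts its own srun request.
       while [ "$i" -le "$count" ]; do
           tmux new-window -t "$session:" -n "node$i" "$srun_cmd"
           i=$((i + 1))
       done
   }

Running ``open_interactive_windows 2`` followed by ``tmux attach -t interactive`` should give two windows, each of which lands on a compute node once its request is granted. A window closes when its ``srun`` session ends.
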

**Q: What should I do if my interactive compute node is terminated unexpectedly?**

A: If your interactive compute node is terminated unexpectedly, check the SLURM job logs for any error messages. You may need to restart your interactive session and re-run your tasks.

.. code-block:: bash

   scontrol show job

Troubleshooting
---------------

Check the following if you encounter issues starting an interactive compute node:

- Ensure that you have specified the correct partition and resources in your ``srun`` command.
- Verify that there are available nodes in the specified partition using ``sinfo -N -l``.
- Check for any error messages in the SLURM job logs that may indicate resource constraints or other issues.
- Refer to the `SLURM documentation `_ for more detailed information on using SLURM commands and options.
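
One way to double-check the partition and resources before requesting a node is to build the ``srun`` command in a small wrapper that can print it without running it. The sketch below is a minimal example under stated assumptions, not part of SLURM or the cluster setup: the function name ``start_interactive`` and the ``DRY_RUN`` and ``PARTITION`` variables are hypothetical, and the defaults are taken from the example command in the steps above. It refuses to run without a walltime.

.. code-block:: bash

   # Hypothetical wrapper around srun: start_interactive <walltime> [cpus] [mem]
   start_interactive() {
       local walltime="${1:-}"
       local cpus="${2:-8}"
       local mem="${3:-64G}"
       local partition="${PARTITION:-any-7i-24xl-spt}"

       # Require a walltime so the job cannot run indefinitely.
       if [ -z "$walltime" ]; then
           echo "usage: start_interactive <walltime> [cpus] [mem]" >&2
           return 1
       fi

       # Assemble the command as the function's positional parameters.
       set -- srun --partition="$partition" --nodes=1 --ntasks=1 \
           --cpus-per-task="$cpus" --mem="$mem" --time="$walltime" \
           --pty bash -i

       if [ -n "${DRY_RUN:-}" ]; then
           # Print the command instead of running it.
           echo "$*"
       else
           "$@"
       fi
   }

Calling ``DRY_RUN=1 start_interactive 00:30:00 4 16G`` prints the assembled command for review, while ``start_interactive 01:00:00`` requests a node for one hour with the default CPU and memory values.
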