.. _dask_scaling:

************
Dask Scaling
************

`Dask <https://docs.dask.org/en/stable/>`_ is a Python library for distributing computation across multiple workers.
We use it to parallelise processing.


Overview
========

A dask cluster is made of:

- A scheduler, a lightweight process that coordinates task execution.
- Workers, that execute the tasks. They may be running on remote nodes and perform the heavy processing.

.. note::

    The dask scheduler, the dask workers and the batch pre-processing pipeline application
    must be running in the same Python environment.


Using BPP with dask
===================

Here we outline how to start a dask cluster on your machine and have BPP use it.

Scheduler
---------

In one terminal, activate the batch pre-processing Python environment and launch the scheduler.
Assuming you followed the instructions in :ref:`installation` to set up the Python environment:

.. code-block:: bash

    cd <BPP_REPOSITORY>
    source .venv/bin/activate
    dask scheduler

This starts the scheduler on the default port 8786.

Workers
-------

In another terminal, launch the desired number of workers (4 in this example).
They will share the CPU cores and RAM of your machine evenly.


.. code-block:: bash

    cd <BPP_REPOSITORY>
    source .venv/bin/activate
    dask worker localhost:8786 --nworkers 4 --resources "process=1"

.. warning::
    
    ``--resources "process=1"`` is required -- explanation below.

Dashboard
---------

You may now open the dask dashboard in a browser, which runs on port 8787 by default.
From there you can observe the progression of tasks in real time;
simply navigate to `<https://localhost:8787>`_.


Pipeline App
------------

Batch pre-processing can now be launched and use the cluster we just started.
Just use the same command line with the ``-d`` or ``--dask-scheduler`` option:

.. code-block:: bash

    ska-sdp-batch-preprocess run -d localhost:8786 <OPTIONS> <INPUT>


Why worker resources?
---------------------

The pipeline CLI app runs DP3 tasks, which use multiple threads.
Dask has no mechanism to detect how many threads a task uses, and assumes that
every task uses 1 thread from the worker's own
Python ``ThreadPool``. This is problematic when running C/C++ code spawning its
own pool of threads on the side, like DP3.

The only reliable solution is to use
`worker resources <https://distributed.dask.org/en/latest/resources.html#worker-resources>`_.
The batch pre-processing pipeline assumes that all workers define a resource
called ``process``; each worker holds 1, and each DP3 task is defined as
requiring 1. When a DP3 task reaches a worker, DP3 is launched with the same
number of threads as the worker officially owns. A worker thus only ever runs
one task at a time, and all threads are used without risk of over-subscription.

The drawback is that resources can only be defined when the workers are
launched; hence the need to add ``--resources "process=1"`` to the worker launch
command.