.. _dask_scaling:
************
Dask Scaling
************
`Dask `_ is a Python library for distributing computation across multiple workers.
We use it to parallelise processing.
Overview
========
A dask cluster is made of:
- A scheduler, a lightweight process that coordinates task execution.
- Workers, that execute the tasks. They may be running on remote nodes and perform the heavy processing.
.. note::
The dask scheduler, the dask workers and the batch pre-processing pipeline application
must be running in the same Python environment.
Using BPP with dask
===================
Here we outline how to start a dask cluster on your machine and have BPP use it.
Scheduler
---------
In one terminal, activate the batch pre-processing Python environment and launch the scheduler.
Assuming you followed the instructions in :ref:`installation` to set up the Python environment:
.. code-block:: bash
cd
source .venv/bin/activate
dask scheduler
This starts the scheduler on the default port 8786.
Workers
-------
In another terminal, launch the desired number of workers (4 in this example).
They will share the CPU cores and RAM of your machine evenly.
.. code-block:: bash
cd
source .venv/bin/activate
dask worker localhost:8786 --nworkers 4 --resources "process=1"
.. warning::
``--resources "process=1"`` is required -- explanation below.
Dashboard
---------
You may now open the dask dashboard in a browser, which runs on port 8787 by default.
From there you can observe the progression of tasks in real time;
simply navigate to ``_.
Pipeline App
------------
Batch pre-processing can now be launched and use the cluster we just started.
Just use the same command line with the ``-d`` or ``--dask-scheduler`` option:
.. code-block:: bash
ska-sdp-batch-preprocess run -d localhost:8786
Why worker resources?
---------------------
The pipeline CLI app runs DP3 tasks, which use multiple threads.
Dask has no mechanism to detect how many threads a task uses, and assumes that
every task uses 1 thread from the worker's own
Python ``ThreadPool``. This is problematic when running C/C++ code spawning its
own pool of threads on the side, like DP3.
The only reliable solution is to use
`worker resources `_.
The batch pre-processing pipeline assumes that all workers define a resource
called ``process``; each worker holds 1, and each DP3 task is defined as
requiring 1. When a DP3 task reaches a worker, DP3 is launched with the same
number of threads as the worker officially owns. A worker thus only ever runs
one task at a time, and all threads are used without risk of over-subscription.
The drawback is that resources can only be defined when the workers are
launched; hence the need to add ``--resources "process=1"`` to the worker launch
command.