.. _dask_scaling: ************ Dask Scaling ************ `Dask `_ is a Python library for distributing computation across multiple workers. We use it to parallelise processing. Overview ======== A dask cluster is made of: - A scheduler, a lightweight process that coordinates task execution. - Workers, that execute the tasks. They may be running on remote nodes and perform the heavy processing. .. note:: The dask scheduler, the dask workers and the batch pre-processing pipeline application must be running in the same Python environment. Using BPP with dask =================== Here we outline how to start a dask cluster on your machine and have BPP use it. Scheduler --------- In one terminal, activate the batch pre-processing Python environment and launch the scheduler. Assuming you followed the instructions in :ref:`installation` to set up the Python environment: .. code-block:: bash cd source .venv/bin/activate dask scheduler This starts the scheduler on the default port 8786. Workers ------- In another terminal, launch the desired number of workers (4 in this example). They will share the CPU cores and RAM of your machine evenly. .. code-block:: bash cd source .venv/bin/activate dask worker localhost:8786 --nworkers 4 --resources "process=1" .. warning:: ``--resources "process=1"`` is required -- explanation below. Dashboard --------- You may now open the dask dashboard in a browser, which runs on port 8787 by default. From there you can observe the progression of tasks in real time; simply navigate to ``_. Pipeline App ------------ Batch pre-processing can now be launched and use the cluster we just started. Just use the same command line with the ``-d`` or ``--dask-scheduler`` option: .. code-block:: bash ska-sdp-batch-preprocess run -d localhost:8786 Why worker resources? --------------------- The pipeline CLI app runs DP3 tasks, which use multiple threads. Dask has no mechanism to detect how many threads a task uses, and assumes that every task uses 1 thread from the worker's own Python ``ThreadPool``. This is problematic when running C/C++ code spawning its own pool of threads on the side, like DP3. The only reliable solution is to use `worker resources `_. The batch pre-processing pipeline assumes that all workers define a resource called ``process``; each worker holds 1, and each DP3 task is defined as requiring 1. When a DP3 task reaches a worker, DP3 is launched with the same number of threads as the worker officially owns. A worker thus only ever runs one task at a time, and all threads are used without risk of over-subscription. The drawback is that resources can only be defined when the workers are launched; hence the need to add ``--resources "process=1"`` to the worker launch command.