Dask

Dask allows distributed computation in Python.

Chart Details

This chart will deploy the following in Kubernetes:

  • 1 x Dask scheduler with ports 8786 (scheduler) and 8787 (web UI) exposed on a ClusterIP (default)

  • 2 x Dask workers that connect to the scheduler

  • 1 x JupyterLab notebook (optional, disabled by default) with port 8888 exposed on a ClusterIP (default)

  • 1 x pipeline job (optional, disabled by default) which can run an SDP pipeline

Note: only chart versions 0.2.0 and later contain the JupyterLab notebook pod.

Note: only chart versions 0.3.0 and later contain the pipeline job. The pipeline job is not part of the original Dask chart; it was added to run SDP pipelines that can run with Dask.

Tip: See the Kubernetes Service Type Docs for the differences between ClusterIP, NodePort, and LoadBalancer.

Installing the Chart

First we need to add the ska-sdp-helm chart repository to our local Helm configuration:

helm repo add ska-sdp-helm https://artefact.skao.int/repository/helm-internal
helm repo update

To install the dask chart with the release name test:

helm install test ska-sdp-helm/ska-sdp-helmdeploy-dask

Depending on how your cluster was set up, you may also need to specify a namespace with the following flag: --namespace my-namespace.
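For example, assuming a namespace named my-namespace already exists:

helm install test ska-sdp-helm/ska-sdp-helmdeploy-dask --namespace my-namespace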

Default Configuration

The following tables list the configurable parameters of the Dask chart and their default values. Note that container images are not provided by default; you must specify them in a custom values.yaml file or as command-line arguments.
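As a minimal sketch, a custom values.yaml supplying the images could look like the following (the image name and tag here are illustrative placeholders, not images shipped with the chart):

---
# values.yaml -- image name and tag are illustrative placeholders
scheduler:
  image: artefact.skao.int/my-dask-image
  imageTag: "1.0.0"

worker:
  image: artefact.skao.int/my-dask-image
  imageTag: "1.0.0"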

Dask scheduler

| Parameter              | Description             | Default   |
|------------------------|-------------------------|-----------|
| scheduler.name         | Dask scheduler name     | scheduler |
| scheduler.image        | Container image name    | ""        |
| scheduler.imageTag     | Container image tag     | ""        |
| scheduler.replicas     | k8s deployment replicas | 1         |
| scheduler.tolerations  | Tolerations             | []        |
| scheduler.nodeSelector | nodeSelector            | {}        |
| scheduler.affinity     | Container affinity      | {}        |
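For example, the scheduler can be pinned to particular nodes using the standard Kubernetes nodeSelector and tolerations fields; the node label and taint values below are purely illustrative:

---
# scheduler-placement.yaml -- label and taint values are illustrative
scheduler:
  nodeSelector:
    kubernetes.io/arch: amd64
  tolerations:
    - key: dedicated
      operator: Equal
      value: dask
      effect: NoSchedule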

Dask webUI

| Parameter                 | Description                        | Default             |
|---------------------------|------------------------------------|---------------------|
| webUI.name                | Dask web UI name                   | webui               |
| webUI.servicePort         | k8s service port                   | 80                  |
| webUI.ingress.enabled     | Enable ingress controller resource | false               |
| webUI.ingress.hostname    | Ingress resource hostname          | dask-ui.example.com |
| webUI.ingress.tls         | Ingress TLS configuration          | false               |
| webUI.ingress.secretName  | Ingress TLS secret name            | dask-scheduler-tls  |
| webUI.ingress.annotations | Ingress annotations configuration  | null                |
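As a sketch, the web UI can be exposed through an ingress by enabling it in a values file. The hostname and secret name below are the chart defaults, and the annotation assumes an NGINX ingress controller; adjust these for your cluster:

---
# webui-ingress.yaml -- hostname, secret, and annotation are examples
webUI:
  ingress:
    enabled: true
    hostname: dask-ui.example.com
    tls: true
    secretName: dask-scheduler-tls
    annotations:
      kubernetes.io/ingress.class: nginx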

Dask worker

| Parameter           | Description                      | Default |
|---------------------|----------------------------------|---------|
| worker.name         | Dask worker name                 | worker  |
| worker.image        | Container image name             | ""      |
| worker.imageTag     | Container image tag              | ""      |
| worker.replicas     | k8s HPA and deployment replicas  | 2       |
| worker.resources    | Container resources              | {}      |
| worker.tolerations  | Tolerations                      | []      |
| worker.nodeSelector | nodeSelector                     | {}      |
| worker.affinity     | Container affinity               | {}      |
| worker.port         | Worker port (defaults to random) | ""      |

Jupyter

| Parameter           | Description                     | Default |
|---------------------|---------------------------------|---------|
| jupyter.name        | Jupyter name                    | jupyter |
| jupyter.enabled     | Include optional Jupyter server | false   |
| jupyter.image       | Container image name            | ""      |
| jupyter.imageTag    | Container image tag             | ""      |
| jupyter.replicas    | k8s deployment replicas         | 1       |
| jupyter.servicePort | k8s service port                | 80      |
| jupyter.resources   | Container resources             | {}      |
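To include the optional Jupyter server, enable it and point it at an image, for example (the image name and tag are placeholders):

---
# jupyter-values.yaml -- image name and tag are illustrative placeholders
jupyter:
  enabled: true
  image: artefact.skao.int/my-jupyter-image
  imageTag: "1.0.0"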

Pipeline Job

| Parameter                    | Description                                        | Default          |
|------------------------------|----------------------------------------------------|------------------|
| pipeline.enabled             | Run pipeline after deployment of the cluster       | false            |
| pipeline.name                | Name of the pipeline                               | ""               |
| pipeline.command             | Command to invoke                                  | ""               |
| pipeline.args                | Arguments to the pipeline                          | []               |
| pipeline.schedulerOptionName | Name of the CLI argument to pass the scheduler IP  | --dask-scheduler |

Running an SDP pipeline

An SDP pipeline can be run as a k8s job in this Helm chart. The job is started after both the Dask scheduler and Dask workers are created.

To avoid pickling overheads, the pipeline functions, scheduler, workers, and job all use the same image. This is especially necessary if there are calls to non-Python libraries.

Note: The IP of the Dask scheduler is added to the pipeline's args as --dask-scheduler=<scheduler-service-ip>. The name of this option can be changed with the pipeline.schedulerOptionName key.

Note: The job uses the volume information from the worker.volume.name and worker.volume.path keys. These are required for data access when running a pipeline.

  1. It is mandatory to pass in the image name, command, args, and the folder to be mounted, along with a PVC. This information would typically be populated upstream by the SDP system.

    ---
    # pipeline_values.yaml
    image: artefact.skao.int/ska-sdp-spectral-line-imaging:0.3.0

    worker:
      volume:
        name: pvc-mnt-data
        path: /mnt/data

    pipeline:
      enabled: true
      name: my-sdp-pipeline
      command: my-sdp-pipeline-cli
      schedulerOptionName: --dask-scheduler
      args:
        - run
        - --input
        - /mnt/data/input.ms
    
  2. Install the chart by running:

    helm install -n <namespace> pipeline-test charts/dask --values pipeline_values.yaml
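Given the example values above, the pipeline container ends up running a command roughly equivalent to the following (the exact argument ordering may differ, and <scheduler-service-ip> is resolved at deploy time):

my-sdp-pipeline-cli run --input /mnt/data/input.ms --dask-scheduler=<scheduler-service-ip>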
    

Custom Configuration

If you want to change the default parameters, you can do this in two ways.

YAML Config Files

You can override the default parameters in values.yaml by creating your own custom YAML config file with the updated parameters and specifying it via the -f flag when installing the chart. For example:

helm install test ska-sdp-helm/ska-sdp-helmdeploy-dask -f values.yaml

Command-Line Arguments

If you want to change the parameters for a specific install without changing values.yaml, you can use the --set key=value[,key=value] flag when running helm install, and it will override any default values. For example:

helm install test ska-sdp-helm/ska-sdp-helmdeploy-dask --set jupyter.enabled=false
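Several parameters can be overridden in one install; for example (the image name and tag are placeholders):

helm install test ska-sdp-helm/ska-sdp-helmdeploy-dask --set worker.replicas=4 --set worker.image=artefact.skao.int/my-dask-image,worker.imageTag=1.0.0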