********************************** Container Orchestration Guidelines ********************************** This section describes a set of standards, conventions and guidelines for deploying application suites on Container Orchestration technologies. For hands-on training please check: `Containerisation and Orchestration Concept PDF `_ `Containerisation and Orchestration exercises `_ .. warning:: Information regarding examples on these pages may be out of date .. contents:: Table of Contents Overview of Standards ===================== These standards, best practices and guidelines are based on existing industry standards and tooling. The main references are: * `Cloud Native Computing Foundation `_. * `Docker v2 Registry API Specification `_. * `Container Network Interface `_. * `Container Storage Interface `_. * `Open Container Initiative image specification `_. * `Open Container Initiative run-time specification `_. The standards are broken down into the following areas: * Structuring application suites for orchestration - general guidelines for breaking up application suites for running under container orchestration * Defining and building cloud native application suites - resource definitions, configuration, platform resource integration * Kubernetes primitives - a more detailed look at key components: Pods, Services, Ingress * Scheduling and running cloud native application suites - scheduling, execution, monitoring, logging, diagnostics, security considerations Throughout this documentation, `Kubernetes `_ in conjunction with `Helm `_ is used as the reference implementation, with the canonical versions being Kubernetes v1.16.2 and Helm v3.1.2. However, the aim is to target compliance with the OCI specifications and CNCF guidelines, so it is possible to substitute in alternative Container Orchestration solutions and tooling. A set of example Helm Charts is provided in the repository `container-orchestration-chart-examples `_. These can be used to get an overall idea of how the components of a chart function together, and how the life cycle of a chart can be managed with ``make``. Structuring application suites for Orchestration ================================================ In order to understand how to structure application suites for orchestration, we first need to understand what the goals of Cloud Native software engineering are. What is Cloud Native -------------------- Cloud Native is the embodiment of modern software delivery practices supported by tools, frameworks, processes and platform interfaces. These capabilities are the next evolution of Cloud Computing, raising the level of abstraction for all actors interacting with the architecture, from the hardware unit to the application component. What does this mean? Developers and system operators (DevOps) interface with the platform architecture using abstract resource concepts, and should have next to no concern regarding the plumbing or wiring of the platform, while still being able to deploy and scale applications according to cost and usage.
Cloud Native exploits the advantages of the Cloud Computing delivery model: * PaaS (Platform as a Service) layered on top of IaaS (Infrastructure as a Service) * CI/CD (Continuous Integration/Delivery) – fully automated build, test, deploy * Modern DevOps – auto-scaling, monitoring feedback loop to tune resource requirements * Software abstraction from platform compute, network, storage * Portability across Cloud Services providers Why Cloud Native SDLC (Software Development Life Cycle)? .. _figure-1-kubernetes-sdlc: .. figure:: ../images/Kubernetes-SDLC.dio.png :scale: 60% :alt: Cloud Native SDLC :align: center :figclass: figborder How Kubernetes fits into the Cloud Native SDLC Kubernetes provides cohesion for distributed projects: * Codify standards through implementing testing gates * Ensures code quality, consistency and predictability of deployment success – CI/CD * Automation – build AND rebuild for zero day exploits at little cost * Portability of SDI (Software Defined Infrastructure) as well as code * Provides a codified reference implementation of best practices, and exemplars * Enables broad engagement – an open and collaborative system - a “Social Coding Platform” * Consistent set of standards for integration with SRC (SKA Regional Centres), and other projects – the future platform of integrated science projects through shared resources enabled by common standards How does orchestration work --------------------------- At the core of Cloud Native is the container orchestration platform. For the purposes of these guidelines, this consists of Kubernetes as the orchestration layer, over Docker as the container engine. .. _figure-2-kubernetes-architecture: .. figure:: ../images/Kubernetes-Architecture.dio.png :scale: 60% :alt: Kubernetes Architecture :align: center :figclass: figborder The architecture of Kubernetes at the centre of the Cloud Native platform Kubernetes provides an abstraction layer from hardware infrastructure resources enabling compute, network, storage, and other dependent services (other applications) to be treated as abstract concepts. A computing cluster is not a collection of machines but instead is an opaque pool of resources that are advertised for availability through a consistent REST based API. These resources can be customised to provide access to and accounting of specialised devices such as GPUs. Through the Kubernetes API, the necessary resources that make up an application suite (compute, network, storage) are addressed as objects in an idempotent way that declares the desired state eg: this number of Pods running these containers, backed by this storage, on that network. The scheduler will constantly move the cluster towards this desired state, including in the event of application or node/hardware failure. This builds in robustness and auto-healing. See :doc:`kubernetes-introduction` for a general introduction. Both platform and service resources can be classified by performance characteristics and reservation criteria using labelling, which in turn are used by scheduling algorithms to determine optimum placement of workloads across the cluster. All applications are deployed as sets of one or more containers in a minimum configuration called a `Pod `_. Pods are the minimum scalable unit, and are distributed and replicated across the cluster according to the scheduling algorithm. A Pod is essentially a single Kernel namespace holding one or more containers.
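As a concrete illustration of this declarative, desired-state style, the following is a minimal sketch of a bare Pod; the name and image are purely illustrative and not part of any SKA deployment.

.. code:: yaml

   ---
   apiVersion: v1
   kind: Pod
   metadata:
     name: example-device               # illustrative name
     labels:
       app.kubernetes.io/name: example-device
   spec:
     containers:
       - name: main                     # single container sharing the Pod namespaces
         image: alpine                  # illustrative image
         command: ["sleep", "3600"]

Applying such a manifest with ``kubectl apply -f pod.yaml`` declares the desired state, and Kubernetes works to converge the cluster towards it; in practice a Pod definition like this is normally wrapped in one of the controllers described below.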
It only makes sense to put together containers that are essentially tightly coupled and logically indivisible by design. These Pods can be scheduled in a number of patterns using `Controllers (full list) `_ including bare Pod (a single Pod instance), `Deployment `_ (a replicated Pod set), `StatefulSet `_ (a Deployment with certain guarantees about naming and ordering of replicated units), `DaemonSets `_ (one Pod per scheduled compute node), and `Job/CronJob `_ (run to completion applications). A detailed discussion of these features can be found in the main Kubernetes documentation under `Concepts `_. Structuring Application Suites ------------------------------ Architecting software to run in an orchestration environment builds on the guidelines given in the :ref:`Container Standards 'Structuring Containerised Applications' ` section. The key concept is treating run time containers as immutable and atomic applications, where any application state is explicitly dealt with through connections to storage mechanisms. The application should be broken into components that represent: * a component with an independent development lifecycle * an individual process that performs a discrete task such as a micro service, specific database/web service, device, computational task etc. * a component that exposes a specific service to another application eg. a micro service or database * a reusable component that is applicable to multiple application deployments eg. a co-routine or proximity dependent service (logger, metrics collector, network helper, private database etc) * an independently scalable unit that can be replicated to match demand * the minimum unit required to match a resource profile at scheduling time such as storage, memory, cpu, specialised device Above all, design software to scale horizontally through a UNIX process model so that individual components that have independent scaling characteristics can be replicated independently. The application interface should be through the standard :ref:`container run time ` interface contract (a minimal sketch of this contract is given at the end of this section): * inputs come via a configurable Port * outputs go to a configurable network service * logging goes to stdout/stderr and syslog and uses JSON to enrich metadata (see :ref:`Container Standards 'Logging' `) * metrics are advertised via a standard such as `Prometheus Exporters `_, or emit metrics in a JSON format over TCP consumable by `ETL `_ services such as `LogStash `_ * configuration is passed in using environment variables, and simple configuration files (eg: ini, or key/value pairs). * POSIX compliant storage IO is facilitated by bind mounted volumes. * connections to DBMS, queuing technologies and object storage are managed through configuration. * applications should have builtin recoverability so that prior state and context is automatically discovered on restart. This enables the cluster to auto-heal by re-launching workloads on other resources when nodes fail (a critical aspect of a micro-services architecture). By structuring an application in this fashion, it can scale from the single instance desktop development environment up to a large parallel deployment in production without needing built-in knowledge of the plumbing and wiring of each specific environment, because this is handled through external configuration at the Infrastructure management layer.
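The sketch below shows what this run time contract can look like as a fragment of a Pod or Deployment container specification; the component name, image and values are assumptions for illustration only.

.. code:: yaml

   # Illustrative container spec fragment honouring the run time interface contract
   containers:
     - name: metrics-collector                # hypothetical reusable component
       image: example/metrics-collector:1.0   # hypothetical image
       env:                                   # configuration via environment variables
         - name: LISTEN_PORT
           value: "8080"
         - name: OUTPUT_SERVICE
           value: "logstash:5000"             # outputs go to a configurable network service
       ports:
         - name: http
           containerPort: 8080                # inputs come via a configurable port
       resources:                             # resource profile matched at scheduling time
         requests:
           cpu: 100m
           memory: 128Mi
       volumeMounts:
         - name: data                         # POSIX storage via a bind mounted volume
           mountPath: /data
       # logging is written to stdout/stderr and collected by the platform

Everything environment specific (the service endpoints, the volume backing the ``data`` mount) is injected by the deployment layer rather than baked into the image.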
Example: Tango Controls ----------------------- To help illustrate the Cloud Native application architecture concepts, a walk through of a Tango application suite is used. A Tango Controller System environment is typically made up of the following: * Database containing the system state eg: MySQL. * DatabaseDS Tango device server. * One or more Tango devices. * Optional components - Tango REST interface, Tango logviewer, SysAdmin and debugging tools such as Astor and Jive. These components map to the following Kubernetes resources: * MySQL Database == StatefulSet. * DatabaseDS == Deployment or StatefulSet. * Tango REST interface == Deployment. * Tango Device == bare Pod, or single replica Deployment. This example does not take into consideration an HA deployment of MySQL, treating MySQL as a single instance StatefulSet. Using a StatefulSet in this case gives the following guarantees above a Deployment: * Stable unique network identifiers. * Stable persistent storage. * Ordered graceful deployment and scaling. * Ordered automated rolling updates. These characteristics are useful for stable service types such as databases and message queues. DatabaseDS is a stateless and horizontally scalable service in its own right (state comes from MySQL). This makes it a fit for the Deployment (which in turn uses a ReplicaSet) or the StatefulSet. Deployments are a good fit for stateless components that require high availability through mechanisms such as rolling upgrades. The Tango Devices are single instance applications that act as a proxy between the 'real' hardware being controlled and the DatabaseDS service that provides each Tango Device with a gateway to the Tango cluster state database (MySQL). Considering that in most cases an upgrade to a Device Pod is likely to be a delete and replace, we can use the simplest case of a bare Pod, which will enable us to name each Pod after its intended device without the random suffix generated for Deployments. Example: MPI jobs ----------------- A typical MPI application consists of a head node, and worker nodes with the (run to completion) job being launched from the head node, which in turn controls the work distribution over the workers. This can be broken into: * a generic component type that covers head node and worker nodes. * a launcher that triggers the application on the designated head node. These components map to the following Kubernetes resources: * Worker node == DaemonSet or StatefulSet. * Launcher and Head node == Job. MPI jobs typically only require a single instance per physical compute node, and this is exactly the use case of DaemonSets, where Kubernetes ensures exactly one instance of a Pod is running on each designated node. Using Jobs enables the launcher and the head node to be combined. Both Job and DaemonSet Pods will most likely need the same libraries and tools from MPI, so they can be combined into a single container image. Linking Components Together --------------------------- Components of an application suite, or even separate suites, should use `DNS `_ for service discovery. This is achieved by using the `Service `_ resource. Services should always be declared before Pods so that the automatic generation of associated Environment Variables happens in time for the subsequent Pods to discover them.
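As a minimal sketch of how this works for the Tango example above (the port and labels are assumptions for illustration), a ``Service`` exposing the TangoDB ``Pods`` by label selection might look like:

.. code:: yaml

   ---
   apiVersion: v1
   kind: Service
   metadata:
     name: tangodb                        # becomes the DNS name: tangodb / tangodb.<namespace>
   spec:
     type: ClusterIP
     selector:
       app.kubernetes.io/name: tangodb    # matches the labels on the TangoDB Pods
     ports:
       - name: mysql
         protocol: TCP
         port: 3306                       # port clients connect to
         targetPort: 3306                 # port exposed by the MySQL container

Any Pod in the same namespace can then reach the database as ``tangodb:3306``, and Pods created after the Service also receive ``TANGODB_SERVICE_HOST``/``TANGODB_SERVICE_PORT`` style environment variables.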
Service names are permanent and predictable, and are tied to the `Namespace `_ that an application suite is deployed in. For example, in the namespace ``test``, the DatabaseDS Tango component can find the MySQL database ``tangodb`` using the name ``tangodb`` or ``tangodb.test``, which is distinctly different from the instance running in the ``qa`` namespace, also named ``tangodb`` but addressable as ``tangodb.qa``. This greatly simplifies configuration management for software deployment. Defining and building cloud native application suites ===================================================== All Kubernetes resource objects are described through the `REST based API `_. The representations of the API documents are in either JSON or YAML; however, the preference is for YAML as the description language, as this tends to be more human readable. The API representations are declarative, specifying the end desired state. It is up to the Kubernetes scheduler to make this a reality. It is important to use generic syntax and Kubernetes resource types. Specialised resource types reduce portability of resource descriptors and templates, and increase dependency on 3rd party integrations. This could lead to upgrade paralysis because the SDLC is out of our control. An example of this might be using a non-standard 3rd party Database Operator for MySQL instead of the official `Oracle `_ one. Metadata -------- Each resource is described with: * apiVersion - API version that this document should invoke * kind - resource type (object) that is to be handled * metadata - descriptive information including name, labels, annotations, namespace, ownership, references * spec(ification) - the body of the specification for this resource type denoted by `kind` The following is an example of the start of a StatefulSet for the Tango DatabaseDS: .. container:: toggle .. container:: header Resource description .. code:: yaml --- apiVersion: apps/v1 kind: StatefulSet metadata: name: databaseds-integration-tmc-webui-test labels: app.kubernetes.io/name: databaseds-integration-tmc-webui-test helm.sh/chart: integration-tmc-webui-0.1.0 app.kubernetes.io/instance: test app.kubernetes.io/managed-by: helm spec: ... Namespaces ~~~~~~~~~~ Even though it is possible to specify the namespace directly in the Metadata, it **SHOULD NOT** be, as this reduces the flexibility of any resource definition and templating solution employed such as Helm. The namespace can be specified at run time eg: ``kubectl --namespace test apply -f resource-file.yaml``. Name and Labels ~~~~~~~~~~~~~~~ Naming and labelling of all resources associated with a deployment should be consistent. This ensures that deployments that land in the same namespace can be identified along with all inter-dependencies. This is particularly useful when using the ``kubectl`` command line tool, as label based filtering can be employed to sieve out all related objects. Labels are entirely flexible and free form, but as a minimum specify: * the ``name`` and ``app.kubernetes.io/name`` with the same identifier with sufficient precision that the same application component deployed in the same namespace can be distinguished eg: a concatenation of --. ``name`` and ``app.kubernetes.io/name`` are duplicated because label filter interaction between resources relies on labels eg: ``Service`` exposing ``Pods`` of a ``Deployment``. * the labels of the deployment suite such as the ``helm.sh/chart`` for Helm, including the version.
* the ``app.kubernetes.io/instance`` (which is ``release``) of the deployment suite. * ``app.kubernetes.io/managed-by`` what tooling is used to manage this deployment - most likely ``helm``. Optional extras which are also useful for filtering are: * ``app.kubernetes.io/version`` the component version. * ``app.kubernetes.io/component`` the component type (most likely related to the primary container). * ``app.kubernetes.io/part-of`` what kind of application suite this component belongs to. The recommended core label set are described under `Kubernetes common labels `_. .. code:: yaml metadata: name: databaseds-integration-tmc-webui-test labels: app.kubernetes.io/name: databaseds-integration-tmc-webui-test helm.sh/chart: integration-tmc-webui-0.1.0 app.kubernetes.io/instance: test app.kubernetes.io/version: "1.0.3" app.kubernetes.io/component: databaseds app.kubernetes.io/part-of: tango app.kubernetes.io/managed-by: helm Using this labelling scheme enables filtering for all deployment related objects eg: ``kubectl get all -l helm.sh/chart=integration-tmc-webui-0.1.0,app.kubernetes.io/instance=test``. .. container:: toggle .. container:: header kubectl label filtering .. code:: bash $ kubectl get all,configmaps,secrets,pv,pvc -l helm.sh/chart=integration-tmc-webui-0.1.0,app.kubernetes.io/instance=test NAME READY STATUS RESTARTS AGE pod/databaseds-integration-tmc-webui-test-0 1/1 Running 0 55s pod/rsyslog-integration-tmc-webui-test-0 1/1 Running 0 55s pod/tangodb-integration-tmc-webui-test-0 1/1 Running 0 55s pod/tangotest-integration-tmc-webui-test 1/1 Running 0 55s pod/webjive-integration-tmc-webui-test-0 0/6 Init:0/1 0 55s NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE service/databaseds-integration-tmc-webui-test ClusterIP None 10000/TCP 55s service/rsyslog-integration-tmc-webui-test ClusterIP None 514/TCP,514/UDP 55s service/tangodb-integration-tmc-webui-test ClusterIP None 3306/TCP 55s service/webjive-integration-tmc-webui-test ClusterIP 10.97.135.8 80/TCP,5004/TCP,3012/TCP,8080/TCP,27017/TCP 55s NAME READY AGE statefulset.apps/databaseds-integration-tmc-webui-test 1/1 55s statefulset.apps/rsyslog-integration-tmc-webui-test 1/1 55s statefulset.apps/tangodb-integration-tmc-webui-test 1/1 55s statefulset.apps/webjive-integration-tmc-webui-test 0/1 55s NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE persistentvolume/rsyslog-integration-tmc-webui-test 10Gi RWO Retain Bound default/rsyslog-integration-tmc-webui-test standard 56s persistentvolume/tangodb-integration-tmc-webui-test 1Gi RWO Retain Bound default/tangodb-integration-tmc-webui-test standard 55s persistentvolume/webjive-integration-tmc-webui-test 1Gi RWO Retain Bound default/webjive-integration-tmc-webui-test standard 55s NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE persistentvolumeclaim/rsyslog-integration-tmc-webui-test Bound rsyslog-integration-tmc-webui-test 10Gi RWO standard 56s persistentvolumeclaim/tangodb-integration-tmc-webui-test Bound tangodb-integration-tmc-webui-test 1Gi RWO standard 55s persistentvolumeclaim/webjive-integration-tmc-webui-test Bound webjive-integration-tmc-webui-test 1Gi RWO standard 55s .. 
_orchestration-guidelines-templating-application: Templating the Application -------------------------- While it is entirely possible to define all the necessary resources for an application suite to be deployed on Kubernetes in individual YAML files or a single YAML file, this approach is static and quickly reveals its limitations in terms of creating reusable and composable application suites. This is where `Helm Charts `_ have been adopted by the Kubernetes community as the leading templating solution for deployment. Helm provides a mechanism for generically describing an application suite, separating out configuration, and rolling out deployment releases, all done in a declarative 'configuration as code' style. All Helm Charts should target a minimum of three environments: * Minikube - the standalone developer environment. * CI/CD - the Continuous Integration testing environment which is typically the same benchmark as Minikube. * Production Cluster - the target production Kubernetes environment. Minikube should be the default target environment for a Chart, as this will have the largest audience and should be optimised to work without modification of any configuration if possible. When designing a Chart it is important to have clear separation of concerns: * the application - essentially the containers to run. * configuration - any variables that influence the application run time. * resources - any storage, networking, configuration files, secrets, ACLs. The general structure of a Chart should follow: .. code:: bash charts/myapp/ Chart.yaml # A YAML file containing information about the chart and listing # dependencies for the chart (refer to Helm 2 vs Helm 3 differences). LICENSE # OPTIONAL: A plain text file containing the license for the chart README.md # OPTIONAL: A human-readable README file values.yaml # The default configuration values for this chart charts/ # A directory containing any charts upon which this chart depends. templates/ # A directory of templates that, when combined with values, # will generate valid Kubernetes manifest files. templates/NOTES.txt # OPTIONAL: A plain text file containing short usage notes templates/tests # A directory of test templates for running with 'helm test' All template files in the ``templates/`` directory should be named in a readily identifiable way after the component that they contain, and if further clarification is required then they should be suffixed with the ``Kind`` of resource eg: ``tangodb.yaml`` contains the ``StatefulSet`` for the Tango database, and ``tangodb-pv.yaml`` contains the ``PersistentVolume`` declaration for the Tango database. ``ConfigMaps`` should be clustered in ``configmaps.yaml`` and ``Secrets`` in ``secrets.yaml``. The aim is to make it easy for others to understand the layout of the application suite being deployed. Helm sub-chart architecture ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Introduction to subcharts ************************* A chart can have one or more dependency charts, called sub-charts. According to the `helm documentation `_: * a chart is stand-alone (cannot depend on a parent chart), * a sub-chart cannot access the values of its parent, * a parent chart can override values for its sub-charts and * all charts (parent and sub-chart) can access the global values. Let's consider two charts, A and B, where A depends on B. The Chart.yaml file of chart A will specify the dependency, and in its values file it is possible for chart A to override any value of chart B. The following figure shows how to do it: ..
_figure-subcharts-1: .. figure:: ../images/A-parent-B.png :scale: 60% :alt: A parent of B :align: center :figclass: figborder Chart A parent of chart B It is also important to consider the `operational aspects of using dependencies `_ which state that when Helm installs/upgrades a chart, the Kubernetes objects from the chart and all its dependencies are * aggregated into a single set; then * sorted by type followed by name; and then * created/updated in that order. This means that if chart A defines the following k8s resources: * namespace "A-Namespace" * statefulset "A-StatefulSet" * service "A-Service" * and chart B defines the following k8s resources: * namespace "B-Namespace" * replicaset "B-ReplicaSet" * service "B-Service" Then the result of the helm install command for chart A will be: * A-Namespace * B-Namespace * A-Service * B-Service * B-ReplicaSet * A-StatefulSet. Subcharts architecture ********************** Considering the `Module Views `_ for the evolutionary prototype (section "Primary representation: MVP Uses in Kubernetes Deployment"), a partial dependency diagram for the helm charts available within the gitlab.com/ska-telescope group can be represented by the following diagram: .. _figure-subcharts-2: .. figure:: ../images/simple-skampi.png :scale: 60% :alt: Simple skampi diagram :align: center :figclass: figborder Simple skampi diagram All charts depend on the tango-base and, in general, all charts could need the archiver and the webjive interface. At the moment, this is modelled in the skampi repository, where there is one parent chart called skampi and all other charts are its subcharts. They are installed with Helm templating instead of a normal installation. There are a number of disadvantages in this model, specifically: * Common testing: one place for all integration testing. No clear distinction between system and integration tests * Not easy to find logs: many tests on the same namespace * Same namespace for many deployments * No versioning: charts are not versioned Three solutions have been proposed and described in the `Supporting model `_ page: #. One parent chart (umbrella) that contains everything needed and Subcharts with no dependency #. Charts with dependencies and Subcharts enabled by levels #. Charts with dependencies and Subcharts enabled with conditions and tags The chosen solution is a hybrid approach which enables a single-level hierarchy for the shared charts and umbrella charts for chart composition (i.e. specific deployment or testing purposes). The rationale is: * Every chart can be deployed with its own tango eco-system * Every chart can have tango-base, webjive and the archiver as dependencies .. _figure-subcharts-3: .. figure:: ../images/tmc-shared-charts.png :scale: 60% :alt: Chart TMC with shared charts :align: center :figclass: figborder Chart TMC with shared charts Every dependency must have a common condition on it, so that it will be possible to disable the shared charts if they are included in the parent umbrella. For instance, if there is the need (for testing purposes) to have the TMC and the OET charts together, the result will be: .. _figure-subcharts-4: .. figure:: ../images/tmc-oet-umbrella.png :scale: 60% :alt: Umbrella chart with tmc and oet :align: center :figclass: figborder Umbrella chart with tmc and oet charts The initial model will become: .. _figure-subcharts-5: ..
figure:: ../images/umbrella-skampi.png :scale: 60% :alt: Umbrella chart for skampi :align: center :figclass: figborder Umbrella chart for skampi: initial model refactored Gitlab Helm/k8s testing pipeline ******************************** In order to enable the GitLab pipeline to deploy and test the specific component, each ska-telescope repository must: * contain at least one helm chart (i.e. starting point is skampi charts): `link to example `__ * have an environment (i.e. test): `link to example `__ * adopt the Makefile for k8s testing: `link to example `__ Also, note that each project/repository in the ska-telescope group has a `Kubernetes cluster already enabled `_. The test job of the GitLab pipeline needs to be: .. code:: yaml test: stage: test tags: - docker-executor image: artefact.skao.int/ska-cicd-k8s-tools-build-deploy:0.6.1 script: - kubectl version - make install-chart - make wait - make smoketest - make test after_script: - make uninstall-chart - make delete_namespace environment: name: test kubernetes: namespace: ci-$CI_PROJECT_NAME-$CI_COMMIT_SHORT_SHA artifacts: name: "$CI_PROJECT_NAME-$CI_JOB_ID" paths: - "charts/build" reports: junit: charts/build/report.xml where: * make install-chart: installs the chart in the namespace specified in the environment tag * make wait: waits for all jobs to be completed and all pods to be running * make smoketest: checks that no containers are waiting * make test: 1. Creates a pod in the specified namespace 2. Runs the pytest tests 3. Returns the test results * after_script: removes everything after the tests The artifacts are the output of the tests; they contain the report in both xml and json, as well as other information such as the pytest output. Tango-util library chart ************************ A library chart is a type of Helm chart that defines chart primitives or definitions which can be shared by Helm templates in other charts. In SKAMPI, many charts are collections of device servers, so it is possible to harmonize their definition with a library and keep the charts `DRY `_. The following diagram shows the data model for the harmonized values file: .. _figure-subcharts-6: ..
figure:: ../images/values-data-model.png :scale: 100% :alt: Data model for the values file :align: center :figclass: figborder Data model for the values file Elements: +--------------------------+-----------------------------------------------------------------------------------------------+ | Element | Description | +--------------------------+-----------------------------------------------------------------------------------------------+ | Chart | collection of files that describe a related set of Kubernetes resources | +--------------------------+-----------------------------------------------------------------------------------------------+ | Values | built-in objects of helm which provides access to values passed into the chart for templating | +--------------------------+-----------------------------------------------------------------------------------------------+ | DsConfig | dsconfig file configuration | +--------------------------+-----------------------------------------------------------------------------------------------+ | DeviceServer | TANGO Device Server | +--------------------------+-----------------------------------------------------------------------------------------------+ | Device | TANGO device | +--------------------------+-----------------------------------------------------------------------------------------------+ | Global | Global values accessible by all charts | +--------------------------+-----------------------------------------------------------------------------------------------+ | Labels | to be added to all Kubernetes resources | +--------------------------+-----------------------------------------------------------------------------------------------+ | Environment variables | Name/Value pair available in shell | +--------------------------+-----------------------------------------------------------------------------------------------+ | Image | Detail of the docker image to be used | +--------------------------+-----------------------------------------------------------------------------------------------+ | ResourceRequestandLimits | struct for characterise the resource requests and limits for a device server | +--------------------------+-----------------------------------------------------------------------------------------------+ | DB | struct for characterise a DB software application | +--------------------------+-----------------------------------------------------------------------------------------------+ Rationale: * Almost all helm charts in the Skampi repository are device server configurations so it appears natural to start the modelling from that concept * The depends_on relationship has been added so that it is possible to extract the dependency map of the MVP prototype * Every chart of the ska-telescope can have the shared charts in the dependency list * The annotations block has been added to enable GitLab’s Deploy Boards * The DeviceServer struct specifies the shell args so that it is possible to start the related container instance of the linked image Advantages ********** With this architecture, a number of advantages can be obtained: * By using a separate deployment (i.e. Namespace) for each test, searching for all the logs of a particular test will be easy: example * Requires teams to create versions of docker images and charts * Avoids the use of docker-compose in favour of Kubernetes testing * Harmonized values yml files (for «common» definitions i.e. 
TANGO device servers) * Unit and integration testing within the repositories of teams * Skampi testing becomes system testing .. _helm-best-practices: Helm Best Practices ~~~~~~~~~~~~~~~~~~~ The Helm community have a well defined set of `best practices `_. The following highlights key aspects of these practices that will help with achieving consistency and reliability. * charts should be placed in a ``charts/`` directory within the parent project. * chart names should be lowercase and hyphenated and must match the directory name eg. ``charts/my-app``. * ``name``, ``version``, ``description``, ``home``, ``maintainers`` and ``sources`` must be included. * ``version`` must follow the `Semantic Versioning `_ standards. * the chart must pass the ``helm lint charts/`` test. .. warning:: **Helm 2 vs Helm 3** It should be noted that we have now migrated to using Helm 3. Feel free to upgrade Helm in your development environments using our Ansible Playbook ``upgrade_helm.yml`` found in the `SKA Ansible Playbooks repository `_. There are a few changes that may impact specific cases, to read up on them please read up at `This blog post `_, as well as on Helm's own `FAQ page `_. Example ``Chart.yaml`` file: .. code:: yaml name: my-app version: 1.0.0 description: Very important app keywords: - magic - mpi home: https://www.skatelescope.org/ icon: http://www.skatelescope.org/wp-content/uploads/2016/07/09545_NEW_LOGO_2014.png sources: - https://gitlab.com/ska-telescope/my-app maintainers: - name: myaccount email: myacount@skatelescope.org Metadata with Helm ~~~~~~~~~~~~~~~~~~ All resources should have the following boilerplate metadata to ensure that all resources can be uniquely identified to the chart, application and release: .. code:: yaml ... metadata: name: -{{ template "my-app.name" . }}-{{ .Release.Name }} labels: app.kubernetes.io/name: -{{ template "my-app.name" . }}-{{ .Release.Name }} helm.sh/chart: "{{ .Chart.Name }}-{{ .Chart.Version }}" app.kubernetes.io/instance: "{{ .Release.Name }}" app.kubernetes.io/managed-by: "{{ .Release.Service }}" ... Defining resources ~~~~~~~~~~~~~~~~~~ The `Helm templating language `_ is based on `Go template `_. All resources go in the ``templates/`` directory with the general rule is one Kubernetes resource per template file. Files that render resources are suffixed ``.yaml`` whilst files that contain expressions and macros only go in files suffixed ``.tpl``. .. container:: toggle .. container:: header Sample resource template for a Service generated by 'helm create mychart' .. code:: yaml apiVersion: v1 kind: Service metadata: name: {{ include "mychart.fullname" . }} labels: app.kubernetes.io/name: {{ include "mychart.name" . }} helm.sh/chart: {{ include "mychart.chart" . }} app.kubernetes.io/instance: {{ .Release.Name }} app.kubernetes.io/managed-by: {{ .Release.Service }} spec: type: {{ .Values.service.type }} ports: - port: {{ .Values.service.port }} targetPort: http protocol: TCP name: http selector: app.kubernetes.io/name: {{ include "mychart.name" . }} app.kubernetes.io/instance: {{ .Release.Name }} .. container:: toggle .. container:: header Expression or macro template generated by 'helm create mychart' .. code:: javascript {{/* vim: set filetype=mustache: */}} {{/* Expand the name of the chart. */}} {{- define "mychart.name" -}} {{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}} {{- end -}} {{/* Create a default fully qualified app name. 
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec). If release name contains chart name it will be used as a full name. */}} {{- define "mychart.fullname" -}} {{- if .Values.fullnameOverride -}} {{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" -}} {{- else -}} {{- $name := default .Chart.Name .Values.nameOverride -}} {{- if contains $name .Release.Name -}} {{- .Release.Name | trunc 63 | trimSuffix "-" -}} {{- else -}} {{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}} {{- end -}} {{- end -}} {{- end -}} {{/* Create chart name and version as used by the chart label. */}} {{- define "mychart.chart" -}} {{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" -}} {{- end -}} | Tightly coupled resources may go in the same template file where they are logically linked or there is a form of dependency. An example of logically linked resources are PersistentVolume and PersistentVolumeClaim definitions. Keeping these together makes debugging and maintenance easier. .. _persistentvolume-and-persistentvolumeclaim-definition: .. container:: toggle .. container:: header PersistentVolume and PersistentVolumeClaim definitions .. code:: yaml --- kind: PersistentVolume apiVersion: v1 metadata: name: tangodb-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }} namespace: {{ .Release.Namespace }} labels: app.kubernetes.io/name: tangodb-{{ template "tango-chart-example.name" . }} app.kubernetes.io/instance: "{{ .Release.Name }}" app.kubernetes.io/managed-by: "{{ .Release.Service }}" helm.sh/chart: "{{ template "tango-chart-example.chart" . }}" spec: storageClassName: standard capacity: storage: 1Gi accessModes: - ReadWriteOnce hostPath: path: /data/tangodb-{{ template "tango-chart-example.name" . }}/ --- apiVersion: v1 kind: PersistentVolumeClaim metadata: name: tangodb-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }} namespace: {{ .Release.Namespace }} labels: app.kubernetes.io/name: tangodb-{{ template "tango-chart-example.name" . }} app.kubernetes.io/instance: "{{ .Release.Name }}" app.kubernetes.io/managed-by: "{{ .Release.Service }}" helm.sh/chart: "{{ template "tango-chart-example.chart" . }}" spec: storageClassName: standard accessModes: - ReadWriteOnce resources: requests: storage: 1Gi volumeName: tangodb-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }} An example of dependency is the declaration of a Service before the associated Pod/Deployment/StatefulSet/DaemonSet. The Pod will get the `environment variables `_ set from the Service as this will be evaluated by the Kubernetes API first as guaranteed by being in the same template file. .. container:: toggle .. container:: header Service before the associated Pod/Deployment .. code:: yaml --- apiVersion: v1 kind: Service metadata: name: tango-rest-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }} namespace: {{ .Release.Namespace }} labels: app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }} app.kubernetes.io/instance: "{{ .Release.Name }}" app.kubernetes.io/managed-by: "{{ .Release.Service }}" helm.sh/chart: "{{ template "tango-chart-example.chart" . }}" spec: type: ClusterIP ports: - name: rest port: 80 targetPort: rest protocol: TCP selector: app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . 
}} app.kubernetes.io/instance: "{{ .Release.Name }}" --- apiVersion: extensions/v1beta1 kind: Deployment metadata: name: tango-rest-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }} namespace: {{ .Release.Namespace }} labels: app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }} app.kubernetes.io/instance: "{{ .Release.Name }}" app.kubernetes.io/managed-by: "{{ .Release.Service }}" helm.sh/chart: "{{ template "tango-chart-example.chart" . }}" spec: replicas: {{ .Values.tangorest.replicas }} template: metadata: labels: app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }} app.kubernetes.io/instance: "{{ .Release.Name }}" app.kubernetes.io/managed-by: "{{ .Release.Service }}" helm.sh/chart: "{{ template "tango-chart-example.chart" . }}" spec: containers: - name: tango-rest image: "{{ .Values.tangorest.image.registry }}/{{ .Values.tangorest.image.image }}:{{ .Values.tangorest.image.tag }}" imagePullPolicy: {{ .Values.tangorest.image.pullPolicy }} command: - /usr/local/bin/wait-for-it.sh - databaseds-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}:10000 - --timeout=30 - --strict - -- - /usr/bin/supervisord - --configuration - /etc/supervisor/supervisord.conf env: - name: TANGO_HOST value: databaseds-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}:10000 ports: - name: rest containerPort: 8080 protocol: TCP restartPolicy: Always {{- with .Values.nodeSelector }} nodeSelector: {{ toYaml . | indent 8 }} {{- end }} {{- with .Values.affinity }} affinity: {{ toYaml . | indent 8 }} {{- end }} {{- with .Values.tolerations }} tolerations: {{ toYaml . | indent 8 }} {{- end }} .. note:: It may also be necessary to consider the alphabetic ordering of template files, if there is a declaration dependency wider than the immediate file, for instance when s ``Service`` definition and it's environment variables are necessary for multiple Deployment/StatefulSet/DaemonSet definitions. In this case, it maybe necessary to use a numerical file prefix such as 00-service-and-pod.yaml, 01-db-statefulset.yaml ... Use comments liberally in the template files to describe the intended purpose of the resource declarations and any other features of the template markup. ``#`` YAML comments get copied through to the rendered template output and are a valuable help when debugging template issues with ``helm template charts/chart-name/ ...`` . .. _managing-configuration: Managing configuration ~~~~~~~~~~~~~~~~~~~~~~ Helm charts and the Go templating engine enable separation of application management concerns along multiple lines: * resources are broken out into related and named templates. * Application specific configuration values are placed in ``ConfigMaps``. * volatile run time configuration values are placed in the ``values.yaml`` file, and then templated into ``ConfigMaps``, container commandline parameters or environment variables as required. * sensitive configuration is placed in ``Secrets``. * template content is programable (iterators and operators) and this can be parameterised at template rendering time. Variable names for template substitution should observe the following rules: * Use camel-case or lowercase variable names - never hyphenated. * Structure parameter values in shallow nested structures to make it easier to pass on the Helm command line eg: ``--set tangodb.db.connection.host=localhost`` is convoluted compared to ``--set tangodb.host=localhost``. 
* Use explicitly typed values eg: ``enabled: false`` is not ``enabled: "false"``. * Be careful of how YAML parsers coerce value types - long integers get coerced into scientific notation so if in doubt use strings and type casting eg: ``foo: "12345678"`` and ``{{ .Values.foo | int }}``. * use comments in the ``values.yaml`` liberally to describe the intended purpose of variables. Config in ConfigMaps ~~~~~~~~~~~~~~~~~~~~ ``ConfigMaps`` can be used to `populate `_ ``Pod`` configuration files, environment variables and command line parameters where the values are largely stable, and should not be bundled with the container itself. This should include any (small) data artefacts that could be different (hence configured) between different instances of the running containers. Even files that already exist inside a given container image can be overwritten by using the ``volumeMounts`` example below. .. container:: toggle .. container:: header ConfigMap values in Pods .. code:: yaml --- apiVersion: v1 kind: ConfigMap metadata: name: special-config namespace: default data: SPECIAL_LEVEL: very SPECIAL_TYPE: charming example.ini: |- property.1=value-1 property.2=value-2 property.3=value-3 --- apiVersion: v1 kind: Pod metadata: name: dapi-test-pod spec: containers: - name: test-container image: k8s.gcr.io/busybox # accessing ConfigMap values in the commandline fron env vars command: [ "/bin/sh", "-c", "echo $(SPECIAL_LEVEL_KEY) $(SPECIAL_TYPE_KEY); cat /etc/config/example.ini" ] env: # reference the map and key to assign to env var - name: SPECIAL_LEVEL_KEY valueFrom: configMapKeyRef: name: special-config key: SPECIAL_LEVEL - name: SPECIAL_TYPE_KEY valueFrom: configMapKeyRef: name: special-config key: SPECIAL_TYPE volumeMounts: # mount a ConfigMap file blob as a configuration file - name: config-volume mountPath: /etc/config/example.ini subPath: example.ini readOnly: true volumes: - name: config-volume configMap: # Provide the name of the ConfigMap containing the files you want # to add to the container name: special-config restartPolicy: Never # check the logs with kubectl logs dapi-test-pod # clean up with kubectl delete pod/dapi-test-pod configmap/special-config Where configuration objects are large or have a sensitive format, then separate these out from the ``configmaps.yaml`` file, and then include them using the template directive: ``tpl (.Files.Glob "configs/*").AsConfig . )`` where the ``configs/`` directory is relative to the ``charts/my-chart`` directory. .. container:: toggle .. container:: header ConfigMap file blobs separated .. code:: yaml --- apiVersion: v1 kind: ConfigMap metadata: name: config-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }} labels: app.kubernetes.io/name: config-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }} app.kubernetes.io/instance: "{{ .Release.Name }}" app.kubernetes.io/managed-by: "{{ .Release.Service }}" helm.sh/chart: "{{ template "tango-chart-example.chart" . }}" data: {{ (tpl (.Files.Glob "configs/*").AsConfig . ) | indent 2 }} Secrets ~~~~~~~ ``Secrets`` information is treated in almost exactly the same way as ``ConfigMaps``. While the default configuration (as at v1.14.x) is for ``Secrets`` to be stored as Base64 encoded in the etcd database, it is possible and expected that the Kubernetes cluster will be configured with `encryption at rest (available from v1.13) `_. All account details, passwords, tokens, keys and certificates should be extracted and managed using ``Secrets``. 
As was for ``ConfigMaps``, separate ``Secrets`` out into the ``secrets.yaml`` template. .. container:: toggle .. container:: header Secret values in Pods .. code:: yaml --- apiVersion: v1 kind: Secret metadata: name: mysecret type: Opaque stringData: username: myuser password: mypassword config.yaml: |- apiUrl: "https://my.api.com/api/v1" username: myuser password: mypassword --- apiVersion: v1 kind: Pod metadata: name: secret-env-pod spec: containers: - name: mycontainer image: k8s.gcr.io/busybox # accessing Secret values in the commandline fron env vars command: [ "/bin/sh", "-c", "echo $(SECRET_USERNAME) $(SECRET_PASSWORD); cat /etc/config/example.yaml" ] env: - name: SECRET_USERNAME valueFrom: secretKeyRef: name: mysecret key: username - name: SECRET_PASSWORD valueFrom: secretKeyRef: name: mysecret key: password volumeMounts: - name: foo mountPath: "/etc/config" volumes: - name: foo secret: secretName: mysecret items: - key: config.yaml path: example.yaml mode: 511 restartPolicy: Never # check the logs with kubectl logs secret-env-pod # clean up with kubectl delete pod/secret-env-pod secret/mysecret Where sensitive data objects are large or have a sensitive format, then separate these out from the ``secrets.yaml`` file, and then include them using the template directive: ``tpl (.Files.Glob "secrets/*").AsSecrets . )`` where the ``secrets/`` directory is relative to the ``charts/my-chart`` directory. .. container:: toggle .. container:: header Secret file blobs separated .. code:: yaml --- apiVersion: v1 kind: Secret metadata: name: secret-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }} labels: app.kubernetes.io/name: secret-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }} app.kubernetes.io/instance: "{{ .Release.Name }}" app.kubernetes.io/managed-by: "{{ .Release.Service }}" helm.sh/chart: "{{ template "tango-chart-example.chart" . }}" type: Opaque data: {{ (tpl (.Files.Glob "secrets/*").AsSecrets . ) | indent 2 }} Storage ~~~~~~~ ``PersistentVolumes`` and partner ``PersistentVolumeClaims`` should be defined by default in a separate template. This template should be bracketed with a switch to enable the storage declaration to be :ref:`turned off ` (eg: ``{{ if .Values.tangodb.createpv }}``), which will most likely be dependent on, and optimised for each environment. On the ``PersistentVolume``: * All storage should be treated as ephemeral by setting ``persistentVolumeReclaimPolicy: Delete``. * Explicitly set volume mode eg: ``volumeMode: Filesystem`` so that it is clear whether ``Filesystem`` or ``Block`` is being requested. * Explicitly set the access mode eg: ``ReadWriteOnce, ReadOnlyMany, or ReadWriteMany`` so that it is clear what access rights containers are expected to have. * always specify the storage class - this should always default to ``standard`` eg: ``storageClassName: standard`` given that the default target environment is Minikube. On the ``PersistentVolumeClaim``: * Always specify the matching storage class eg: ``storageClassName: standard``, so that it will bind to the intended ``PersistentVolume`` storage class. * Where possible, always specify an explicit ``PersistentVolume`` with ``volumeName`` eg: ``volumeName: tangodb-tango-chart-example-test``. This will force the ``PersistentVolumeClaim`` to bind to a specific ``PersistentVolume`` and storage class, avoiding the loosely binding issues that volumes can have. 
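A minimal sketch of such a switched ``PersistentVolume`` template is shown below; the ``tango-chart-example`` helper and the ``createpv`` value follow the naming used elsewhere on this page and are illustrative rather than a definitive implementation.

.. code:: yaml

   {{ if .Values.tangodb.createpv }}
   ---
   kind: PersistentVolume
   apiVersion: v1
   metadata:
     name: tangodb-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
   spec:
     storageClassName: standard               # always name the storage class explicitly
     persistentVolumeReclaimPolicy: Delete    # treat storage as ephemeral
     volumeMode: Filesystem                   # be explicit: Filesystem or Block
     accessModes:
       - ReadWriteOnce                        # be explicit about access rights
     capacity:
       storage: 1Gi
     hostPath:
       path: /data/tangodb-{{ template "tango-chart-example.name" . }}/
   {{ end }}

The matching ``PersistentVolumeClaim`` would sit in the same template file, bound to this volume through ``volumeName`` and the same ``storageClassName``.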
Storage In Kubernetes Clusters Managed by the Systems Team ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In any of the existing deployed Kubernetes clusters there are a number of default StorageClasses available that are backed by `Ceph `_, and integrated using `Rook `_. The ``StorageClasses`` expose ``RBD`` block devices and ``CephFS`` Network File System based storage to Kubernetes. The StorageClasses are as follows: +------------+----------------+--------------------------------------------+ | Classname | Maps to | Usage | +============+================+============================================+ | nfss1 | CephFS | Shared Network Filesystem - ReadWriteMany | +------------+----------------+--------------------------------------------+ | nfs | alias to nfss1 | Shared Network Filesystem - ReadWriteMany | +------------+----------------+--------------------------------------------+ | bds1 | RBD | Single concurrent use ext4 - ReadWriteOnce | +------------+----------------+--------------------------------------------+ | block | alias to bds1 | Single concurrent use ext4 - ReadWriteOnce | +------------+----------------+--------------------------------------------+ The StorageClass naming convention follows the pattern: ``[-]`` * xxx type - bd=block device, nfs=network filesystem * x class - s=standard, i=iops optimised (could be ssd/nvme), t=throughput optimised (could be hdd, or cheaper ssd) * n version - 1=first version,... * location - future tag for denoting location context, rack, dc, etc Current classes: * bds1 - block device - single mount (ReadWriteOnce) - standard - version 1 * nfss1 - network filesystem enabled storage (ReadWriteMany) - standard - version 1 * block = shortcut for bds1 * nfs = shortcut for nfss1 Tests ~~~~~ Helm Chart tests live in the ``templates/tests`` directory, and are essentially one ``Pod`` per file that must be run-to-completion (i.e. ``restartPolicy: Never``). These ``Pods`` are annotated in one of two ways: * ``"helm.sh/hook": test-success`` - ``Pod`` is expected to exit with return code ``0`` * ``"helm.sh/hook": test-failure`` - ``Pod`` is expected to exit with return code ``not equal 0`` This is a simple solution for test assertions at the ``Pod`` scale. As with any other resource definition, tests should have a name and metadata correctly scoping them. End the ``Pod`` name with a string that indicates what the test is, suffixed with ``-test``. Helm tests must be self-contained, and should be atomic and non-destructive, as the intention is that a chart user can use the tests to determine that the chart installed correctly. In the following example, the test checks that ``Pods`` can reach the DatabaseDS service. Other tests might check that services are correctly exposed via ``Ingress``. .. container:: toggle .. container:: header Helm Chart test Pod - metadata and annotations on a simple connection test .. code:: yaml --- apiVersion: v1 kind: Pod metadata: name: databaseds-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}-connection-test namespace: {{ .Release.Namespace }} labels: app.kubernetes.io/name: databaseds-{{ template "tango-chart-example.name" . }} app.kubernetes.io/instance: "{{ .Release.Name }}" app.kubernetes.io/managed-by: "{{ .Release.Service }}" helm.sh/chart: "{{ template "tango-chart-example.chart" . }}" annotations: "helm.sh/hook": test-success spec: {{- if .Values.pullSecrets }} imagePullSecrets: {{- range .Values.pullSecrets }} - name: {{ .
}} {{- end}} {{- end }} containers: - name: databaseds-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}-connection-test image: "{{ .Values.powersupply.image.registry }}/{{ .Values.powersupply.image.image }}:{{ .Values.powersupply.image.tag }}" imagePullPolicy: {{ .Values.powersupply.image.pullPolicy }} command: - sh args: - -c - "( retry --max=10 -- tango_admin --ping-device test/power_supply/1 ) && echo 'test OK'" env: - name: TANGO_HOST value: databaseds-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}:10000 restartPolicy: Never .. _integrating-a-chart-into-the-skampi-repo: Integrating a chart into the SKAMPI repo ---------------------------------------- .. admonition:: Prerequisites - Verify that Docker, kubectl, Minikube and Helm are installed and working properly - refer to :ref:`verify-k8s`. - The required docker images have been uploaded to and are available from the `Central Artefact Repository `_, see `docker upload instructions `__ To integrate a helm chart into the `SKAMPI` repo, follow these steps: Local steps ~~~~~~~~~~~ - Clone the `SKAMPI` repo, available `here `_. - Add a directory in `charts` with a descriptive name - Add your helm chart and associated files within that directory - Check the validity of the chart - Verify that the chart is formatted correctly .. code:: bash helm lint ./charts// - Verify that the templates are rendered correctly and the output is as expected .. code:: bash helm install --dry-run --debug ./charts// - For some debugging tips refer to: `debugging tips `_. - Check that your chart deploys locally (utilising minikube as per our standards) and behaves as expected .. code:: bash make deploy KUBE_NAMESPACE=integration make deploy KUBE_NAMESPACE=integration HELM_CHART= - Once functionality has been confirmed, go ahead and commit and push the changes Gitlab ~~~~~~ Once the changes have been pushed, a pipeline will be built in GitLab. Find the pipeline builds at `https://gitlab.com/ska-telescope/ska-skampi/pipelines `_. If the pipeline completes successfully, the full integration environment will be available in the selected deployment environment, which should be accessible from the pipeline, e.g. http://k8s.stfc.skao.int for STFC. Kubernetes primitives ===================== The following focuses on the core Kubernetes primitives - Pod, Service, and Ingress. These provide the core delivery chain of a networked application to the end consumer. The Pod ------- The ``Pod`` is the basic deployable application unit in Kubernetes, and provides the primary configurable context of an application component. Within this construct, all configuration and resources are plugged into the application. .. _patterns-and-lifecycle-hooks-examples: This is a complete example that demonstrates container patterns, initContainers and life-cycle hooks discussed in the following sections. .. container:: toggle .. container:: header Container patterns and life-cycle hooks ..
code:: yaml --- kind: Service apiVersion: v1 metadata: name: pod-examples spec: type: ClusterIP selector: app: pod-examples ports: - name: http protocol: TCP port: 80 targetPort: http --- apiVersion: extensions/v1beta1 kind: Deployment metadata: name: pod-examples labels: app: pod-examples spec: replicas: 1 template: metadata: labels: app: pod-examples spec: volumes: # lifecyle containers as hooks share state using volumes - name: shared-data emptyDir: {} - name: the-end hostPath: path: /tmp type: Directory initContainers: # initContainers can initialise data, and do pre-flight checks - name: init-container image: alpine command: ['sh', '-c', "echo 'initContainer says: hello!' > /pod-data/status.txt"] volumeMounts: - name: shared-data mountPath: /pod-data containers: # primary data generator container - name: main-app-container image: alpine command: ["/bin/sh"] args: ["-c", "while true; do echo 'Main app says: ' `date` >> /pod-data/status.txt; sleep 5;done"] lifecycle: # postStart hook is async task called on Pod boot # useful for async container warmup tasks that are not hard dependencies # definitely not guaranteed to run before main container command postStart: exec: command: ["/bin/sh", "-c", "echo 'Hello from the postStart handler' >> /pod-data/status.txt"] # preStop hook is async task called on Pod termination # useful for initiating termination cleanup tasks # definitely not guaranteed to complete before container termination (sig KILL) preStop: exec: command: ["/bin/sh", "-c", "echo 'Hello from the preStop handler' >> /the-end/last.txt"] volumeMounts: - name: shared-data mountPath: /pod-data - name: the-end mountPath: /the-end # Sidecar helper that exposes data over http - name: sidecar-nginx-container image: nginx ports: - name: http containerPort: 80 protocol: TCP volumeMounts: - name: shared-data mountPath: /usr/share/nginx/html livenessProbe: httpGet: path: /index.html port: http readinessProbe: httpGet: path: /index.html port: http # Ambassador pattern used as a proxy or shim to access external inputs # gets date from Google and adds it to input - name: ambassador-container image: alpine command: ["/bin/sh"] args: ["-c", "while true; do echo 'Ambassador says: '`wget -S -q 'https://google.com/' 2>&1 | grep -i '^ Date:' | head -1 | sed 's/^ [Dd]ate: //g'` > /pod-data/input.txt; sleep 60; done"] volumeMounts: - name: shared-data mountPath: /pod-data # Adapter pattern used as a proxy or shim to generate/render outputs # fit for external consumption (similar to Sidecar) # reformats input data from sidecar and ambassador ready for output - name: adapter-container image: alpine command: ["/bin/sh"] args: ["-c", "while true; do cat /pod-data/status.txt | head -3 > /pod-data/index.html; cat /pod-data/input.txt | head -1 >> /pod-data/index.html; cat /pod-data/status.txt | tail -1 >> /pod-data/index.html; echo 'All from your friendly Adapter' >> /pod-data/index.html; sleep 5; done"] volumeMounts: - name: shared-data mountPath: /pod-data This will produce output that demonstrates each of the containers fulfilling their role: .. code:: bash $ curl http://`kubectl get service/pod-examples -o jsonpath="{.spec.clusterIP}"` initContainer says: hello! 
Main app says: Thu May 2 03:45:42 UTC 2019 Hello from the postStart handler Ambassador says: Thu, 02 May 2019 03:45:55 GMT Main app says: Thu May 2 03:46:12 UTC 2019 All from your friendly Adapter $ kubectl delete deployment/pod-examples service/pod-examples deployment.extensions "pod-examples" deleted service "pod-examples" deleted piers@wattle:~$ cat /tmp/last.txt Hello from the preStop handler Container patterns ~~~~~~~~~~~~~~~~~~ The ``Pod`` is a cluster of one or more containers that share the same resource namespaces. This enables the containers in the Pod to communicate as though they are on the same host, which preserves the one-process-per-container ideal while still delivering a set of orchestrated processes as a single application whose parts can be separately maintained. All ``Pod`` deployments should be designed around having a core or leading container. All other containers in the ``Pod`` provide auxiliary or secondary services. There are three main patterns for multi-container ``Pods``: * Sidecar - extends the primary container functionality eg: adds logging, metrics, health checks (as input to livenessProbe/readinessProbe). * Ambassador - container that acts as an out-bound proxy for the primary container by handling translations to external services. * Adapter - container that acts as an in-bound proxy for the primary container aligning interfaces with alternative standards. initContainers ~~~~~~~~~~~~~~ Any serial container action that does not neatly fit into the one-process-per-container pattern should be placed in an ``initContainer``. These are typically actions like initialising databases, checking for upgrade processes, and executing migrations. ``initContainers`` are executed in order, and if any one of them fails, the ``Pod`` will be restarted in line with the ``restartPolicy``. With this behaviour, it is important to ensure that the ``initContainer`` actions are idempotent, or there will be harmful side effects on restarts. postStart/preStop ~~~~~~~~~~~~~~~~~ Life-cycle hooks have very few effective use cases as there is no guarantee that a ``postStart`` task will run before the main container command does (this is demonstrated above), and there is no guarantee that a ``preStop`` task (which is only issued when a Pod is terminated - not completed) will complete before the ``KILL`` signal is issued to the parent container after the configured termination grace period (default 30s). Lifecycle hooks are generally best reserved for: * ``postStart`` - running an asynchronous non-critical task in the parent container that would otherwise slow down the boot time for the ``Pod`` and impact service availability. * ``preStop`` - initiating asynchronous clean up tasks via an external service - essentially an opportunity to send a quick message out before the ``Pod`` is fully terminated. readinessProbe/livenessProbe ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Readiness probes are used by the kubelet to determine whether the container is in a state ready to serve requests. Liveness probes are used by the kubelet to determine whether the container continues to be in a healthy state for serving requests. Where possible, ``livenessProbe`` and ``readinessProbe`` should be specified. These are automatically used to calculate whether a ``Pod`` is available and healthy and whether it should be added and load balanced in a ``Service``. The main tuning parameters for both probe types are sketched below.
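Both probe types accept the same tuning fields (``initialDelaySeconds``, ``periodSeconds``, ``timeoutSeconds``, ``successThreshold``, ``failureThreshold``), and tuning these is usually where most of the value lies. The following is a minimal sketch, not part of the example above: the container name, port and ``/healthz`` path are illustrative assumptions for an application that serves its health check over HTTP.

.. code:: yaml

   ...
   containers:
   - name: example-rest-container      # hypothetical container serving an HTTP health endpoint
     ports:
     - name: http
       containerPort: 80
       protocol: TCP
     readinessProbe:
       httpGet:
         path: /healthz                # assumed combined health/metrics endpoint
         port: http
       initialDelaySeconds: 5          # wait before the first probe after the container starts
       periodSeconds: 10               # how often to probe
       timeoutSeconds: 1               # how long a single probe may take
       failureThreshold: 3             # consecutive failures before the Pod is marked NotReady
     livenessProbe:
       httpGet:
         path: /healthz
         port: http
       initialDelaySeconds: 15         # allow the application time to boot before liveness checks
       periodSeconds: 20
       failureThreshold: 3             # consecutive failures before the container is restarted

A failing ``readinessProbe`` only removes the ``Pod`` from ``Service`` load balancing, whereas a failing ``livenessProbe`` restarts the container, so the liveness settings should generally be the more conservative of the two.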
These features can play an important role in the continuity of service when clusters are auto-healed, workloads are shifted from node to node, or during rolling updates to deployments. The following shows the registered probes and their status for the :ref:`sidecar container in the examples above `: .. code:: bash $ kubectl describe deployment.apps/pod-examples ... sidecar-nginx-container: Image: nginx Port: 80/TCP Host Port: 0/TCP Liveness: http-get http://:http/index.html delay=0s timeout=1s period=10s #success=1 #failure=3 Readiness: http-get http://:http/index.html delay=0s timeout=1s period=10s #success=1 #failure=3 Environment: Mounts: /usr/share/nginx/html from shared-data (rw) ... While probes can be a `command `_, it is better to make health checks an http service that is combined with an application `metrics handler `_ so that external applications can use the same feature to do health checking (eg: `Prometheus `_, or `Icinga `_). Sharing, Networking, Devices, Host Resource Access ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sharing resources is often the bottle neck in High Performance Computing, and where the greatest attention to detail is required with containerised applications in order to gain acceptable performance and efficency. Containers within a ``Pod`` can share resources with each other directly using shared volumes, network, and memory. These are the preferred methods because they are cross-platform portable for containers in general, Kubernetes and OS/hardware. The following example demonstrates how to share memory as a volume between containers: .. container:: toggle .. container:: header Pod containers sharing memory .. code:: yaml --- kind: Service apiVersion: v1 metadata: name: pod-sharing-memory-examples labels: app: pod-sharing-memory-examples spec: type: ClusterIP selector: app: pod-sharing-memory-examples ports: - name: ncat protocol: TCP port: 5678 targetPort: ncat --- apiVersion: extensions/v1beta1 kind: Deployment metadata: name: pod-sharing-memory-examples labels: app: pod-sharing-memory-examples spec: replicas: 1 template: metadata: labels: app: pod-sharing-memory-examples spec: containers: # Producer - write to shared memory - name: producer-container image: python:3.7 command: ["/bin/sh"] args: ["-c", "python3 /src/mmapexample.py -p; sleep infinity"] volumeMounts: - name: src mountPath: /src/mmapexample.py subPath: mmapexample.py readOnly: true - mountPath: /dev/shm name: dshm # Consumer - read from shared memory and publish on 5678 - name: consumer-container image: python:3.7 command: ["/bin/sh"] # mutating container - this is bad practice but we need netcat for this example args: ["-c", "apt-get update; apt-get -y install netcat-openbsd; python3 -u /src/mmapexample.py | nc -l -k -p 5678; sleep infinity"] ports: - name: ncat containerPort: 5678 protocol: TCP volumeMounts: - name: src mountPath: /src/mmapexample.py subPath: mmapexample.py readOnly: true - mountPath: /dev/shm name: dshm volumes: - name: src configMap: name: pod-sharing-memory-examples - name: dshm emptyDir: medium: Memory # test with: # $ nc `kubectl get service/pod-sharing-memory-examples -o jsonpath="{.spec.clusterIP}"` 5678 # Producers says: 2019-05-05 19:21:10 # Producers says: 2019-05-05 19:21:11 # Producers says: 2019-05-05 19:21:12 # $ kubectl delete deployment,svc,configmap -l app=pod-sharing-memory-examples # deployment.extensions "pod-sharing-memory-examples" deleted # service "pod-sharing-memory-examples" deleted # configmap "pod-sharing-memory-examples" deleted # 
debug with: kubectl logs -l app=pod-sharing-memory-examples -c producer-container --- apiVersion: v1 kind: ConfigMap metadata: name: pod-sharing-memory-examples labels: app: pod-sharing-memory-examples data: mmapexample.py: |- #!/usr/bin/env python3 # -*- coding: utf-8 -*- """ example mmap python client """ import datetime import time import getopt import os import os.path import sys import logging from collections import namedtuple import mmap import signal def parse_opts(): """ Parse out the command line options """ options = { 'mqueue': "/example_shared_memory_queue", 'debug': False, 'producer': False } try: (opts, _) = getopt.getopt(sys.argv[1:], 'dpm:', ["debug", "producer" "mqueue="]) except getopt.GetoptError: print('mmapexample.py [-d -p -m ]') sys.exit(2) dopts = {} for (key, value) in opts: dopts[key] = value if '-p' in dopts: options['producer'] = True if '-m' in dopts: options['mqueue'] = dopts['-m'] if '-d' in dopts: options['debug'] = True # container class for options parameters option = namedtuple('option', options.keys()) return option(**options) # main def main(): """ Main """ options = parse_opts() # setup logging logging.basicConfig(level=(logging.DEBUG if options.debug else logging.INFO), format=('%(asctime)s [%(name)s] ' + '%(levelname)s: %(message)s')) logging.info('mqueue: %s mode: %s', options.mqueue, ('Producer' if options.producer else 'Consumer')) # trap the keyboard interrupt def signal_handler(signal_caught, frame): """ Catch the keyboard interrupt and gracefully exit """ logging.info('You pressed Ctrl+C!: %s/%s', signal_caught, frame) sys.exit(0) signal.signal(signal.SIGINT, signal_handler) mqueue_fd = os.open("/dev/shm/" + options.mqueue, os.O_RDWR | os.O_SYNC | os.O_CREAT) last = "" while True: try: if options.producer: now = datetime.datetime.now() data = "Producers says: %s\n" % \ (now.strftime("%Y-%m-%d %H:%M:%S")) logging.debug('sending out to mqueue: %s', data) os.ftruncate(mqueue_fd, 512) with mmap.mmap(mqueue_fd, 0) as mqueue: mqueue.seek(0) mqueue[0:len(data)] = data.encode('utf-8') mqueue.flush() else: with mmap.mmap(mqueue_fd, 0, access=mmap.ACCESS_READ) as mqueue: mqueue.seek(0) data = mqueue.readline().rstrip().decode('utf-8') logging.debug('from mqueue: %s', data) if data == last: logging.debug('same as last time - skipping') else: last = data sys.stdout.write(data+"\n") sys.stdout.flush() except Exception as ex: # pylint: disable=broad-except logging.debug('error: %s', repr(ex)) time.sleep(1) logging.info('Finished') sys.exit(0) # main if __name__ == "__main__": main() The following example demonstrates how to share memory over POSIX IPC between containers: .. container:: toggle .. container:: header Pod containers sharing memory over POSIX IPC .. 
code:: yaml --- kind: Service apiVersion: v1 metadata: name: pod-ipc-sharing-examples labels: app: pod-ipc-sharing-examples spec: type: ClusterIP selector: app: pod-ipc-sharing-examples ports: - name: ncat protocol: TCP port: 1234 targetPort: ncat --- apiVersion: extensions/v1beta1 kind: Deployment metadata: name: pod-ipc-sharing-examples labels: app: pod-ipc-sharing-examples spec: replicas: 1 template: metadata: labels: app: pod-ipc-sharing-examples spec: volumes: - name: shared-data emptyDir: {} initContainers: # get and build the ipc shmem tool - name: builder-container image: golang:1.11 command: ['sh', '-c', "export GOPATH=/src; go get gitlab.com/ghetzel/shmtool"] volumeMounts: - name: shared-data mountPath: /src containers: # Producer - name: producer-container image: alpine command: ["/bin/sh"] args: - "-c" - > apk add -U util-linux; mkdir /lib64 && ln -s /lib/libc.musl-x86_64.so.1 /lib64/ld-linux-x86-64.so.2; ipcmk --shmem 1KiB; echo "ipcmk again as chmtool cant handle 0 SHMID"; ipcmk --shmem 1KiB; > /pod-data/memaddr.txt; while true; do echo 'Main app (pod-ipc-sharing-examples) says: ' `date` | /pod-data/bin/shmtool open -s 1024 `ipcs -m | cut -d' ' -f 2 | sed '/^$/d' | tail -1`; sleep 1; done volumeMounts: - name: shared-data mountPath: /pod-data # Consumer - read from the pipe and publish on 1234 - name: consumer-container image: alpine command: ["/bin/sh"] args: - "-c" - > apk add --update coreutils util-linux; mkdir /lib64 && ln -s /lib/libc.musl-x86_64.so.1 /lib64/ld-linux-x86-64.so.2; sleep 3; (while true; do /pod-data/bin/shmtool read `ipcs -m | cut -d' ' -f 2 | sed '/^$/d' | tail -1`; sleep 1; done) | stdbuf -i0 nc -l -k -p 1234 ports: - name: ncat containerPort: 1234 protocol: TCP volumeMounts: - name: shared-data mountPath: /pod-data # test with: # $ nc `kubectl get service/pod-ipc-sharing-examples -o jsonpath="{.spec.clusterIP}"` 1234 # Main app (pod-ipc-sharing-examples) says: Tue May 7 20:46:03 UTC 2019 # Main app (pod-ipc-sharing-examples) says: Tue May 7 20:46:04 UTC 2019 # Main app (pod-ipc-sharing-examples) says: Tue May 7 20:46:05 UTC 2019 # $ kubectl delete deployment,svc -l app=pod-ipc-sharing-examples # deployment.extensions "pod-ipc-sharing-examples" deleted # service "pod-ipc-sharing-examples" deleted The following example demonstrates how to share over a named pipe between containers: .. container:: toggle .. container:: header Pod containers sharing over named pipe .. 
code:: yaml --- kind: Service apiVersion: v1 metadata: name: pod-sharing-examples labels: app: pod-sharing-examples spec: type: ClusterIP selector: app: pod-sharing-examples ports: - name: ncat protocol: TCP port: 1234 targetPort: ncat --- apiVersion: extensions/v1beta1 kind: Deployment metadata: name: pod-sharing-examples labels: app: pod-sharing-examples spec: replicas: 1 template: metadata: labels: app: pod-sharing-examples spec: volumes: # lifecyle containers as hooks share state using volumes - name: shared-data emptyDir: {} initContainers: # Setup the named pipe for inter-container communication - name: init-container image: alpine command: ['sh', '-c', "mkfifo /pod-data/piper"] volumeMounts: - name: shared-data mountPath: /pod-data containers: # Producer - name: producer-container image: alpine command: ["/bin/sh"] args: ["-c", "while true; do echo 'Main app (pod-sharing-examples) says: ' `date` >> /pod-data/piper; sleep 1;done"] volumeMounts: - name: shared-data mountPath: /pod-data # Consumer - read from the pipe and publish on 1234 - name: consumer-container image: alpine command: ["/bin/sh"] args: ["-c", "apk add --update coreutils; tail -f /pod-data/piper | stdbuf -i0 nc -l -k -p 1234"] ports: - name: ncat containerPort: 1234 protocol: TCP volumeMounts: - name: shared-data mountPath: /pod-data # test with: # $ nc `kubectl get service/pod-sharing-examples -o jsonpath="{.spec.clusterIP}"` 1234 # Main app says: Thu May 2 20:48:56 UTC 2019 # Main app says: Thu May 2 20:49:53 UTC 2019 # Main app says: Thu May 2 20:49:56 UTC 2019 # $ kubectl delete deployment,svc -l app=pod-sharing-examples # deployment.extensions "pod-sharing-examples" deleted # service "pod-sharing-examples" deleted The following example demonstrates how to share over the localhost network between containers: .. container:: toggle .. container:: header Pod containers sharing over localhost network .. code:: yaml --- kind: Service apiVersion: v1 metadata: name: pod-sharing-network-examples labels: app: pod-sharing-network-examples spec: type: ClusterIP selector: app: pod-sharing-network-examples ports: - name: ncat protocol: TCP port: 5678 targetPort: ncat --- apiVersion: extensions/v1beta1 kind: Deployment metadata: name: pod-sharing-network-examples labels: app: pod-sharing-network-examples spec: replicas: 1 template: metadata: labels: app: pod-sharing-network-examples spec: containers: # Producer - name: producer-container image: alpine command: ["/bin/sh"] args: ["-c", "apk add --update coreutils; (while true; do echo 'Main app (pod-sharing-network-examples) says: ' `date`; sleep 1; done) | stdbuf -i0 nc -lk -p 1234"] # Consumer - read from the local port and publish on 5678 - name: consumer-container image: alpine command: ["/bin/sh"] args: ["-c", "apk add --update coreutils; nc localhost 1234 | stdbuf -i0 nc -l -k -p 5678"] ports: - name: ncat containerPort: 5678 protocol: TCP # test with: # $ nc `kubectl get service/pod-sharing-network-examples -o jsonpath="{.spec.clusterIP}"` 5678 # Main app says: Thu May 2 20:48:56 UTC 2019 # Main app says: Thu May 2 20:49:53 UTC 2019 # Main app says: Thu May 2 20:49:56 UTC 2019 # $ kubectl delete deployment,svc -l app=pod-sharing-network-examples # deployment.extensions "pod-sharing-network-examples" deleted # service "pod-sharing-network-examples" deleted Performance driven networking requirements are a concern with HPC. Often the solution is to bind an application directly to a specific host network adapter. 
Historically, the solution for this in containers has been to escalate the privileges of the container so that it is running in the host namespace, and this is achieved in Kubernetes using the following approach: .. code:: yaml ... spec: containers: - name: my-privileged-container securityContext: privileged: true ... This **SHOULD** be avoided at all costs. This pushes the container into the host namespace for processes, network and storage. A critical side effect of this is that any port that the container consumes can conflict with host services, and will mean that **ONLY** a single instance of this container can run on any given host. Outside of these functional concerns, it is a serious security risk as the privileged container has full (root) access to the node including any applications (and containers) running there. To date, the only valid exceptions discovered have been: * Core daemon services running for the Kubernetes and OpenStack control plane that are deployed as containers but are node level services. * Storage, Network, or Device Kubernetes plugins that need to deploy OS kernel drivers. As a first step in resolving a networking issue, the Kubernetes and Platform management team should always be approached to help resolve the architectural issue and avoid this approach. If the requirement cannot be reconciled, then the following ``hostNetwork`` solution should be attempted first (note that ``hostNetwork`` is a Pod-level ``spec`` field, not a ``securityContext`` setting): .. code:: yaml ... spec: hostNetwork: true containers: - name: my-hostnetwork-container Use of Services --------------- ``Service`` resources should be defined in the same template file as the associated application deployment and ordered at the top. This will ensure that service related environment variables will be passed into the deployment at scheduling time. It is good practice to only have a single ``Service`` resource per deployment that covers the port mapping/exposure for each application port. It is also important to only have one deployment per ``Service``, as mapping a ``Service`` to more than one application makes debugging considerably harder. As part of this, ensure that the ``selector`` definition is specific to the fully qualified deployment including release and version to prevent leakage across multiple deployment versions. Fully qualify port definitions with ``name``, ``port``, ``protocol`` and ``targetPort`` so that the interface is self documenting. Using the same name for ``targetPort`` as for ``name`` is encouraged as this can give useful hints as to the function of the container interface. .. container:: toggle .. container:: header Service resource with fully qualified port description and specific selector .. code:: yaml --- apiVersion: v1 kind: Service metadata: name: tango-rest-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }} namespace: {{ .Release.Namespace }} labels: app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }} app.kubernetes.io/instance: "{{ .Release.Name }}" app.kubernetes.io/managed-by: "{{ .Release.Service }}" helm.sh/chart: "{{ template "tango-chart-example.chart" . }}" spec: type: ClusterIP ports: - name: rest protocol: TCP port: 80 targetPort: rest selector: app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }} app.kubernetes.io/instance: "{{ .Release.Name }}" ``type: ClusterIP`` is the default and should almost always be used and declared. A quick way of confirming that the ``selector`` resolves to the Pods of exactly one deployment is sketched below.
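It can be worth verifying that a ``Service`` selector resolves to the Pods of a single deployment by comparing the Service endpoints with the Pods carrying the selector labels. The commands below are a sketch only; the names assume the ``tango-chart-example`` chart above rendered with a hypothetical release name of ``test``.

.. code:: bash

   # the Pod IPs currently selected by the Service
   kubectl get endpoints tango-rest-tango-chart-example-test -o wide

   # the Pods matching the labels used in the Service selector -
   # these should line up one-to-one with the endpoints above, from a single deployment
   kubectl get pods -l app.kubernetes.io/name=tango-rest-tango-chart-example,app.kubernetes.io/instance=test -o wide

   # the Endpoints section of the describe output should contain no unexpected addresses
   kubectl describe service tango-rest-tango-chart-example-test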
``NodePort`` should only be used under exceptional circumstances as it will reserve a fixed port on the underlying node using up the limited node port address range resource. Only expose ports that are actually needed external to the deployment. This will help reduce clutter and reduce the surface area for attack on an application. Use of Ingress -------------- A Helm chart represents an application to be deployed, so it follows that it is best practice to have a single ``Ingress`` resource per chart. This represents the single frontend for an application that exposes it to the outside world (relative to the Kubernetes cluster). If a chart seemingly requires multiple hostnames and/or has services that want to inhabit the same port or URI space, then consider splitting this into multiple charts so that the component application can be published independently. It is useful to parameterise the control of SSL/TLS configuration so that this can be opted in to in various deployment strategies (as below). .. container:: toggle .. container:: header One Ingress per chart with TLS parameterised .. code:: yaml --- apiVersion: extensions/v1beta1 kind: Ingress metadata: name: rest-api-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }} labels: app.kubernetes.io/name: rest-{{ template "tango-chart-example.name" . }} app.kubernetes.io/instance: "{{ .Release.Name }}" app.kubernetes.io/managed-by: "{{ .Release.Service }}" helm.sh/chart: "{{ template "tango-chart-example.chart" . }}" annotations: {{- range $key, $value := .Values.ingress.annotations }} {{ $key }}: {{ $value | quote }} {{- end }} spec: rules: - host: {{ .Values.ingress.hostname }} http: paths: - path: / backend: serviceName: tango-rest-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }} servicePort: 80 {{- if .Values.ingress.tls.enabled }} tls: - secretName: {{ tpl .Values.ingress.tls.secretname . }} hosts: - {{ tpl .Values.ingress.hostname . }} {{- end -}} Scheduling and running cloud native application suites ====================================================== Security -------- Security covers many things, but this section will focus on RBAC and network Policies. Roles ~~~~~ Kubernetes will implement `role based access control `_ which will be used to control external and internal user access to scheduling and consuming resources. While it is possible to create ``serviceAccounts`` to modify the privileges for a deployment, this should generally be avoided so that the access control profile of the deploying user can be inherited at launch time. Do not create ``ClusterRole`` and ``ClusterRoleBinding`` resources and/or allocate these to ``ServiceAccounts`` used in a deployment as these have extended system wide access rights. ``Role`` and ``RoleBinding`` are scoped to the deployment ``Namespace`` so limit the scope for damage. Pod Security Policies ~~~~~~~~~~~~~~~~~~~~~ Pod Security Policies will affect what can be requested in the `securityContext `_ section. It should be assumed that Kubernetes clusters will run restrictive `Pod security policies `_, so it should be expected that: * ``Pods`` do not need to access resources outside the current ``Namespace``. * ``Pods`` do not run as ``privileged: true`` and will not have privilege escalation. * ``hostNetwork`` activation will require discussion with operations. * ``hostIPC`` will be unavailable. * ``hostPID`` will be unavailable. * Containers should run as a non-root user. * host ports will be restricted. * host paths will be restricted (``hostPath`` mounts). 
* it may be required to have a read only root filesystem (layer in container). * `Capabilities `_ may be dropped and a restricted list put in place to determine what can be added. * it should be expected that the ``default`` service account credentials will **NOT** be mounted into the running containers - applications should rarely need to query the Kubernetes API, so access will be removed by default. In general, only system level deployments such as Kubernetes control plane components (eg: admission controllers, device drivers, Operators, etc.) should have cluster level rights. Network Policies ~~~~~~~~~~~~~~~~ Explicit `Network Policies `_ are encouraged to restrict unintended access across deployments, and to secure applications from some forms of intrusion. The following restricts access to the deployed TangoDB to only the DatabaseDS application. .. container:: toggle .. container:: header NetworkPolicy restricting TangoDB access to the DatabaseDS .. code:: yaml --- apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: tangodb-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}-network-policy spec: podSelector: matchLabels: app.kubernetes.io/name: tangodb-{{ template "tango-chart-example.name" . }} app.kubernetes.io/instance: "{{ .Release.Name }}" policyTypes: - Ingress - Egress ingress: - from: - podSelector: # enable the DatabaseDS interface matchLabels: app.kubernetes.io/name: databaseds-{{ template "tango-chart-example.name" . }} app.kubernetes.io/instance: "{{ .Release.Name }}" ports: - protocol: TCP port: 10000 egress: - to: # anywhere in the standard Pod Network address range to all ports - ipBlock: cidr: 10.0.0.0/16 Images, Tags, and pullPolicy ---------------------------- Only use images from trusted sources. In most cases this should be only from the `official SKA Central Artefact Repository `_, with a few exceptions such as the core vendor supported images for key services such as `MySQL `_. It is anticipated that in the future the SKA will host mirrors and/or pull-through caches for key external software components, and will then firewall off access to external repositories that are not explicitly trusted. As a general rule, stable image tags should be used that at least include the Major and Minor version number of `Semantic Versioning `_ eg: ``mysql:5.7``. As curated images come from trusted sources, this ensures that the deployment process gets a functionally stable starting point that will still accrue bug fixing and security patching over time. Do **NOT** use the ``latest`` tag as it is likely that this will break your application in the future as it gives no way of guaranteeing feature parity and stability. In Helm Charts, it is good practice to parameterise the registry, image and tag of each container so that these can be varied in different environment deployments by changing ``values``. Also parameterise the ``pullPolicy`` so that communication with the registry at container boot time can be easily turned on and off. .. code:: yaml ... containers: - name: tangodb image: "{{ .Values.tangodb.image.registry }}/{{ .Values.tangodb.image.image }}:{{ .Values.tangodb.image.tag }}" imagePullPolicy: {{ .Values.tangodb.image.pullPolicy }} Resource reservations and constraints ------------------------------------- Compute platform level `resources `_ encompass: * Memory. * CPU. * Plugin based devices. * `Extended resources `_ - configured node level logical resources (a sketch of how such a resource can be advertised on a node follows).
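Extended resources are not detected automatically: a cluster operator first advertises them as node capacity, and only then can containers request them. The following sketch uses the standard Kubernetes node status PATCH mechanism; the ``skatelescope.org/widget`` resource name is taken from the requests example below, while the node name and the count of 3 are illustrative assumptions.

.. code:: bash

   # open a local proxy to the Kubernetes API
   kubectl proxy &

   # advertise 3 units of the node level extended resource skatelescope.org/widget
   # ("~1" encodes the "/" in the resource name for the JSON Pointer path)
   curl --header "Content-Type: application/json-patch+json" \
     --request PATCH \
     --data '[{"op": "add", "path": "/status/capacity/skatelescope.org~1widget", "value": "3"}]' \
     http://localhost:8001/api/v1/nodes/k8s-minion-0/status

   # confirm the new capacity is visible to the scheduler
   kubectl describe node k8s-minion-0 | grep -A6 -i capacity

Containers can then request the resource in their ``resources`` section as shown below, and the scheduler will only place them on nodes with spare ``skatelescope.org/widget`` capacity.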
Resources can be either specified in terms of: * Limits - the maximum amount of resource a container is allowed to consume before it maybe restarted or evicted. * Requests - the amount of resource a container requires to be available before it will be scheduled. Limits and requests are specified at the individual container level: .. code:: yaml ... containers: - name: tango-device-thing resources: requests: cpu: 4000m # 4 cores memory: 512M # 0.5GB skatelescope.org/widget: 3 limits: cpu: 8000m # 8 cores memory: 1024M # 1GB Resource requirements should be explicitly set both in terms of requests and limits (not normally applicable to extended resources) as this can be used by the scheduler to determine load balancing policy, and to determine when an application is misbehaving. These parameters should be set as configured ``values.yaml`` parameters. Restarts -------- Containers should be designed to cleanly crash - the main process should exit on a fatal error (no internal restart). This then will ensure that the configured ``livenessProbe`` and ``readinessProbe`` function correctly and where necessary, remove the affected ``Pod`` from ``Services`` ensuring that there are no dead service connections. Logging ------- The SKA has adopted :doc:`/tools/logging-format` as the logging standard to be used by all SKA software. This should be considered a base line standard and will be decorated with additional data by an infrastructure wide integrated logging solution (eg: `ElasticStack `_). To ensure compliance with this, all containers must log to ``stdout/stderr`` and/or be configured to log to ``syslog``. Connection to ``syslog`` should be configurable using :ref:`standard container mechanisms ` such as mounted files (handled by ``ConfigMaps``) or environment variables. This will ensure that any deployed application can be automatically plugged into the infrastructure wide logging and monitoring solution. A simple way to achieve this is to use a logging client library that is dynamically configurable for output destination such as ``import logging`` for ``Python``. Metrics ------- Each ``Pod`` should have an application metrics handler that emits the `adopted container standard format `_. For efficency purposes this should be amalgamated with the ``livenessProbe`` and ``readinessProbe``. Scheduling ---------- Scheduling in Kubernetes enables the resources of the entire cluster to be allocated using a fine grained model. These resources can be partitioned according to user policies, namespaces, and quotas. The default scheduler is a comprehensive rules processing engine that should be able to satisfy most needs. The primary mechanism for routing incoming tasks to execution is by having a labelling system throughout the cluster that reflects the distribution profile of workloads and types of resources required, coupled with Node and Pod affinity/anti-affinity rules. These are applied like a sieve to the available resources that the Scheduler keeps track of to determine if resources are available and where the next Pod can be placed. Scheduling on Kubernetes behaves similarly to a force directed graph, in that the tensions between the interdependent rules form the pressures of the spring bars that influence relative placement across the cluster. When creating scheduling constraints, attempt to keep them as generic as possible. Concentrate on declaring rules related to the individual Helm chart and the current chart in relation to any dependent charts (`subcharts `_). 
Avoid coding in node specific requirements. Often it is more efficient to outsource the rules to the ``values.yaml`` file as they are almost guaranteed to change between environments. .. code:: yaml --- ... {{- with .Values.nodeSelector }} nodeSelector: {{ toYaml . | indent 8 }} {{- end }} {{- with .Values.affinity }} affinity: {{ toYaml . | indent 8 }} {{- end }} {{- with .Values.tolerations }} tolerations: {{ toYaml . | indent 8 }} {{- end }} ... Always remember that the Kubernetes API is `declarative `_ and expect that deployments will use the ``apply`` semantics of `kubectl `_, with the scheduler constantly trying to move the system towards the desired state as and when resources become available as well as in response to failures. This means that scheduling is not guaranteed, so any downstream depedencies must be able to cope with that (also a tenent of micro-services architecture). Examples of scheduling control patterns ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The below scheduling scenarios are run using the following conditions: * container replicas launched using a sleep command in busybox, defined in a StatefulSet. * Specific node. * Type of node. * Density - 1 per node, n per node. * Position next another Pod - specific Pod, or Pod type. * Soft and hard rules. * A four node cluster - master and three minions. * The nodes have been split into two groups: rack01 - k8s-master-0 and k8s-minion-0, and rack02 - k8s-minion-1, and k8s-minion-2. * The master node has the labels: node-role.kubernetes.io/headnode, and node-role.kubernetes.io/master. The aim is to demonstrate how the scheduler works, and how to configure for the common use cases. obs1 and obs2 - nodeAffinity ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use nodeSelector to force all 3 replicas onto ``rack: rack01`` for obs1-rack01 and ``rack02`` for obs2-rack02: .. container:: toggle .. container:: header node select rack01 for obs1-rack01 and rack02 for obs2-rack02 .. code:: yaml --- apiVersion: apps/v1 kind: StatefulSet metadata: name: obs1-rack01 labels: group: scheduling-examples app: obs1 spec: replicas: 3 selector: matchLabels: app: obs1 serviceName: obs1 template: metadata: labels: group: scheduling-examples app: obs1 annotations: description: node select rack01 spec: containers: - image: busybox:1.28.3 name: obs1-rack01 command: ["sleep", "365d"] nodeSelector: rack: rack01 --- apiVersion: apps/v1 kind: StatefulSet metadata: name: obs2-rack02 labels: group: scheduling-examples app: obs2 spec: replicas: 3 selector: matchLabels: app: obs2 serviceName: obs2 template: metadata: labels: group: scheduling-examples app: obs2 annotations: description: node select rack02 spec: containers: - image: busybox:1.28.3 name: obs2-rack02 command: ["sleep", "365d"] nodeSelector: rack: rack02 Scenario obs1 - run 3 Pods on hosts allocated to rack01. Only nodes master-0, and minion-0 are used reflecting rack01. .. code:: bash NAME DESC STATUS NODE obs1-rack01-0 node select rack01 Running k8s-master-0 obs1-rack01-1 node select rack01 Running k8s-minion-0 obs1-rack01-2 node select rack01 Running k8s-master-0 and for Scenario obs2 - run 3 Pods on hosts allocated to rack02. Only minion-1 and minion-2 are used reflecting rack02. .. 
code:: bash NAME DESC STATUS NODE obs2-rack02-0 node select rack02 Running k8s-minion-2 obs2-rack02-1 node select rack02 Running k8s-minion-1 obs2-rack02-2 node select rack02 Running k8s-minion-2 obs3 - nodeAffinity exclussion ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use nodeAffinity ``operator: NotIn`` rules to exclude the master node from scheduling: .. container:: toggle .. container:: header nodeAffinity NotIn master .. code:: yaml --- apiVersion: apps/v1 kind: StatefulSet metadata: name: obs3-node-affinity-not-master labels: group: scheduling-examples app: obs3 spec: replicas: 4 selector: matchLabels: app: obs3 serviceName: obs3 template: metadata: labels: group: scheduling-examples app: obs3 annotations: description: nodeAffinity NotIn master spec: containers: - image: busybox:1.28.3 name: obs3-node-affinity-not-master command: ["sleep", "365d"] affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: node-role.kubernetes.io/master operator: NotIn values: - "" Scenario obs3 - run 4 Pods on any host so long as they are not labelled node-role.kubernetes.io/master. In this case minion-0 and minion-1 have been selected minion-2 could also have been used. .. code:: bash NAME DESC STATUS NODE obs3-node-affinity-not-master-0 nodeAffinity NotIn master Running k8s-minion-1 obs3-node-affinity-not-master-1 nodeAffinity NotIn master Running k8s-minion-0 obs3-node-affinity-not-master-2 nodeAffinity NotIn master Running k8s-minion-1 obs3-node-affinity-not-master-3 nodeAffinity NotIn master Running k8s-minion-0 obs4 - nodeAntiAffinity ~~~~~~~~~~~~~~~~~~~~~~~ Use podAffinity (hard requiredDuringSchedulingIgnoredDuringExecution) to position on the same node as obs1-rack01, and nodeAntiAffinity to (soft preferredDuringSchedulingIgnoredDuringExecution) exclude the node labelled 'node-role.kubernetes.io/headnode' from scheduling: .. container:: toggle .. container:: header podAffinity require obs1-rack01, nodeAntiAffinity prefer headnode .. code:: yaml --- apiVersion: apps/v1 kind: StatefulSet metadata: name: obs4-pod-affinity-obs1-pref-not-headnode labels: group: scheduling-examples app: obs4 spec: replicas: 5 selector: matchLabels: app: obs4 serviceName: obs4 template: metadata: labels: group: scheduling-examples app: obs4 annotations: description: podAffinity req obs1, nodeAntiAffinity pref headnode spec: containers: - image: busybox:1.28.3 name: obs4-pod-affinity-obs1-pref-not-headnode command: ["sleep", "365d"] affinity: podAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: app operator: In values: - obs1 topologyKey: kubernetes.io/hostname nodeAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 100 preference: matchExpressions: - key: node-role.kubernetes.io/headnode operator: NotIn values: - "" Scenario obs4 - run 5 Pods using required Pod Affinity with obs1 and preferred Node Anti Affinity with headnode (master label). Pods have been scheduled on minion-0 and master-0 as this is where obs1 is. This is further compounded by the anti affinity rule with headnode where only one replica is on master-0. .. 
code:: bash NAME DESC STATUS NODE obs4-pod-affinity-obs1-pref-not-headnode-0 podAffinity req obs1, nodeAntiAffinity pref headnode Running k8s-minion-0 obs4-pod-affinity-obs1-pref-not-headnode-1 podAffinity req obs1, nodeAntiAffinity pref headnode Running k8s-minion-0 obs4-pod-affinity-obs1-pref-not-headnode-2 podAffinity req obs1, nodeAntiAffinity pref headnode Running k8s-minion-0 obs4-pod-affinity-obs1-pref-not-headnode-3 podAffinity req obs1, nodeAntiAffinity pref headnode Running k8s-master-0 obs4-pod-affinity-obs1-pref-not-headnode-4 podAffinity req obs1, nodeAntiAffinity pref headnode Running k8s-minion-0 obs5 - podAntiAffinity ~~~~~~~~~~~~~~~~~~~~~~ Use podAntiAffinity (hard requiredDuringSchedulingIgnoredDuringExecution) to ensure only one instance of self per node (topologyKey: "kubernetes.io/hostname"), and podAffinity to require a position on the same node as obs3: .. container:: toggle .. container:: header podAntiAffinity require self and podAffinity require obs3 .. code:: yaml --- apiVersion: apps/v1 kind: StatefulSet metadata: name: obs5-pod-one-per-node-and-obs3 labels: group: scheduling-examples app: obs5 spec: replicas: 5 selector: matchLabels: app: obs5 serviceName: obs5 template: metadata: labels: group: scheduling-examples app: obs5 annotations: description: podAntiAffinity req self, podAffinity req obs3 spec: containers: - image: busybox:1.28.3 name: obs5-pod-one-per-node-and-obs3 command: ["sleep", "365d"] affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: app operator: In values: - obs5 topologyKey: "kubernetes.io/hostname" podAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: app operator: In values: - obs3 topologyKey: "kubernetes.io/hostname" Scenario obs5 - run 3 Pods using required Pod Anti Affinity with self (force schedule one per node) and required Pod Affinity with obs3. This has forced scheduling of one per node, and because obs3 is only running on two different nodes, the 3rd replica is in a constant state of Pending. Pod Affinity is described with a topology key of kubernetes.io/hostname, i.e. the node identifier. The topology key sets the scope for implementing the rule, so it could be a node, a group of nodes, an OS or device classification etc. .. code:: bash NAME DESC STATUS NODE obs5-pod-one-per-node-and-obs3-0 podAntiAffinity req self, podAffinity req obs3 Running k8s-minion-0 obs5-pod-one-per-node-and-obs3-1 podAntiAffinity req self, podAffinity req obs3 Running k8s-minion-1 obs5-pod-one-per-node-and-obs3-2 podAntiAffinity req self, podAffinity req obs3 Pending obs6 - Taint NoSchedule ~~~~~~~~~~~~~~~~~~~~~~~ First, the master node is `tainted `_ to disallow scheduling with ``kubectl cordon ``. Use nodeSelector to force all 3 replicas onto ``rack: rack01``; the taint prevents scheduling on the master, so the replicas are subsequently forced onto minion-0: .. container:: toggle .. container:: header node select rack01, but trapped by Taint NoSchedule ..
code:: yaml --- # kubectl taint nodes k8s-master-0 key1=value1:NoSchedule, or kubectl cordon k8s-master-0 apiVersion: apps/v1 kind: StatefulSet metadata: name: obs6-rack01-taint labels: group: scheduling-examples app: obs6 spec: replicas: 3 selector: matchLabels: app: obs6 serviceName: obs6 template: metadata: labels: group: scheduling-examples app: obs6 annotations: description: node select rack01, but trapped by Taint NoSchedule spec: containers: - image: busybox:1.28.3 name: obs6-rack01-taint command: ["sleep", "365d"] nodeSelector: rack: rack01 The resulting schedule is: .. code:: bash NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE obs6-rack01-taint-0 1/1 Running 0 32s 192.168.105.180 k8s-minion-0 obs6-rack01-taint-1 1/1 Running 0 31s 192.168.105.177 k8s-minion-0 obs6-rack01-taint-2 1/1 Running 0 29s 192.168.105.181 k8s-minion-0 For obs6, a StatefulSet that has nodeSelector: .. code:: yaml nodeSelector: rack: rack01 The result shows that of the two nodes (ks-master-0, and k8s-minion-0) in rack01, only k8s-minion-0 is available for these Pods. obs7 - add toleration ~~~~~~~~~~~~~~~~~~~~~~ Repeat obs6 as obs7 but add a toleration to the NoSchedule taint: .. container:: toggle .. container:: header node select rack01, with Toleration to Taint NoSchedule .. code:: yaml --- apiVersion: apps/v1 kind: StatefulSet metadata: name: obs7-rack01-taint-and-toleration labels: group: scheduling-examples app: obs7 spec: replicas: 3 selector: matchLabels: app: obs7 serviceName: obs7 template: metadata: labels: group: scheduling-examples app: obs7 annotations: description: node select rack01, with Tolleration to Taint NoSchedule spec: containers: - image: busybox:1.28.3 name: obs7-rack01-taint-and-toleration command: ["sleep", "365d"] nodeSelector: rack: rack01 tolerations: - key: "key1" operator: "Equal" value: "value1" effect: "NoSchedule" Now with the added a Toleration to the Taint, we have the following: .. code:: bash NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE obs7-rack01-taint-and-toleration-0 1/1 Running 0 33s 192.168.105.184 k8s-minion-0 obs7-rack01-taint-and-toleration-1 1/1 Running 0 32s 192.168.72.27 k8s-master-0 obs7-rack01-taint-and-toleration-2 1/1 Running 0 31s 192.168.105.182 k8s-minion-0 For a StatefulSet that has nodeSelector and Tolerations: .. code:: yaml nodeSelector: rack: rack01 tolerations: - key: "key1" operator: "Equal" value: "value1" effect: "NoSchedule" The result shows that the two nodes k8s-master-0, and k8s-minion-0 in rack01, are available for these Pods.
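The node labels, taints and tolerations used in these scenarios are all visible and manageable through kubectl, which is usually the quickest way to work out why a Pod stays Pending or why a node is being avoided. The following commands are a sketch that mirrors the ``rack01``/``rack02`` labels and the ``key1=value1:NoSchedule`` taint from the examples above.

.. code:: bash

   # show the rack labels used by the nodeSelector examples
   kubectl get nodes --show-labels

   # label (or re-label) a node into a rack group
   kubectl label nodes k8s-minion-0 rack=rack01 --overwrite

   # show any taints currently applied to the master node
   kubectl describe node k8s-master-0 | grep -i taint

   # apply and remove the NoSchedule taint used in the obs6/obs7 scenarios
   kubectl taint nodes k8s-master-0 key1=value1:NoSchedule
   kubectl taint nodes k8s-master-0 key1:NoSchedule-

   # cordon/uncordon is the simpler way to steer new workloads away from a node
   kubectl cordon k8s-master-0
   kubectl uncordon k8s-master-0

   # when a Pod stays Pending, the scheduler's reasoning is recorded in its events
   kubectl describe pod obs6-rack01-taint-0 | grep -A8 -i events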