Pipeline configuration file: ``pipelines.json``
===============================================

This JSON file contains the parameters for all pipelines used by the different types of scheduling
blocks. Each root-level key is a pipeline name, and its value is a dictionary of pipeline parameter
key-value pairs.

Example pipelines are defined in the example configuration file found in the repository. You can
add additional pipelines as needed.

Parameters
----------

- ``description``: Description of the pipeline
- ``node_hours_mean``: Mean compute cost of the pipeline in node hours (Nh), a standardised unit of
  computation representing the compute capability of 1 Intel Ice Lake node for 1 hour
- ``node_hours_uncertainty``: Uncertainty in node hours, expressed as a fraction of
  ``node_hours_mean``. Used for sampling in Monte Carlo simulations.
- ``pct_parallelism_min``: Minimum percentage parallelism of the pipeline, representing how scalable
  the pipeline is. Used as the lower limit when sampling in Monte Carlo simulations.
- ``pct_parallelism_max``: Maximum percentage parallelism of the pipeline. Used as the upper limit
  when sampling in Monte Carlo simulations.
- ``data_product_storage_gb``: Storage (GB) required for intermediate data products
- ``num_nodes``: Number of compute nodes to allocate for this pipeline
- ``priority``: Optional priority of the pipeline, used to determine the order in which pipelines
  are scheduled. Lower numbers represent higher priority, and higher-priority pipelines are
  scheduled first. If no priorities are specified, the default is to execute all pipelines in
  series for a given scheduling block instance. If any are specified, those pipelines are executed
  in order of priority, and the remaining pipelines are executed in parallel after the last
  prioritised pipeline.
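
As an illustration of the structure described above, the following minimal sketch builds a single
pipeline entry in Python, checks it, and prints it as JSON. The pipeline name ``example_pipeline``
and all values are hypothetical placeholders, not benchmarked figures. Treating every parameter
except ``priority`` as required, and expressing percentages on a 0 to 100 scale, are assumptions
made for this example. ``check_pipelines`` is a hypothetical helper and is not part of the package.

.. code-block:: python

    import json

    # Parameter names are taken from the list above. Treating all of them
    # (except "priority") as required is an assumption of this sketch.
    REQUIRED_KEYS = {
        "description",
        "node_hours_mean",
        "node_hours_uncertainty",
        "pct_parallelism_min",
        "pct_parallelism_max",
        "data_product_storage_gb",
        "num_nodes",
    }

    # Hypothetical pipeline name and placeholder values.
    pipelines = {
        "example_pipeline": {
            "description": "Hypothetical example pipeline",
            "node_hours_mean": 10.0,
            "node_hours_uncertainty": 0.2,
            "pct_parallelism_min": 80.0,  # assumed 0-100 scale
            "pct_parallelism_max": 95.0,
            "data_product_storage_gb": 500.0,
            "num_nodes": 4,
            "priority": 1,  # optional; lower number means higher priority
        }
    }


    def check_pipelines(config):
        """Return a list of human-readable problems found in the config."""
        problems = []
        for name, params in config.items():
            missing = REQUIRED_KEYS - params.keys()
            if missing:
                problems.append(f"{name}: missing {sorted(missing)}")
                continue
            if params["pct_parallelism_min"] > params["pct_parallelism_max"]:
                problems.append(
                    f"{name}: pct_parallelism_min exceeds pct_parallelism_max"
                )
        return problems


    print(json.dumps(pipelines, indent=2))
    print(check_pipelines(pipelines))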

.. note::

    Values for ``node_hours_mean``, ``pct_parallelism_min`` and ``pct_parallelism_max`` provided in
    the pipelines configuration file are rough estimates. Better estimates should be obtained from
    benchmarking.

.. note::

    Values for ``num_nodes`` will be ignored during the optimisation process as this is one of the
    parameters that will be optimised.

.. note::

    When running simulations without Monte Carlo iterations (the default), the ``node_hours_mean``
    and ``pct_parallelism_min`` parameters are used directly as the values of ``node_hours`` and
    ``pct_parallelism``, and these remain fixed throughout the simulation. When Monte Carlo
    iterations are specified, both values are sampled at each iteration: ``pct_parallelism`` from a
    uniform distribution between ``pct_parallelism_min`` and ``pct_parallelism_max``, and
    ``node_hours`` from a zero-truncated normal distribution with mean ``node_hours_mean`` and
    standard deviation ``node_hours_mean * node_hours_uncertainty``.
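
The sampling behaviour described in the note above can be sketched as follows. This is a minimal
illustration using only the Python standard library, not the simulator's actual implementation. The
function name ``sample_pipeline_params`` is hypothetical, and drawing the zero-truncated normal by
rejection sampling (redrawing until the value is positive) is an assumption about one reasonable
way to do it.

.. code-block:: python

    import random


    def sample_pipeline_params(params, rng, monte_carlo=True):
        """Return (node_hours, pct_parallelism) for one simulation iteration.

        Hypothetical helper: the name and signature are assumptions.
        """
        mean = params["node_hours_mean"]
        if mean <= 0:
            raise ValueError("node_hours_mean must be positive")

        if not monte_carlo:
            # Without Monte Carlo iterations the configured values are used as-is.
            return mean, params["pct_parallelism_min"]

        # pct_parallelism: uniform between the configured limits.
        pct_parallelism = rng.uniform(
            params["pct_parallelism_min"], params["pct_parallelism_max"]
        )

        # node_hours: normal with std = mean * uncertainty, truncated at zero.
        # Rejection sampling is an assumption of this sketch.
        std = mean * params["node_hours_uncertainty"]
        node_hours = rng.gauss(mean, std)
        while node_hours <= 0:
            node_hours = rng.gauss(mean, std)

        return node_hours, pct_parallelism


    # Placeholder values for illustration only.
    example = {
        "node_hours_mean": 10.0,
        "node_hours_uncertainty": 0.2,
        "pct_parallelism_min": 80.0,
        "pct_parallelism_max": 95.0,
    }
    rng = random.Random(42)  # seeded for reproducibility
    node_hours, pct_parallelism = sample_pipeline_params(example, rng)
    print(node_hours, pct_parallelism)

Passing a seeded ``random.Random`` instance keeps each run reproducible, which makes it easier to
compare results across simulations.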

.. note::

    The data retention period for data products is set in the scheduling block types configuration
    file (see :doc:`here </usage/inputs/configuration/scheduling_block_types_configuration>`). This
    is the time that data products are kept in storage after the pipeline has finished processing.
    The data retention period is the same for all pipelines in the block.