Pipeline configuration file: ``pipelines.json``
===============================================

This JSON file contains the parameters for all pipelines used by the different types of scheduling blocks. Each root-level key is a pipeline name, and its value is a dictionary of pipeline parameter key-value pairs.

Example pipelines are defined in the example configuration file found in the repository. You can add additional pipelines as needed.

Parameters
----------

- ``description``: Description of the pipeline.
- ``node_hours_mean``: Mean compute requirement of the pipeline in node hours (Nh), a standardised unit of computation representing the compute capability of 1 Intel Ice Lake node for 1 hour.
- ``node_hours_uncertainty``: Uncertainty as a fraction of node hours, used for sampling in Monte Carlo simulations.
- ``pct_parallelism_min``: Minimum percentage parallelism of the pipeline, which represents how scalable the pipeline is. Used as the lower limit in Monte Carlo simulations.
- ``pct_parallelism_max``: Maximum percentage parallelism of the pipeline. Used as the upper limit in Monte Carlo simulations.
- ``data_product_storage_gb``: Storage (GB) required for intermediate data products.
- ``num_nodes``: Number of compute nodes to allocate for this pipeline.
- ``priority``: Optional priority of the pipeline, used to determine the order in which pipelines are scheduled. Lower numbers represent higher priority, and higher priority pipelines are scheduled first. If no priorities are specified, the default is to execute all pipelines in series for a given scheduling block instance. If any are specified, those pipelines are executed in order of priority, and the remaining pipelines are executed in parallel after the last priority pipeline.

.. note::
   Values for ``node_hours_mean``, ``pct_parallelism_min`` and ``pct_parallelism_max`` provided in the pipelines configuration file are rough estimates. Better estimates should be obtained from benchmarking.

.. note::
   Values for ``num_nodes`` are ignored during the optimisation process, because this is one of the parameters being optimised.

.. note::
   When running simulations without Monte Carlo iterations (the default), the ``node_hours_mean`` and ``pct_parallelism_min`` parameters are used directly as the values for ``node_hours`` and ``pct_parallelism``, and these remain fixed throughout the simulation. When Monte Carlo iterations are specified, both values are sampled at each iteration: ``pct_parallelism`` from a uniform distribution between ``pct_parallelism_min`` and ``pct_parallelism_max``, and ``node_hours`` from a zero-truncated normal distribution with mean ``node_hours_mean`` and standard deviation ``node_hours_mean * node_hours_uncertainty``.

.. note::
   The data retention period for data products is set in the scheduling block types configuration file (see :doc:`here `). This is the time that data products are kept in storage after the pipeline has finished processing. The data retention period is the same for all pipelines in the block.
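
The sampling rules described in the Monte Carlo note can be sketched as follows. This is a minimal illustration, not the project's implementation: the pipeline name ``example_pipeline``, every numeric value, and the helper ``sample_pipeline_params`` are hypothetical, and rejection sampling is only one possible way to obtain a zero-truncated normal distribution. Only the parameter keys are taken from the list above.

```python
import json
import random

# Hypothetical pipelines.json content. The pipeline name and all values are
# invented for illustration; only the parameter keys come from this page.
EXAMPLE_PIPELINES_JSON = """
{
  "example_pipeline": {
    "description": "Illustrative pipeline (values are not benchmarks)",
    "node_hours_mean": 100.0,
    "node_hours_uncertainty": 0.2,
    "pct_parallelism_min": 80.0,
    "pct_parallelism_max": 95.0,
    "data_product_storage_gb": 500.0,
    "num_nodes": 4,
    "priority": 1
  }
}
"""


def sample_pipeline_params(params, rng, monte_carlo=True):
    """Return (node_hours, pct_parallelism) for one simulation iteration.

    Hypothetical helper: it mirrors the rules in the note, not actual code
    from the repository.
    """
    mean = params["node_hours_mean"]
    if mean <= 0:
        raise ValueError("node_hours_mean must be positive")

    # Without Monte Carlo iterations the configured values are used as-is.
    if not monte_carlo:
        return mean, params["pct_parallelism_min"]

    # Standard deviation is the mean scaled by the fractional uncertainty.
    sigma = mean * params["node_hours_uncertainty"]

    # Zero-truncated normal via rejection sampling: redraw until positive.
    node_hours = rng.gauss(mean, sigma)
    while node_hours <= 0:
        node_hours = rng.gauss(mean, sigma)

    # Uniform draw between the configured parallelism limits.
    pct_parallelism = rng.uniform(
        params["pct_parallelism_min"], params["pct_parallelism_max"]
    )
    return node_hours, pct_parallelism


pipelines = json.loads(EXAMPLE_PIPELINES_JSON)
params = pipelines["example_pipeline"]
rng = random.Random(42)  # seeded so that repeated runs give the same draws

fixed = sample_pipeline_params(params, rng, monte_carlo=False)
samples = [sample_pipeline_params(params, rng) for _ in range(1000)]
```

Here ``fixed`` holds the values used when no Monte Carlo iterations are requested, while ``samples`` holds one draw per iteration. Passing the random generator in explicitly keeps the draws reproducible, which is convenient when comparing simulation runs.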