Pipeline configuration file: pipelines.json
This JSON file contains the parameters for every pipeline used by the different types of scheduling blocks. Each root-level key is a pipeline name, and its value is a dictionary of pipeline parameter key-value pairs.
Example pipelines are defined in the example configuration file found in the repository. You can add additional pipelines as needed.
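As a rough illustration of the file layout, the sketch below parses a single pipeline entry and checks that the expected keys are present. The pipeline name `example_pipeline`, all numeric values, the `load_pipelines` helper, and the assumption that every parameter except `priority` is required are hypothetical and are not taken from the repository's example configuration file.

```python
import json

# Hypothetical pipelines.json content; the name and all values are illustrative only.
EXAMPLE_PIPELINES_JSON = """
{
  "example_pipeline": {
    "description": "Hypothetical pipeline used for illustration only",
    "node_hours_mean": 100.0,
    "node_hours_uncertainty": 0.2,
    "pct_parallelism_min": 90.0,
    "pct_parallelism_max": 99.0,
    "data_product_storage_gb": 500.0,
    "num_nodes": 4,
    "priority": 1
  }
}
"""

# Assumption: every parameter except the optional "priority" must be present.
REQUIRED_KEYS = {
    "description",
    "node_hours_mean",
    "node_hours_uncertainty",
    "pct_parallelism_min",
    "pct_parallelism_max",
    "data_product_storage_gb",
    "num_nodes",
}


def load_pipelines(text):
    """Parse pipelines JSON text and check each pipeline for missing keys."""
    pipelines = json.loads(text)
    for name, params in pipelines.items():
        missing = REQUIRED_KEYS - params.keys()
        if missing:
            raise ValueError(f"Pipeline '{name}' is missing keys: {sorted(missing)}")
    return pipelines


pipelines = load_pipelines(EXAMPLE_PIPELINES_JSON)
print(list(pipelines))
```

Adding another pipeline would then amount to adding a second root-level key with its own parameter dictionary.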
Parameters
description: Description of the pipeline
node_hours_mean: Standardised unit (Nh) of computation representing the compute capability of 1 Intel Ice Lake node for 1 hour
node_hours_uncertainty: Uncertainty as a fraction of node hours used for sampling in Monte Carlo simulations
pct_parallelism_min: Percentage parallelism of pipeline - represents how scalable the pipeline is. Used for setting lower limits in Monte Carlo simulations.
pct_parallelism_max: Maximum percentage parallelism of pipeline used for setting upper limits in Monte Carlo simulations
data_product_storage_gb: Storage (GB) required for intermediate data products
num_nodes: Number of compute nodes to allocate for this pipeline
priority: Optional priority of the pipeline - used to determine the order in which pipelines are scheduled. Lower numbers represent higher priority. Higher priority pipelines will be scheduled first. If none are specified, the default is to execute all pipelines in series for a given scheduling block instance. If any are specified these will be executed in order of priority, with the remaining pipelines executed in parallel after the last priority pipeline.
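The priority rule can be sketched as a small planning function. This is only one reading of the description above, not the scheduler's actual implementation: the function name `plan_execution` is hypothetical, and using file order for the "in series" case when no priorities are set is an assumption.

```python
def plan_execution(pipelines):
    """Split pipeline names into a serial stage and a parallel stage.

    Returns (serial, parallel): names to run one after another, followed by
    names that can run in parallel once the serial stage has finished.
    """
    prioritised = [
        name for name, params in pipelines.items()
        if params.get("priority") is not None
    ]
    if not prioritised:
        # No priorities given: run everything in series.
        # Assumption: the series order follows the order in the file.
        return list(pipelines), []
    # Lower numbers mean higher priority, so sort ascending.
    prioritised.sort(key=lambda name: pipelines[name]["priority"])
    remaining = [name for name in pipelines if name not in prioritised]
    return prioritised, remaining


# Hypothetical pipelines: only the "priority" key matters for this sketch.
example = {
    "imaging": {"priority": 2},
    "archiving": {},
    "calibration": {"priority": 1},
}
print(plan_execution(example))  # (['calibration', 'imaging'], ['archiving'])
```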
Note
Values for node_hours_mean, pct_parallelism_min and pct_parallelism_max provided in
the pipelines configuration file are rough estimates. Better estimates should be obtained from
benchmarking.
Note
Values for num_nodes will be ignored during the optimisation process as this is one of the
parameters that will be optimised.
Note
When running simulations without Monte Carlo iterations (default), the node_hours_mean and
pct_parallelism_min parameters are used directly as values for node_hours and
pct_parallelism and these will remain fixed throughout the simulation. When Monte Carlo
iterations are specified, these values will be sampled at each iteration – the
pct_parallelism parameter from a uniform distribution between pct_parallelism_min and
pct_parallelism_max and the node_hours parameter from a zero-truncated normal
distribution with mean node_hours_mean and standard deviation of node_hours_mean *
node_hours_uncertainty.
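The two modes described in this note can be sketched as follows. This is a minimal illustration, not the simulator's own code: the function names are hypothetical, and the zero-truncated normal is drawn here by simple rejection sampling, which is one of several ways to sample such a distribution.

```python
import random


def fixed_parameters(params):
    """Without Monte Carlo iterations: use the configured values directly."""
    return params["node_hours_mean"], params["pct_parallelism_min"]


def sample_parameters(params, rng):
    """With Monte Carlo iterations: draw node_hours and pct_parallelism."""
    mean = params["node_hours_mean"]
    if mean <= 0:
        # Guard so the rejection loop below cannot run forever.
        raise ValueError("node_hours_mean must be positive")
    std = mean * params["node_hours_uncertainty"]

    # Zero-truncated normal: redraw until the sample is strictly positive.
    node_hours = rng.gauss(mean, std)
    while node_hours <= 0:
        node_hours = rng.gauss(mean, std)

    # Uniform between the configured lower and upper parallelism limits.
    pct_parallelism = rng.uniform(
        params["pct_parallelism_min"], params["pct_parallelism_max"]
    )
    return node_hours, pct_parallelism


# Hypothetical parameter values, for illustration only.
params = {
    "node_hours_mean": 100.0,
    "node_hours_uncertainty": 0.2,
    "pct_parallelism_min": 90.0,
    "pct_parallelism_max": 99.0,
}
rng = random.Random(42)  # seeded so repeated runs give the same draws
for _ in range(3):
    print(sample_parameters(params, rng))
```

With `node_hours_uncertainty` set to 0 the standard deviation is 0, so every draw of `node_hours` equals `node_hours_mean`.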
Note
The data retention period for data products is set in the scheduling block types configuration file. This is the length of time that data products are kept in storage after the pipeline has finished processing. The data retention period is the same for all pipelines in the block.