Pipeline configuration file: pipelines.json

This JSON file contains the parameters for all pipelines used by the different types of scheduling blocks. Each root-level key is a pipeline name, and its value is a dictionary of parameter key-value pairs for that pipeline.

Example pipelines are defined in the example configuration file found in the repository. You can add additional pipelines as needed.
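As a rough illustration of the file structure, the sketch below parses a small pipelines.json document and checks that each pipeline carries the parameters documented in the Parameters list. The pipeline names, all numeric values, the assumption that percentages are given on a 0-100 scale, and the helper load_pipelines are hypothetical and are not taken from the repository. Treating every parameter except priority as required is also an assumption.

```python
import json

# Hypothetical contents of pipelines.json. The pipeline names and every
# value below are invented for illustration only.
PIPELINES_JSON = """
{
  "example_imaging": {
    "description": "Illustrative imaging pipeline",
    "node_hours_mean": 120.0,
    "node_hours_uncertainty": 0.2,
    "pct_parallelism_min": 80.0,
    "pct_parallelism_max": 95.0,
    "data_product_storage_gb": 500.0,
    "num_nodes": 4,
    "priority": 1
  },
  "example_calibration": {
    "description": "Illustrative calibration pipeline",
    "node_hours_mean": 30.0,
    "node_hours_uncertainty": 0.1,
    "pct_parallelism_min": 60.0,
    "pct_parallelism_max": 90.0,
    "data_product_storage_gb": 50.0,
    "num_nodes": 2
  }
}
"""

# Assumption: every documented parameter except "priority" is required.
REQUIRED_KEYS = {
    "description",
    "node_hours_mean",
    "node_hours_uncertainty",
    "pct_parallelism_min",
    "pct_parallelism_max",
    "data_product_storage_gb",
    "num_nodes",
}


def load_pipelines(text):
    """Parse the JSON text and check each pipeline has the required keys."""
    pipelines = json.loads(text)
    for name, params in pipelines.items():
        missing = REQUIRED_KEYS - params.keys()
        if missing:
            raise ValueError(f"pipeline {name!r} is missing {sorted(missing)}")
    return pipelines


pipelines = load_pipelines(PIPELINES_JSON)
for name, params in pipelines.items():
    # "priority" is optional, so use .get() to tolerate its absence.
    print(name, params["node_hours_mean"], params.get("priority"))
```

Adding a pipeline then amounts to adding another root-level key with the same set of parameters.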

Parameters

  • description: Description of the pipeline

  • node_hours_mean: Mean compute cost of the pipeline in node hours (Nh), a standardised unit of computation in which 1 Nh represents the compute capability of 1 Intel Ice Lake node for 1 hour

  • node_hours_uncertainty: Uncertainty in the node hours, expressed as a fraction of node_hours_mean. Used to set the standard deviation when sampling node hours in Monte Carlo simulations.

  • pct_parallelism_min: Minimum percentage parallelism of the pipeline, which represents how scalable the pipeline is. Used as the lower limit when sampling in Monte Carlo simulations.

  • pct_parallelism_max: Maximum percentage parallelism of the pipeline. Used as the upper limit when sampling in Monte Carlo simulations.

  • data_product_storage_gb: Storage (GB) required for intermediate data products

  • num_nodes: Number of compute nodes to allocate for this pipeline

  • priority: Optional priority of the pipeline, used to determine the order in which pipelines are scheduled. Lower numbers represent higher priority, and higher-priority pipelines are scheduled first. If no pipeline specifies a priority, all pipelines for a given scheduling block instance are executed in series. If any pipelines specify a priority, these are executed in order of priority, and the remaining pipelines are executed in parallel after the last prioritised pipeline.
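The priority rules can be sketched as a function that groups pipelines into stages, where the stages run one after another and the pipelines within a stage run in parallel. This is only an illustration of the behaviour described for the priority parameter, not the scheduler's actual code. The function name execution_stages and the pipeline names are hypothetical, and keeping pipelines with equal priority in file order is an assumption, since tie handling is not documented here.

```python
def execution_stages(pipelines):
    """Group pipeline names into stages that are executed in sequence.

    Pipelines that share a stage are executed in parallel.
    """
    prioritised = [
        name for name, params in pipelines.items()
        if params.get("priority") is not None
    ]
    if not prioritised:
        # No priorities at all: every pipeline runs in series, in file order.
        return [[name] for name in pipelines]

    # Lower number means higher priority. sorted() is stable, so ties keep
    # their file order (an assumption, see the text above).
    ordered = sorted(prioritised, key=lambda name: pipelines[name]["priority"])
    stages = [[name] for name in ordered]

    # Pipelines without a priority run together after the last prioritised one.
    remaining = [
        name for name, params in pipelines.items()
        if params.get("priority") is None
    ]
    if remaining:
        stages.append(remaining)
    return stages


# Hypothetical pipelines; only the "priority" key matters for this sketch.
example = {
    "ingest": {"priority": 2},
    "quality_check": {},
    "calibrate": {"priority": 1},
    "archive": {},
}
print(execution_stages(example))
# [['calibrate'], ['ingest'], ['quality_check', 'archive']]
```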

Note

Values for node_hours_mean, pct_parallelism_min and pct_parallelism_max provided in the pipelines configuration file are rough estimates. Better estimates should be obtained from benchmarking.

Note

Values for num_nodes will be ignored during the optimisation process as this is one of the parameters that will be optimised.

Note

When running simulations without Monte Carlo iterations (the default), node_hours_mean and pct_parallelism_min are used directly as the values of node_hours and pct_parallelism, and these remain fixed throughout the simulation. When Monte Carlo iterations are specified, both values are resampled at each iteration: pct_parallelism from a uniform distribution between pct_parallelism_min and pct_parallelism_max, and node_hours from a zero-truncated normal distribution with mean node_hours_mean and standard deviation node_hours_mean * node_hours_uncertainty.
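A minimal sketch of this sampling scheme, using only the Python standard library, might look as follows. The function name sample_pipeline_parameters and the example values are hypothetical. The zero truncation is implemented here by redrawing any non-positive sample, which is one way to obtain a normal distribution truncated at zero; the simulator itself may use a different method.

```python
import random


def sample_pipeline_parameters(params, rng, monte_carlo=True):
    """Return (node_hours, pct_parallelism) for one simulation iteration."""
    mean = params["node_hours_mean"]
    if not monte_carlo:
        # Without Monte Carlo iterations the configured values are used as-is.
        return mean, params["pct_parallelism_min"]
    if mean <= 0:
        # Guard for this sketch: the redraw loop below assumes a positive mean.
        raise ValueError("node_hours_mean must be positive")

    # pct_parallelism: uniform between the configured minimum and maximum.
    pct_parallelism = rng.uniform(
        params["pct_parallelism_min"], params["pct_parallelism_max"]
    )

    # node_hours: normal with the documented mean and standard deviation,
    # truncated at zero by rejecting non-positive draws.
    sigma = mean * params["node_hours_uncertainty"]
    while True:
        node_hours = rng.gauss(mean, sigma)
        if node_hours > 0:
            return node_hours, pct_parallelism


# Hypothetical parameter values, for illustration only.
params = {
    "node_hours_mean": 120.0,
    "node_hours_uncertainty": 0.2,
    "pct_parallelism_min": 80.0,
    "pct_parallelism_max": 95.0,
}
rng = random.Random(42)  # seeded so that repeated runs give the same draws
for _ in range(3):
    print(sample_pipeline_parameters(params, rng))
```

Passing monte_carlo=False returns node_hours_mean and pct_parallelism_min unchanged, which mirrors the default, non-Monte Carlo behaviour.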

Note

The data retention period for data products is set in the scheduling block types configuration file (see here). It is the length of time that data products are kept in storage after the pipeline has finished processing, and it is the same for all pipelines in a scheduling block.