How to optimise a hardware configuration

This tutorial takes you through the steps required to optimise a hardware configuration using the SDP resource model. An overview of this functionality is provided in the flowchart below:

Optimise hardware configuration flowchart

Given an observing schedule and the scheduling block types and pipelines that it is made up of, the optimisation process determines the best hardware configuration for a specific hardware budget. This configuration specifies:

  • Total number of compute nodes

  • Total capacity storage in PB

  • Total performance storage in TB

  • Number of compute nodes allocated to each pipeline “group”

The optimisation step will split the provided hardware budget between compute nodes, capacity storage and performance storage to find the optimum hardware configuration. The cost of each of these components is currently hardcoded in the simulation.
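The trade-off between the three components can be illustrated with a minimal sketch. It assumes one possible parameterisation, in which a trial chooses the fraction of the budget spent on compute nodes and on capacity storage and the remainder buys performance storage. The unit costs and the split_budget helper are hypothetical placeholders, not the values or code hardcoded in the simulation.

```python
# Placeholder unit costs in EUR. These are assumptions for illustration only;
# the real costs are hardcoded in the simulation and are not reproduced here.
COST_PER_NODE_EUR = 20_000.0
COST_PER_PB_CAPACITY_EUR = 100_000.0
COST_PER_TB_PERFORMANCE_EUR = 500.0


def split_budget(budget_eur: float, frac_compute: float, frac_capacity: float) -> dict:
    """Convert a budget split into hardware quantities (hypothetical helper).

    frac_compute and frac_capacity are the fractions of the budget spent on
    compute nodes and capacity storage; whatever remains buys performance storage.
    """
    if min(frac_compute, frac_capacity) < 0.0 or frac_compute + frac_capacity > 1.0:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    frac_performance = 1.0 - frac_compute - frac_capacity
    return {
        # Nodes are whole units, so any remainder of the compute share is unspent.
        "num_nodes": int(budget_eur * frac_compute // COST_PER_NODE_EUR),
        "capacity_storage_pb": budget_eur * frac_capacity / COST_PER_PB_CAPACITY_EUR,
        "performance_storage_tb": budget_eur * frac_performance / COST_PER_TB_PERFORMANCE_EUR,
    }
```

With these placeholder costs, split_budget(8e6, 0.7, 0.2) gives 280 compute nodes and 16 PB of capacity storage, with the remaining 10% of the budget (about 1,600 TB) going to performance storage. The optimiser's job is to search over such splits for the one that performs best on the schedule.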

Pipelines are grouped automatically by the simulation using K-means clustering on their node_hours_mean, pct_parallelism_min and pct_parallelism_max parameters. The number of groups is the smaller of the number of pipelines and num_pipeline_groups (default 6). Because the number of compute nodes is then sampled once per group rather than once per pipeline, the optimiser has fewer parameters to explore, which allows a better search of the parameter space.
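The grouping step can be illustrated with a small, dependency-free K-means sketch. It is not the simulation's code: only the three clustering parameters and the rule that the number of groups is min(number of pipelines, num_pipeline_groups) come from this page. The min-max feature scaling, the deterministic farthest-point initialisation and the group_pipelines helper are assumptions made for the example.

```python
def group_pipelines(pipelines: dict, num_pipeline_groups: int = 6, max_iter: int = 100) -> dict:
    """Assign each pipeline to a group with K-means (illustrative sketch).

    pipelines maps a pipeline name to a dict holding node_hours_mean,
    pct_parallelism_min and pct_parallelism_max. Returns {name: group_index}.
    """
    if num_pipeline_groups < 1:
        raise ValueError("num_pipeline_groups must be greater than 0")
    names = list(pipelines)
    if not names:
        return {}
    k = min(num_pipeline_groups, len(names))  # never more groups than pipelines
    keys = ("node_hours_mean", "pct_parallelism_min", "pct_parallelism_max")
    raw = [[pipelines[name][key] for key in keys] for name in names]

    # Assumption: min-max scale each feature so node hours do not dominate.
    lows = [min(col) for col in zip(*raw)]
    highs = [max(col) for col in zip(*raw)]
    points = [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in raw
    ]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Assumption: deterministic farthest-point initialisation of the centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each pipeline joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the clustering has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the previous centroid if a group ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return dict(zip(names, labels))


example = {
    "pipeline_A": {"node_hours_mean": 1000.0, "pct_parallelism_min": 70.0, "pct_parallelism_max": 90.0},
    "pipeline_B": {"node_hours_mean": 500.0, "pct_parallelism_min": 40.0, "pct_parallelism_max": 60.0},
    "pipeline_C": {"node_hours_mean": 1100.0, "pct_parallelism_min": 75.0, "pct_parallelism_max": 85.0},
}
print(group_pipelines(example, num_pipeline_groups=2))
# {'pipeline_A': 0, 'pipeline_B': 1, 'pipeline_C': 0}
```

The two long, highly parallel pipelines share a group, so a single node count would be sampled for both. With the default num_pipeline_groups of 6, the three example pipelines would each get their own group.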

The optimisation is performed with the optuna library using its default sampler, TPESampler (Tree-structured Parzen Estimator). See the optuna documentation for more information.

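Before continuing, make sure you are familiar with the simulation’s inputs, which are described in the Defining an observing schedule file, Scheduling block types configuration file and Pipeline configuration file sections.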

Note

A hardware configuration file is not required for the optimisation.

Ensure you have followed the installation instructions and successfully run the simulation before continuing.

Requirements for the optimisation

To run the hardware optimisation, you will need the following inputs:

  1. An observing schedule you want to simulate (JSON).

    Its contents should look something like this (read more about this in the Defining an observing schedule file section):

    {
        "scheduling_blocks": [
            "Calibrated Visibilities",
            "Continuum Shallow",
            "Beam Monitoring",
            "Continuum Deep",
            "Continuum Shallow",
            "Continuum Deep",
            "Maintenance",
            "Spectral Deep"
        ]
    }
    

    You will need the path to this file to run the optimisation.

  2. Configuration of scheduling block types that make up your observing schedule (JSON).

    This should look something like this (read more about this in the Scheduling block types configuration file section):

    ...
    {
        "my_scheduling_block_type": {
            "description": "My scheduling block type",
            "short_name": "MST",
            "scheduling_block_instance_time_hrs": 1.0,
            "integration_time_hrs": 10.0,
            "pipeline_steps": ["pipeline_A", "pipeline_B"],
            "raw_vis_gb": 10.0,
            "processed_vis_gb": 20.0,
            "data_retention_hrs": 24.0,
        }
    }
    ...
    

    Every scheduling block type in the observing schedule file must be defined in this JSON file, with all of its key-value pairs.

    You will need the path to this file to run the optimisation.

  3. Configuration for each pipeline (JSON).

    This should look something like this for the scheduling block example above (read more about this in the Pipeline configuration file section):

    ...
    {
        "name": "pipeline_A",
        "steps": [
            {
                "description": "A data processing pipeline",
                "node_hours": 1000.0,
                "pct_parallelism": 80.0,
                "data_product_storage_gb": 1000.0,
                "num_nodes": 25
            }
        ]
    },
    {
        "name": "pipeline_B",
        "steps": [
            {
                "description": "Another data processing pipeline",
                "node_hours": 500.0,
                "pct_parallelism": 50.0,
                "data_product_storage_gb": 3000.0,
                "num_nodes": 50
            }
        ]
    }
    ...
    

    Every pipeline in the scheduling block types configuration file must be defined in this JSON file, with all of its key-value pairs.

    You will need the path to this file to run the optimisation.

    Note

    The num_nodes parameter for each pipeline is ignored when running the hardware optimisation, because the number of compute nodes allocated to each pipeline group is one of the parameters being optimised.

    Note

    If running Monte Carlo simulations, the pct_parallelism parameter will be sampled from a uniform distribution between pct_parallelism_min and pct_parallelism_max and the node_hours parameter will be sampled from a normal distribution with a mean of node_hours_mean and a standard deviation of node_hours_mean * node_hours_uncertainty.

    Note

    To avoid sampling parameters that are not needed, the pipeline configuration file should only contain the pipelines that are used in the scheduling block types configuration file.

  4. Your hardware budget (euros).

    The optimisation process determines the best split of your budget between compute nodes, capacity storage and performance storage. If you don’t supply your own budget, a default of 8 million euros is used.

    Note

    The costs of compute nodes, capacity storage and performance storage are currently hardcoded.

  5. The number of groups you want to split the pipelines into.

    The number of groups will be the smaller of the number of pipelines in the pipeline configuration file and the num_pipeline_groups parameter. The default value of num_pipeline_groups is 6.

  6. The number of trials you want to run.

    This is a key parameter in the optimisation process that determines how many times parameters are sampled and a simulation is run.

    Default is 10.

  7. The number of Monte Carlo iterations you want to run.

    This is an optional part of the optimisation process. If you set this to a value greater than 0, then for each pipeline the pct_parallelism parameter will be sampled from a uniform distribution between pct_parallelism_min and pct_parallelism_max, and the node_hours parameter will be sampled from a normal distribution with a mean of node_hours_mean and a standard deviation of node_hours_mean * node_hours_uncertainty. The mean simulation time across the iterations is then used as the result of each optimisation trial.

    Default is 0.

  8. Whether or not to shuffle the order of observations in the observing schedule.

    This is an optional part of the Monte Carlo simulation.

    Default is False.

  9. The storage URL of the database where the results will be stored.

    Default is “sqlite:///hardware-optimisation.db”.

    If this database already exists then the results will be appended to it. If it does not exist, a new database will be created.

  10. A study name for the experiment you are running.

    This is optional and will default to budget-{budget} where budget is the value you provided.

    Note

    If you do not provide paths to the configuration files, the default paths will be used.
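The Monte Carlo sampling described in items 3 and 7, together with the optional shuffle from item 8, can be sketched in plain Python. This is an illustration under stated assumptions, not the simulation's implementation: each pipeline is assumed to be a dict with node_hours_mean, node_hours_uncertainty, pct_parallelism_min and pct_parallelism_max keys, and sample_pipeline_parameters, mean_simulation_time and the run_simulation callable are hypothetical names. Each iteration draws fresh parameters for every pipeline and runs the stand-in simulation, and the mean time over all iterations is returned, mirroring the value used for each optimisation trial.

```python
import random
import statistics


def sample_pipeline_parameters(pipeline: dict, rng: random.Random) -> dict:
    """Draw one Monte Carlo realisation of a pipeline's uncertain parameters."""
    # Uniform draw between the minimum and maximum percentage parallelism.
    pct_parallelism = rng.uniform(pipeline["pct_parallelism_min"], pipeline["pct_parallelism_max"])
    mean = pipeline["node_hours_mean"]
    # Normal draw with standard deviation node_hours_mean * node_hours_uncertainty.
    # No truncation is applied here, so a large uncertainty can give negative values.
    node_hours = rng.gauss(mean, mean * pipeline["node_hours_uncertainty"])
    return {"pct_parallelism": pct_parallelism, "node_hours": node_hours}


def mean_simulation_time(run_simulation, scheduling_blocks, pipelines,
                         num_monte_carlo_iterations, shuffle=False, seed=None):
    """Average the time returned by run_simulation over Monte Carlo iterations.

    run_simulation(scheduling_blocks, sampled_pipelines) -> float is a
    hypothetical stand-in for one simulation run.
    """
    if num_monte_carlo_iterations < 1:
        raise ValueError("num_monte_carlo_iterations must be at least 1")
    rng = random.Random(seed)
    times = []
    for _ in range(num_monte_carlo_iterations):
        sampled = {name: sample_pipeline_parameters(p, rng) for name, p in pipelines.items()}
        order = list(scheduling_blocks)  # copy so the caller's schedule is untouched
        if shuffle:
            rng.shuffle(order)  # optional: randomise the order of observations
        times.append(run_simulation(order, sampled))
    return statistics.mean(times)


def toy_simulation(blocks, sampled):
    """Toy model for the usage example: total sampled node hours."""
    return sum(p["node_hours"] for p in sampled.values())


pipelines = {
    "pipeline_A": {"node_hours_mean": 1000.0, "node_hours_uncertainty": 0.1,
                   "pct_parallelism_min": 70.0, "pct_parallelism_max": 90.0},
    "pipeline_B": {"node_hours_mean": 500.0, "node_hours_uncertainty": 0.2,
                   "pct_parallelism_min": 40.0, "pct_parallelism_max": 60.0},
}
schedule = ["Continuum Shallow", "Continuum Deep", "Spectral Deep"]
print(mean_simulation_time(toy_simulation, schedule, pipelines, 10, shuffle=True, seed=1))
```

Passing a seed makes a run reproducible, which is convenient when comparing trials; the printed value is simply a noisy estimate close to the 1,500 node hours implied by the means.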

Running the optimisation

Once you have the required inputs above, you can run the hardware optimisation.

The optimisation can be run from the CLI, or programmatically as part of a Python script or Jupyter notebook.

Using the CLI

To use the CLI, run the optimisation locally using Poetry:

poetry run optimise_hardware \
--budget 8000000 \
--n_trials 100 \
--observing_schedule_path data/schedules/observation-schedule.json \
--scheduling_block_types_config_path data/config/scheduling_block_types.json \
--pipelines_config_path data/config/pipelines.json \
--storage sqlite:///hardware-optimisation.db \
--num_monte_carlo_iterations 10 \
--num_pipeline_groups 6 \
--shuffle_observations

This writes logs to the console and stores the results in the database given by --storage. Each time a new optimisation is run, its results are appended to the same database.

Get more details on each of the CLI arguments by running:

poetry run optimise_hardware --help

This will output the following help message:

usage: __main__.py [-h] [--budget BUDGET] [--n_trials N_TRIALS]
                   [--observing_schedule_path OBSERVING_SCHEDULE_PATH]
                   [--pipelines_config_path PIPELINES_CONFIG_PATH]
                   [--scheduling_block_types_config_path SCHEDULING_BLOCK_TYPES_CONFIG_PATH]
                   [--storage STORAGE] [--study_name STUDY_NAME]
                   [--num_monte_carlo_iterations NUM_MONTE_CARLO_ITERATIONS]
                   [--shuffle_observations] [--debug]
                   [--num_pipeline_groups NUM_PIPELINE_GROUPS]

Optimise the hardware configuration

options:
  -h, --help            show this help message and exit
  --budget BUDGET       Hardware budget in EUR.
  --n_trials N_TRIALS   Number of trials to run for the optimisation.
  --observing_schedule_path OBSERVING_SCHEDULE_PATH
                        Path to the observing schedule file.
  --pipelines_config_path PIPELINES_CONFIG_PATH
                        Path to the pipelines configuration file.
  --scheduling_block_types_config_path SCHEDULING_BLOCK_TYPES_CONFIG_PATH
                        Path to the scheduling block types file.
  --storage STORAGE     Name of the database for the optuna study. Results
                        will be appended if this already exists, otherwise a
                        new DB will be created.
  --study_name STUDY_NAME
                        Name of the optuna study. If not provided, the study
                        name will be named 'budget-{budget}' where{budget} is
                        taken from the --budget parameter.
  --num_monte_carlo_iterations NUM_MONTE_CARLO_ITERATIONS
                        Number of Monte Carlo iterations to run. If greater
                        than 0, the simulation will be run multiple times with
                        different values for pct_parallelism and node_hours
                        for each pipeline and the average time taken will be
                        used to optimise hardware.
  --shuffle_observations
                        Shuffle the observations list before running the
                        simulation.
  --debug               Set logging level to DEBUG for detailed logging.
  --num_pipeline_groups NUM_PIPELINE_GROUPS
                        Number of groups for clustering pipelines. Must be
                        greater than 0. Pipelines will be clustered into
                        groups and the number of nodes will be sampled for
                        each group. If num_pipeline_groups is greater than the
                        number of pipelines, the number of groups will be set
                        tothe number of pipelines.


Using the API

The following code snippet demonstrates how to run the optimisation in a Python script.

import json

from ska_sdp_resource_model.simulate.optimise import optimise_hardware

# Specify optimisation settings
n_trials = 1000  # Number of trials to run
num_monte_carlo_iterations = 10  # Number of Monte Carlo iterations to run
storage = "sqlite:///hardware-optimisation.db"  # Name of the database where the results will be stored

# Specify the optimisation inputs
budget = 8e6  # 8 million Euros budget
path_to_observing_schedule = "data/schedules/observing_schedule.json"
path_to_scheduling_block_types = "data/config/scheduling_block_types.json"
path_to_pipeline_config = "data/config/pipelines.json"

# Run the optimisation
study = optimise_hardware(
    budget=budget,
    n_trials=n_trials,
    observing_schedule_path=path_to_observing_schedule,
    scheduling_block_types_config_path=path_to_scheduling_block_types,
    pipelines_config_path=path_to_pipeline_config,
    storage=storage,
    num_monte_carlo_iterations=num_monte_carlo_iterations,
    shuffle=True,
)

# Print the best hardware configuration
print(
    f"Best hardware config:\n {json.dumps(study.best_trial.user_attrs['hardware_config'], indent=4)}"
)
print("Best compute node allocation:")
for pipeline, config in study.best_trial.user_attrs["pipelines_config"].items():
    print(f"   {pipeline}: {config['num_nodes']}")

# Get the results as a pandas DataFrame for further analysis
results = study.trials_dataframe()

Using a Jupyter notebook

We have included example Jupyter notebooks in the notebooks directory of the repository.

  • 02-optimise-hardware.ipynb demonstrates how to run the optimisation and visualise the results.

Visualising the results

To visualise the results of each optimisation trial, either during or after a run, run the following command (replacing the database URL as appropriate), then open the URL it prints to launch the optuna dashboard. The dashboard lets you interact with a variety of plots and compare different studies:

poetry run optuna-dashboard sqlite:///hardware-optimisation.db