Compute Resource Requirements

Memory requirements

In the section Recombination parameters, the table lists the image sizes of the various possible inputs. The cumulative memory across all compute nodes should be at least the size of the image. The benchmark also uses queues and buffers of limited size; these queue sizes are configurable, so the available memory should be checked before altering them.
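
As a rough sanity check before launching a run or enlarging the queues, the cumulative memory of the cluster can be compared against the raw size of the image. The sketch below is illustrative only: the assumption of square images with double-precision complex pixels (16 bytes each) is ours and may not match the image representation actually used by the benchmark.

    # Illustrative sketch: compare the raw memory needed to hold an image
    # against the cumulative RAM of the cluster.  The 16 bytes/pixel
    # (double-precision complex) assumption is ours, not the benchmark's.

    def image_memory_gib(image_side_pixels, bytes_per_pixel=16):
        """Raw memory needed to hold a square image, in GiB."""
        return image_side_pixels ** 2 * bytes_per_pixel / 2 ** 30

    def cluster_memory_gib(num_nodes, ram_per_node_gib):
        """Cumulative RAM available across all compute nodes, in GiB."""
        return num_nodes * ram_per_node_gib

    # Example: a 96k image on the 30-node, 192 GB-per-node setup used here
    need = image_memory_gib(96 * 1024)
    have = cluster_memory_gib(30, 192)
    print(f"image needs ~{need:.0f} GiB, cluster offers {have:.0f} GiB")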

The following table gives the average and maximum memory used for different image sizes.

Antenna config | Image       | Avg. cumulative memory used (GB) | Max. cumulative memory used (GB)
lowbd2         | 16k-8k-512  |   82 |   90
lowbd2         | 32k-8k-1k   |  120 |  160
lowbd2         | 64k-16k-1k  |  193 |  276
lowbd2         | 96k-12k-1k  |  380 |  462
lowbd2         | 128k-32k-2k |  707 |  913
lowbd2         | 256k-32k-2k | 2626 | 2925
midr5          | 16k-8k-512  |   87 |   90
midr5          | 32k-8k-1k   |  115 |  156
midr5          | 64k-16k-1k  |  176 |  266
midr5          | 96k-12k-1k  |  360 |  440
midr5          | 128k-32k-2k |  560 |  914
midr5          | 256k-32k-2k | 2437 | 2612

These tests were run on 30 nodes, with the following hardware on each compute node:

  • 2 Intel(R) Xeon(R) Gold 6130 CPUs @ 2.10 GHz, 16 cores/CPU, with hyperthreading enabled

  • 192 GB RAM

  • 1 x 10 Gb Ethernet, 1 x 100 Gb Omni-Path

Each run follows the configuration given in SKA1 LOW and MID settings for the lowbd2 and midr5 settings. The number of facet workers and, consequently, the number of MPI processes are chosen according to the number of facets for each image, which can be found in Recombination parameters.
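
A minimal sketch of how process counts might be derived from the facet count is shown below. The one-worker-per-facet split and the separate pool of stream workers are assumptions made for illustration only; the actual decomposition follows Recombination parameters.

    # Hypothetical sketch: derive worker and MPI process counts from the
    # number of facets.  One facet worker per facet plus a separate pool of
    # stream workers is an assumption for illustration; the real mapping
    # is defined by the benchmark's Recombination parameters.

    def process_counts(num_facets, stream_workers):
        facet_workers = num_facets               # assume one worker per facet
        mpi_processes = facet_workers + stream_workers
        return facet_workers, mpi_processes

    # Example values only
    print(process_counts(num_facets=16, stream_workers=30))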

These numbers are only approximate, as they also include memory used by system resources, but they give an idea of the memory requirements for different image sizes. We used only dry runs, i.e. runs that do not write visibility data to disk, to obtain these numbers. For non-dry runs, the visibility queues require additional memory that should be taken into account. Note that, as stated in Recombination parameters, 256k-32k-2k is not suited to the lowbd2 configuration; its memory requirements are provided here for reference purposes only.

Run times

The runtimes obtained from both dry and non-dry benchmarking runs on JUWELS are shown below.

JUWELS runs

This gives reference run times of the benchmark code using 384 to 6144 cores. The actual run time is on the order of the mean stream time plus the MPI start-up and pre-configuration overheads. No severe load-balancing issues were observed for these runs. The runtime values above for the dry runs can be considered a good reference for running the prototype.
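
A minimal sketch of this runtime estimate follows. The overhead values are placeholders for illustration, not measured figures; substitute the start-up and configuration times observed on the target system.

    # Minimal sketch: wall-time estimate as mean stream time plus MPI
    # start-up and pre-configuration overheads.  The overhead values are
    # placeholders, not measurements.

    def estimated_runtime_s(mean_stream_time_s,
                            mpi_startup_s=60.0,           # placeholder
                            preconfig_overhead_s=120.0):  # placeholder
        return mean_stream_time_s + mpi_startup_s + preconfig_overhead_s

    print(estimated_runtime_s(mean_stream_time_s=3600.0))  # example: 1 h stream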

Note that the runtimes of non-dry runs depend heavily on the I/O bandwidth offered by the underlying parallel file system. Care should be taken when launching such runs, as they can overload the file system cluster. In the case shown, we obtained an I/O bandwidth of around 100 GB/s, and the prototype generated more than 32 TB of data. When running on clusters that offer lower throughput, the reservation time should be estimated accordingly, based on the amount of data the prototype will generate and the available I/O bandwidth. The approximate amounts of data generated for different configurations are presented in SKA1 LOW and MID settings. It is also worth noting that larger chunk sizes result in more data. For instance, for the configuration used in the JUWELS runs, we would expect around 17 TB of visibility data, but we ended up writing more than 32 TB because we used relatively large chunks of 1 MiB for these runs. The exact amount of data produced cannot be estimated a priori, but for a chunk size of 1 MiB, a factor of 2 seems to be a good estimate.
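
As a planning aid, a lower bound on the reservation time for a non-dry run can be sketched from the expected data volume and the available write bandwidth, scaling the nominal visibility size by a chunking factor (roughly 2 for the 1 MiB chunks used here). The helper names and the calculation below are our own illustration, not part of the benchmark.

    # Illustrative sketch: expected data volume and minimum write time for a
    # non-dry run.  The factor of 2 for 1 MiB chunks is the rule of thumb
    # quoted above, not an exact value.

    def expected_data_tb(nominal_visibility_tb, chunk_overhead_factor=2.0):
        """Data actually written, scaled up for chunking overhead."""
        return nominal_visibility_tb * chunk_overhead_factor

    def min_write_time_s(data_tb, io_bandwidth_gb_per_s):
        """Lower bound on write time, ignoring compute and contention."""
        return data_tb * 1000.0 / io_bandwidth_gb_per_s

    data = expected_data_tb(17.0)  # ~17 TB nominal -> ~34 TB written
    minutes = min_write_time_s(data, 100.0) / 60
    print(f"~{data:.0f} TB written, at least {minutes:.0f} min at 100 GB/s")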