Parameterisation

Recombination parameters

Recombination parameters are generally given as “image-facet-subgrid” sizes. The number of facets depends on the ratio of facet size to image size, and the total memory requirement depends on the image size. A parameter set is selected by passing it as an argument to --rec-set. The available options and their facet counts are as follows:

Image configs (--rec-set), grouped by the number of facets they produce:

4 facets:
  256k-256k-256, 192k-192k-256, 160k-160k-256, 128k-128k-256,
  96k-96k-256, 80k-80k-256, 64k-64k-256, 48k-48k-256,
  40k-40k-256, 32k-32k-256, 24k-24k-256, 20k-20k-256,
  16k-16k-256, 12k-12k-256, 8k-8k-256

9 facets:
  256k-128k-512, 192k-96k-512, 160k-80k-512, 128k-64k-512,
  96k-48k-512, 80k-40k-512, 64k-32k-512, 48k-24k-512,
  40k-20k-512, 32k-16k-512, 24k-12k-512, 20k-10k-512,
  16k-8k-512, 12k-6k-512, 8k-4k-512, 256k-192k-256,
  192k-144k-256, 160k-120k-256, 128k-96k-256, 96k-72k-256,
  80k-60k-256, 64k-48k-256, 48k-36k-256, 40k-30k-256,
  32k-24k-256, 24k-18k-256, 20k-15k-256, 16k-12k-256,
  12k-9k-256, 8k-6k-256, 256k-128k-256, 192k-96k-256,
  160k-80k-256, 128k-64k-256, 96k-48k-256, 80k-40k-256,
  64k-32k-256, 48k-24k-256, 40k-20k-256, 32k-16k-256,
  24k-12k-256, 20k-10k-256, 16k-8k-256, 12k-6k-256,
  8k-4k-256

16 facets:
  256k-96k-512, 192k-72k-512, 160k-60k-512, 128k-48k-512,
  96k-36k-512, 80k-30k-512, 64k-24k-512, 48k-18k-512,
  40k-15k-512, 32k-12k-512, 24k-9k-512, 20k-7680-512,
  16k-6k-512, 12k-4608-512, 8k-3k-512, 256k-96k-256,
  192k-72k-256, 160k-60k-256, 128k-48k-256, 96k-36k-256,
  80k-30k-256, 64k-24k-256, 48k-18k-256, 40k-15k-256,
  32k-12k-256, 24k-9k-256, 20k-7680-256, 16k-6k-256,
  12k-4608-256, 8k-3k-256

36 facets:
  192k-64k-768, 96k-32k-768, 48k-16k-768, 24k-8k-768,
  12k-4k-768, 256k-64k-1k, 192k-48k-1k, 160k-40k-1k,
  128k-32k-1k, 96k-24k-1k, 80k-20k-1k, 64k-16k-1k,
  48k-12k-1k, 40k-10k-1k, 32k-8k-1k, 24k-6k-1k,
  20k-5k-1k, 16k-4k-1k, 12k-3k-1k, 8k-2k-1k,
  192k-48k-768, 96k-24k-768, 48k-12k-768, 24k-6k-768,
  12k-3k-768, 256k-64k-256, 192k-48k-256, 160k-40k-256,
  128k-32k-256, 96k-24k-256, 80k-20k-256, 64k-16k-256,
  48k-12k-256, 40k-10k-256, 32k-8k-256, 24k-6k-256,
  20k-5k-256, 16k-4k-256, 12k-3k-256, 8k-2k-256

64 facets:
  256k-48k-1k, 192k-36k-1k, 160k-30k-1k, 128k-24k-1k,
  96k-18k-1k, 80k-15k-1k, 64k-12k-1k, 48k-9k-1k,
  40k-7680-1k, 32k-6k-1k, 24k-4608-1k, 20k-3840-1k,
  16k-3k-1k, 12k-2304-1k, 8k-1536-1k, 256k-64k-512,
  192k-48k-512, 160k-40k-512, 128k-32k-512, 96k-24k-512,
  80k-20k-512, 64k-16k-512, 48k-12k-512, 40k-10k-512,
  32k-8k-512, 24k-6k-512, 20k-5k-512, 16k-4k-512,
  12k-3k-512, 8k-2k-512, 256k-48k-512, 192k-36k-512,
  160k-30k-512, 128k-24k-512, 96k-18k-512, 80k-15k-512,
  64k-12k-512, 48k-9k-512, 40k-7680-512, 32k-6k-512,
  24k-4608-512, 20k-3840-512, 16k-3k-512, 12k-2304-512,
  8k-1536-512

81 facets:
  192k-32k-1536, 96k-16k-1536, 48k-8k-1536, 24k-4k-1536,
  12k-2k-1536, 192k-32k-768, 96k-16k-768, 48k-8k-768,
  24k-4k-768, 12k-2k-768

144 facets:
  256k-32k-2k, 192k-24k-2k, 160k-20k-2k, 128k-16k-2k,
  96k-12k-2k, 80k-10k-2k, 64k-8k-2k, 48k-6k-2k,
  40k-5k-2k, 32k-4k-2k, 24k-3k-2k, 20k-2560-2k,
  16k-2k-2k, 12k-1536-2k, 8k-1k-2k, 192k-24k-1536,
  96k-12k-1536, 48k-6k-1536, 24k-3k-1536, 12k-1536-1536,
  256k-32k-512, 192k-24k-512, 160k-20k-512, 128k-16k-512,
  96k-12k-512, 80k-10k-512, 64k-8k-512, 48k-6k-512,
  40k-5k-512, 32k-4k-512, 24k-3k-512, 20k-2560-512,
  16k-2k-512, 12k-1536-512, 8k-1k-512, 256k-48k-256,
  192k-36k-256, 160k-30k-256, 128k-24k-256, 96k-18k-256,
  80k-15k-256, 64k-12k-256, 48k-9k-256, 40k-7680-256,
  32k-6k-256, 24k-4608-256, 20k-3840-256, 16k-3k-256,
  12k-2304-256, 8k-1536-256

256 facets:
  256k-24k-2k, 192k-18k-2k, 160k-15k-2k, 128k-12k-2k,
  96k-9k-2k, 80k-7680-2k, 64k-6k-2k, 48k-4608-2k,
  40k-3840-2k, 32k-3k-2k, 24k-2304-2k, 20k-1920-2k,
  16k-1536-2k, 12k-1152-2k, 8k-768-2k, 256k-32k-1k,
  192k-24k-1k, 160k-20k-1k, 128k-16k-1k, 96k-12k-1k,
  80k-10k-1k, 64k-8k-1k, 48k-6k-1k, 40k-5k-1k,
  32k-4k-1k, 24k-3k-1k, 20k-2560-1k, 16k-2k-1k,
  12k-1536-1k, 8k-1k-1k, 192k-24k-768, 96k-12k-768,
  48k-6k-768, 24k-3k-768, 12k-1536-768, 256k-24k-1k,
  192k-18k-1k, 160k-15k-1k, 128k-12k-1k, 96k-9k-1k,
  80k-7680-1k, 64k-6k-1k, 48k-4608-1k, 40k-3840-1k,
  32k-3k-1k, 24k-2304-1k, 20k-1920-1k, 16k-1536-1k,
  12k-1152-1k, 8k-768-1k

324 facets:
  192k-16k-768, 96k-8k-768, 48k-4k-768, 24k-2k-768,
  12k-1k-768

576 facets:
  192k-16k-1536, 96k-8k-1536, 48k-4k-1536, 24k-2k-1536,
  12k-1k-1536, 256k-16k-1k, 192k-12k-1k, 160k-10k-1k,
  128k-8k-1k, 96k-6k-1k, 80k-5k-1k, 64k-4k-1k,
  48k-3k-1k, 40k-2560-1k, 32k-2k-1k, 24k-1536-1k,
  20k-1280-1k, 16k-1k-1k, 12k-768-1k, 8k-512-1k,
  256k-24k-512, 192k-18k-512, 160k-15k-512, 128k-12k-512,
  96k-9k-512, 80k-7680-512, 64k-6k-512, 48k-4608-512,
  40k-3840-512, 32k-3k-512, 24k-2304-512, 20k-1920-512,
  16k-1536-512, 12k-1152-512, 8k-768-512

1024 facets:
  256k-16k-2k, 192k-12k-2k, 160k-10k-2k, 128k-8k-2k,
  96k-6k-2k, 80k-5k-2k, 64k-4k-2k, 48k-3k-2k,
  40k-2560-2k, 32k-2k-2k, 24k-1536-2k, 20k-1280-2k,
  16k-1k-2k, 12k-768-2k, 8k-512-2k, 192k-12k-1536,
  96k-6k-1536, 48k-3k-1536, 24k-1536-1536, 12k-768-1536,
  256k-12k-2k, 192k-9k-2k, 160k-7680-2k, 128k-6k-2k,
  96k-4608-2k, 80k-3840-2k, 64k-3k-2k, 48k-2304-2k,
  40k-1920-2k, 32k-1536-2k, 24k-1152-2k, 20k-960-2k,
  16k-768-2k, 12k-576-2k, 8k-384-2k

1296 facets:
  192k-8k-1536, 96k-4k-1536, 48k-2k-1536, 24k-1k-1536,
  12k-512-1536, 192k-12k-768, 96k-6k-768, 48k-3k-768,
  24k-1536-768, 12k-768-768

2304 facets:
  256k-8k-2k, 192k-6k-2k, 160k-5k-2k, 128k-4k-2k,
  96k-3k-2k, 80k-2560-2k, 64k-2k-2k, 48k-1536-2k,
  40k-1280-2k, 32k-1k-2k, 24k-768-2k, 20k-640-2k,
  16k-512-2k, 12k-384-2k, 8k-256-2k, 256k-12k-1k,
  192k-9k-1k, 160k-7680-1k, 128k-6k-1k, 96k-4608-1k,
  80k-3840-1k, 64k-3k-1k, 48k-2304-1k, 40k-1920-1k,
  32k-1536-1k, 24k-1152-1k, 20k-960-1k, 16k-768-1k,
  12k-576-1k, 8k-384-1k

5184 facets:
  192k-6k-1536, 96k-3k-1536, 48k-1536-1536, 24k-768-1536,
  12k-384-1536

9216 facets:
  256k-6k-2k, 192k-4608-2k, 160k-3840-2k, 128k-3k-2k,
  96k-2304-2k, 80k-1920-2k, 64k-1536-2k, 48k-1152-2k,
  40k-960-2k, 32k-768-2k, 24k-576-2k, 20k-480-2k,
  16k-384-2k, 12k-288-2k, 8k-192-2k

The size of the image can be estimated as image_size × image_size × 16 bytes (16 bytes per pixel, i.e. one complex double). For instance, an 8k image takes 8192 × 8192 × 16 = 1073741824 B = 1024 MiB = 1 GiB.
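To make the arithmetic concrete, here is a small Python sketch (purely illustrative, not part of the prototype; the helper names are made up) that parses a --rec-set string, assuming a “k” suffix means multiples of 1024 pixels, and estimates the image memory:

    # Estimate image memory for a --rec-set configuration (illustrative sketch).
    # Assumes "8k" means 8 * 1024 pixels and 16 bytes (one complex double) per pixel.
    def pixels(token):
        return int(token[:-1]) * 1024 if token.endswith("k") else int(token)

    def image_memory_bytes(rec_set):
        image, _facet, _subgrid = rec_set.split("-")
        return pixels(image) ** 2 * 16

    for cfg in ("8k-2k-512", "256k-32k-2k"):
        print(cfg, image_memory_bytes(cfg) / 2**30, "GiB")
    # 8k-2k-512 1.0 GiB
    # 256k-32k-2k 1024.0 GiB (i.e. 1 TiB)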

Certain image configurations have aliases, which are listed below:

Image config (--rec-set)    Alias
512-216-256                 T05_
8k-2k-512                   tiny
16k-8k-512                  small
32k-8k-1k                   smallish
64k-16k-1k                  medium
96k-12k-1k                  large
128k-32k-2k                 tremendous
256k-32k-2k                 huge

The --rec-set option accepts either explicit sizes, such as --rec-set=8k-2k-512, or an alias, such as --rec-set=small. The image size determines the memory requirement of each recombination parameter set: for instance, running with --rec-set=256k-32k-2k will require at least 1 TiB of cumulative memory across all the reserved nodes, and in practice considerably more than that. You can find the approximate memory required to run each of these cases in Memory requirements.

Note that --rec-set=256k-32k-2k is not suited to --vis-set=lowbd2. As SKA1 LOW works at longer wavelengths, this recombination set gives an enormous field of view (fov), which means we are approximating a larger, more strongly curved portion of the sky as a 2D image. The fov can be computed as

fov = fov_frac * image_size * c / (2 * max_bl * max_freq)

For SKA1 LOW, the maximum baseline is 60 km and the maximum frequency is 300 MHz, so even with an image size of 131072 and a field-of-view fraction of 0.75 we obtain fov ≈ 0.82, which is about 40% of the sky sphere (which goes from -1 to +1 and therefore has size 2). This can be very inefficient, so this recombination set should only be used with --vis-set=midr5. In the case of SKA1 MID, the same calculation gives roughly 0.2.
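The following sketch (illustrative Python, reproducing the arithmetic above with only the numbers quoted in the text) carries out this calculation for SKA1 LOW:

    # Field-of-view estimate for SKA1 LOW, using the numbers from the text.
    c = 299_792_458.0      # speed of light, m/s
    fov_frac = 0.75        # usable fraction of the image
    image_size = 131_072   # pixels
    max_bl = 60e3          # maximum baseline, m
    max_freq = 300e6       # maximum frequency, Hz

    fov = fov_frac * image_size * c / (2 * max_bl * max_freq)
    print(fov)             # ~0.82, i.e. about 40% of the sky sphere (size 2)

Substituting SKA1 MID's longer maximum baseline and higher maximum frequency into the same formula gives the quoted value of about 0.2.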

By default, the MPI processes are divided equally between producers (facet workers) and subgrid workers. The recommendation is to have one facet worker per facet, plus one. The background is that the prototype cannot properly parallelise work on different facets if a worker has to hold multiple of them, which means that work on the different facets gets partly sequentialised; this can lead to deadlock if the queues fill up. On the other hand, there is little reason not to have one worker per facet even if it means additional ranks: facet workers are heavily I/O bound and therefore rarely compete with each other or with other processing on the node. They do, however, create many unnecessary threads, which can eventually become a problem. This will be addressed in future work on the prototype.

Furthermore, one additional worker is a good idea because it gives us a dedicated worker for work assignment. For basically any non-trivial distributed run, assigning work quickly is essential, and MPI seems to introduce big delays into such request-reply interactions while there is heavy data movement to the same rank. Usually, the last MPI rank is the assignment worker. Importantly, there is a deadlock situation when not using facet workers, i.e. --facet-workers=0; always use at least one facet worker.
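For example, --rec-set=small (16k-8k-512) produces 9 facets, so this recommendation corresponds to --facet-workers=10, with the extra (last) rank acting as the dedicated assignment worker.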

Visibility set

This parameter specifies the telescope layout and defines the baselines. It is selected using --vis-set. Several configurations are available: “lowbd2”, “lowbd2-core”, “lowr3”, “midr5”, “lofar” and “vlaa”. “lowbd2” and “midr5” correspond to the SKA LOW and SKA MID layouts. “vlaa” contains very few antennas and is thus very fast to run; it can be used to check that the code runs. Only “lowbd2”, “midr5” and “vlaa” are extensively tested.

Time and frequency settings

These parameters define the time snapshot, dump times, frequency range and number of frequency channels. They are used as --time=<start>:<end>/<steps>[/<chunk>] and --freq=<start>:<end>/<steps>[/<chunk>]. <start> and <end> give the time or frequency range, and <steps> the number of time or frequency steps. Finally, <chunk> is both the chunk size used for writing visibilities out in HDF5 and the granularity at which we handle visibilities internally: each chunk is generated as a whole. Note that the larger the chunks, the more likely it is that they involve data from multiple subgrids, which might require re-writing them. For instance, --time=-460:460/1024/32 --freq=260e6:300e6/8192/32 means a 15-minute snapshot with a 0.9 s dump rate (hence 1024 steps) and a chunk size of 32 time steps. Similarly, 40 MHz is divided into 8192 frequency channels with a chunk size of 32. Visibilities are simply complex doubles (16 bytes each), so a chunk of 32 time steps by 32 frequency channels amounts to 16 × 32 × 32 / 1024 = 16 KiB.
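A quick sketch (illustrative Python, using only the numbers from the example above) of the derived quantities:

    # Time/frequency example: --time=-460:460/1024/32 --freq=260e6:300e6/8192/32
    t_range, t_steps, t_chunk = 460 - (-460), 1024, 32
    f_range, f_steps, f_chunk = 300e6 - 260e6, 8192, 32

    print(t_range / t_steps)        # ~0.9 s dump rate
    print(f_range / f_steps)        # ~4.9 kHz channel width
    print(16 * t_chunk * f_chunk)   # 16384 B = 16 KiB per visibility chunk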

Declination

Declination of the phase centre. Due to the position of the SKA telescopes in the southern hemisphere, values around -30 place the phase centre closest to the zenith and therefore give the lowest non-coplanarity complexity. It can be set with --dec=-30. However, at this point the complexity due to non-coplanarity is minimised via reprojection, which means that up to around 45 degrees there should be little difference in terms of performance.

Source count

By default we generate a random image that is entirely zeros except for a number of on-grid source pixels of intensity 1. With this approach it is trivial to check accuracy using direct Fourier transforms: activating this option means that the benchmark will check random samples of both subgrids and visibilities and collect RMSE statistics. The number of sources is set with, for example, --source-count=10, which places 10 random point sources in the otherwise-zero image.
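As an illustration of why on-grid point sources make accuracy checks trivial: the exact visibility of such an image is just a short sum of complex exponentials. The following is a hedged Python sketch of the idea, not the prototype's actual check (positions, names and sign convention are made up for the example):

    # Exact visibilities of a point-source image via direct Fourier transform.
    import numpy as np

    rng = np.random.default_rng(0)
    image_size, n_sources = 256, 10
    # On-grid source pixels of intensity 1 (offsets from the image centre)
    sources = rng.integers(-image_size // 2, image_size // 2, size=(n_sources, 2))

    def dft_visibility(u, v):
        l = sources[:, 0] / image_size
        m = sources[:, 1] / image_size
        return np.exp(-2j * np.pi * (u * l + v * m)).sum()

    # A degridded visibility vis at (u, v) can then be scored by its error
    # abs(vis - dft_visibility(u, v)), from which RMSE statistics follow.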

Producer thread count

The number of OpenMP threads configured at runtime is used for the subgrid workers, as they do most of the computationally intensive work. Facet workers, on the other hand, only hold the facet data in memory while sending subgrids to subgrid workers, and hence do not need as many OpenMP threads. The number of threads for facet workers can be configured using the --producer-threads option.

Queue parameters

Multiple queues of limited size are used in the prototype in order to exert back-pressure and prevent running out of memory.

  • Send queue (--send-queue): Number of subgrids that facet workers transfer out in parallel. Note that this is per OpenMP thread!

  • Subgrid queue (--subgrid-queue): Number of receive slots in total per subgrid worker. This also decides the number of subgrids that are kept.

These are roughly opposite sides of the same coin. Note that the length of the subgrid queue strongly interacts with dynamic load balancing: If the subgrid queue is long, we assign work relatively early, and might be stuck with a bad work assignment. If the subgrid queue is too short, we might lose parallelism and struggle to keep workers busy.

Tasks (--bls-per-task) degrid visibilities from completed subgrids. One task (= one thread) always handles one baseline at a time (looping through all chunks overlapping the subgrid). By assigning multiple baselines per task we can make the work more coarse-grained, which keeps task-switching costs low and prevents generating too many tasks.

Finally, the task queue (--task-queue) limit exists to prevent us from running into OpenMP limits on the number of tasks (something like that seems to happen at around ~256). Beyond that, a shorter task queue means we proceed through subgrids in a more sequential fashion, freeing up subgrid slots more quickly. As usual, however, the queue needs to be long enough to provide sufficient parallelism for all threads.
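As a purely illustrative combination (the values are hypothetical, not tested recommendations), such a run might pass --send-queue=8 --subgrid-queue=32 --bls-per-task=16 --task-queue=128, then adjust the subgrid and task queue lengths against the trade-offs described above.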

Gridding options

These should rarely need to be overridden unless gunning for higher accuracy. They are set by passing arguments to --grid and --grid-x0. The x0 value determines how much of the image space we actually use (and therefore have to cover with facets). Reducing it will lead to better precision and fewer facets, but also a smaller usable field of view.

Non-coplanarity

We generally have two ways of dealing with this:

  • w-stacking: Basically an outer loop around the recombination. Precise no matter the distance in w, but all facets need to be re-generated and image data needs to be re-distributed. Still not as inefficient as it sounds, as we skip all unneeded grid space. It is completely okay to have hundreds of w-stacking planes.

  • w-towers: Subgrid workers use subgrid data to infer a stack of nearby subgrids, without needing to go back to the image data. Errors will creep in from the sides of the subgrid, so this requires larger subgrid margins (= subgrids placed closer together).

The w-step (--wstep) is (roughly) the number of w-tower planes we use before we instead use w-stacking to generate the subgrid. More w-tower planes mean that more FFT work moves from facet workers to subgrid workers. The margin (--margin) unfortunately has to be chosen by trial and error. Look out for messages such as

Subgrid x/y/z (A baselines, rmse B, @wmax: C)

in the log: comparing B and C gives an indication of how badly the accuracy at the top of the “w-towers” has deteriorated. Where C is larger than the required visibility accuracy, either increase the margin or reduce the w-step. Note that the margin will get rounded (up), as not all subgrid spacings are permissible.
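For instance (hypothetical numbers), a line reporting rmse 2e-8 but @wmax: 5e-5 against a required visibility accuracy of 1e-6 would indicate that the tops of the w-towers are too inaccurate, calling for a larger --margin or a smaller --wstep.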

The target error (--target-err) specifies the maximum admissible error for determining the w-tower height. It can be provided in place of --wstep, in which case the number of w-tower planes is estimated from the target error value. If neither is provided, an error an order of magnitude above the best error we would get from the recombination parameters is used to estimate the w-tower planes.

Writer settings

If we want to write the output visibilities to disk or a file system, we need to pass the file path as a positional argument. It is important to use a %d placeholder, like /tmp/out%d.h5, when passing this argument. The placeholder will be replaced by a number determined by the number of MPI ranks and the number of writer threads. By default, the code uses 2 writer threads per streamer process to output the visibility data. This can be changed using the --writer-count option.
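For example, with two streamer processes and the default 2 writer threads each, /tmp/out%d.h5 would expand into four files, e.g. /tmp/out0.h5 through /tmp/out3.h5 (the exact numbering scheme is assumed here for illustration).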

Another important detail: by default the Imaging IO test uses pthreads of the streamer process as writer threads, and each writer thread writes to its own file. We use HDF5 for writing the visibility data, and there is a known issue with HDF5 when multiple threads try to write data concurrently (https://www.hdfgroup.org/2020/11/webinar-enabling-multithreading-concurrency-in-hdf5-community-discussion/): HDF5 uses a global lock when multiple threads write concurrently, even to different files. Using more than one writer thread will therefore not increase throughput; indeed, in several runs we observed that using more than one writer thread was counterproductive.

Fortunately, the solution is the --fork-writer option, which forks the process to create a new writer process. The difference is that the writer is now a complete process, whereas without this option only a thread is created for the writer. Using full processes rather than threads is the only way to achieve concurrent writes without a global lock with the HDF5 libraries. On macOS, --fork-writer will not work, as the semaphore locks used by the Darwin kernel are based on threads rather than processes.

Often, OpenMPI with openib support for InfiniBand will complain about forking the process. This is due to inconsistencies between the OpenMPI and openib libraries. From OpenMPI 4.0.0, UCX is the preferred InfiniBand support, and forking is not an issue with these libraries.