Parameterisation

Recombination parameters

Recombination parameters are generally given as "image-facet-subgrid" sizes. The number of facets depends on the ratio of the facet size to the image size, while the total memory requirement depends on the image size. A parameter set is selected by passing an argument to --rec-set. The available options and their facet counts are as follows:
| Image config (image-facet-subgrid) | Number of facets |
| --- | --- |
| 256k-256k-256, 192k-192k-256, 160k-160k-256, 128k-128k-256, 96k-96k-256, 80k-80k-256, 64k-64k-256, 48k-48k-256, 40k-40k-256, 32k-32k-256, 24k-24k-256, 20k-20k-256, 16k-16k-256, 12k-12k-256, 8k-8k-256 | 4 |
| 256k-128k-512, 192k-96k-512, 160k-80k-512, 128k-64k-512, 96k-48k-512, 80k-40k-512, 64k-32k-512, 48k-24k-512, 40k-20k-512, 32k-16k-512, 24k-12k-512, 20k-10k-512, 16k-8k-512, 12k-6k-512, 8k-4k-512, 256k-192k-256, 192k-144k-256, 160k-120k-256, 128k-96k-256, 96k-72k-256, 80k-60k-256, 64k-48k-256, 48k-36k-256, 40k-30k-256, 32k-24k-256, 24k-18k-256, 20k-15k-256, 16k-12k-256, 12k-9k-256, 8k-6k-256, 256k-128k-256, 192k-96k-256, 160k-80k-256, 128k-64k-256, 96k-48k-256, 80k-40k-256, 64k-32k-256, 48k-24k-256, 40k-20k-256, 32k-16k-256, 24k-12k-256, 20k-10k-256, 16k-8k-256, 12k-6k-256, 8k-4k-256 | 9 |
| 256k-96k-512, 192k-72k-512, 160k-60k-512, 128k-48k-512, 96k-36k-512, 80k-30k-512, 64k-24k-512, 48k-18k-512, 40k-15k-512, 32k-12k-512, 24k-9k-512, 20k-7680-512, 16k-6k-512, 12k-4608-512, 8k-3k-512, 256k-96k-256, 192k-72k-256, 160k-60k-256, 128k-48k-256, 96k-36k-256, 80k-30k-256, 64k-24k-256, 48k-18k-256, 40k-15k-256, 32k-12k-256, 24k-9k-256, 20k-7680-256, 16k-6k-256, 12k-4608-256, 8k-3k-256 | 16 |
| 192k-64k-768, 96k-32k-768, 48k-16k-768, 24k-8k-768, 12k-4k-768, 256k-64k-1k, 192k-48k-1k, 160k-40k-1k, 128k-32k-1k, 96k-24k-1k, 80k-20k-1k, 64k-16k-1k, 48k-12k-1k, 40k-10k-1k, 32k-8k-1k, 24k-6k-1k, 20k-5k-1k, 16k-4k-1k, 12k-3k-1k, 8k-2k-1k, 192k-48k-768, 96k-24k-768, 48k-12k-768, 24k-6k-768, 12k-3k-768, 256k-64k-256, 192k-48k-256, 160k-40k-256, 128k-32k-256, 96k-24k-256, 80k-20k-256, 64k-16k-256, 48k-12k-256, 40k-10k-256, 32k-8k-256, 24k-6k-256, 20k-5k-256, 16k-4k-256, 12k-3k-256, 8k-2k-256 | 36 |
| 256k-48k-1k, 192k-36k-1k, 160k-30k-1k, 128k-24k-1k, 96k-18k-1k, 80k-15k-1k, 64k-12k-1k, 48k-9k-1k, 40k-7680-1k, 32k-6k-1k, 24k-4608-1k, 20k-3840-1k, 16k-3k-1k, 12k-2304-1k, 8k-1536-1k, 256k-64k-512, 192k-48k-512, 160k-40k-512, 128k-32k-512, 96k-24k-512, 80k-20k-512, 64k-16k-512, 48k-12k-512, 40k-10k-512, 32k-8k-512, 24k-6k-512, 20k-5k-512, 16k-4k-512, 12k-3k-512, 8k-2k-512, 256k-48k-512, 192k-36k-512, 160k-30k-512, 128k-24k-512, 96k-18k-512, 80k-15k-512, 64k-12k-512, 48k-9k-512, 40k-7680-512, 32k-6k-512, 24k-4608-512, 20k-3840-512, 16k-3k-512, 12k-2304-512, 8k-1536-512 | 64 |
| 192k-32k-1536, 96k-16k-1536, 48k-8k-1536, 24k-4k-1536, 12k-2k-1536, 192k-32k-768, 96k-16k-768, 48k-8k-768, 24k-4k-768, 12k-2k-768 | 81 |
| 256k-32k-2k, 192k-24k-2k, 160k-20k-2k, 128k-16k-2k, 96k-12k-2k, 80k-10k-2k, 64k-8k-2k, 48k-6k-2k, 40k-5k-2k, 32k-4k-2k, 24k-3k-2k, 20k-2560-2k, 16k-2k-2k, 12k-1536-2k, 8k-1k-2k, 192k-24k-1536, 96k-12k-1536, 48k-6k-1536, 24k-3k-1536, 12k-1536-1536, 256k-32k-512, 192k-24k-512, 160k-20k-512, 128k-16k-512, 96k-12k-512, 80k-10k-512, 64k-8k-512, 48k-6k-512, 40k-5k-512, 32k-4k-512, 24k-3k-512, 20k-2560-512, 16k-2k-512, 12k-1536-512, 8k-1k-512, 256k-48k-256, 192k-36k-256, 160k-30k-256, 128k-24k-256, 96k-18k-256, 80k-15k-256, 64k-12k-256, 48k-9k-256, 40k-7680-256, 32k-6k-256, 24k-4608-256, 20k-3840-256, 16k-3k-256, 12k-2304-256, 8k-1536-256 | 144 |
| 256k-24k-2k, 192k-18k-2k, 160k-15k-2k, 128k-12k-2k, 96k-9k-2k, 80k-7680-2k, 64k-6k-2k, 48k-4608-2k, 40k-3840-2k, 32k-3k-2k, 24k-2304-2k, 20k-1920-2k, 16k-1536-2k, 12k-1152-2k, 8k-768-2k, 256k-32k-1k, 192k-24k-1k, 160k-20k-1k, 128k-16k-1k, 96k-12k-1k, 80k-10k-1k, 64k-8k-1k, 48k-6k-1k, 40k-5k-1k, 32k-4k-1k, 24k-3k-1k, 20k-2560-1k, 16k-2k-1k, 12k-1536-1k, 8k-1k-1k, 192k-24k-768, 96k-12k-768, 48k-6k-768, 24k-3k-768, 12k-1536-768, 256k-24k-1k, 192k-18k-1k, 160k-15k-1k, 128k-12k-1k, 96k-9k-1k, 80k-7680-1k, 64k-6k-1k, 48k-4608-1k, 40k-3840-1k, 32k-3k-1k, 24k-2304-1k, 20k-1920-1k, 16k-1536-1k, 12k-1152-1k, 8k-768-1k | 256 |
| 192k-16k-768, 96k-8k-768, 48k-4k-768, 24k-2k-768, 12k-1k-768 | 324 |
| 192k-16k-1536, 96k-8k-1536, 48k-4k-1536, 24k-2k-1536, 12k-1k-1536, 256k-16k-1k, 192k-12k-1k, 160k-10k-1k, 128k-8k-1k, 96k-6k-1k, 80k-5k-1k, 64k-4k-1k, 48k-3k-1k, 40k-2560-1k, 32k-2k-1k, 24k-1536-1k, 20k-1280-1k, 16k-1k-1k, 12k-768-1k, 8k-512-1k, 256k-24k-512, 192k-18k-512, 160k-15k-512, 128k-12k-512, 96k-9k-512, 80k-7680-512, 64k-6k-512, 48k-4608-512, 40k-3840-512, 32k-3k-512, 24k-2304-512, 20k-1920-512, 16k-1536-512, 12k-1152-512, 8k-768-512 | 576 |
| 256k-16k-2k, 192k-12k-2k, 160k-10k-2k, 128k-8k-2k, 96k-6k-2k, 80k-5k-2k, 64k-4k-2k, 48k-3k-2k, 40k-2560-2k, 32k-2k-2k, 24k-1536-2k, 20k-1280-2k, 16k-1k-2k, 12k-768-2k, 8k-512-2k, 192k-12k-1536, 96k-6k-1536, 48k-3k-1536, 24k-1536-1536, 12k-768-1536, 256k-12k-2k, 192k-9k-2k, 160k-7680-2k, 128k-6k-2k, 96k-4608-2k, 80k-3840-2k, 64k-3k-2k, 48k-2304-2k, 40k-1920-2k, 32k-1536-2k, 24k-1152-2k, 20k-960-2k, 16k-768-2k, 12k-576-2k, 8k-384-2k | 1024 |
| 192k-8k-1536, 96k-4k-1536, 48k-2k-1536, 24k-1k-1536, 12k-512-1536, 192k-12k-768, 96k-6k-768, 48k-3k-768, 24k-1536-768, 12k-768-768 | 1296 |
| 256k-8k-2k, 192k-6k-2k, 160k-5k-2k, 128k-4k-2k, 96k-3k-2k, 80k-2560-2k, 64k-2k-2k, 48k-1536-2k, 40k-1280-2k, 32k-1k-2k, 24k-768-2k, 20k-640-2k, 16k-512-2k, 12k-384-2k, 8k-256-2k, 256k-12k-1k, 192k-9k-1k, 160k-7680-1k, 128k-6k-1k, 96k-4608-1k, 80k-3840-1k, 64k-3k-1k, 48k-2304-1k, 40k-1920-1k, 32k-1536-1k, 24k-1152-1k, 20k-960-1k, 16k-768-1k, 12k-576-1k, 8k-384-1k | 2304 |
| 192k-6k-1536, 96k-3k-1536, 48k-1536-1536, 24k-768-1536, 12k-384-1536 | 5184 |
| 256k-6k-2k, 192k-4608-2k, 160k-3840-2k, 128k-3k-2k, 96k-2304-2k, 80k-1920-2k, 64k-1536-2k, 48k-1152-2k, 40k-960-2k, 32k-768-2k, 24k-576-2k, 20k-480-2k, 16k-384-2k, 12k-288-2k, 8k-192-2k | 9216 |
As for the size of the image, we can estimate it as image-size * image-size * 16 bytes. For instance, for an 8k image, the total image size is 8192 * 8192 * 16 B = 1073741824 B = 1 GiB.
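This estimate is easy to reproduce. The helper below is an illustrative sketch, not part of the benchmark itself:

```python
def image_bytes(image_size: int, bytes_per_pixel: int = 16) -> int:
    """Estimated image memory: image-size * image-size * 16 bytes."""
    return image_size * image_size * bytes_per_pixel

print(image_bytes(8192) // 2**30)        # 8k image   -> 1 (GiB)
print(image_bytes(256 * 1024) // 2**40)  # 256k image -> 1 (TiB)
```

The same arithmetic explains the 1 TiB figure quoted below for the 256k configurations.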
Certain image configurations have an alias; they are listed below:
| Image config (image-facet-subgrid) | Alias |
| --- | --- |
| 512-216-256 | T05_ |
| 8k-2k-512 | tiny |
| 16k-8k-512 | small |
| 32k-8k-1k | smallish |
| 64k-16k-1k | medium |
| 96k-12k-1k | large |
| 128k-32k-2k | tremendous |
| 256k-32k-2k | huge |
The --rec-set option accepts either a size, like --rec-set=8k-2k-512, or an alias, like --rec-set=small. The size of the image determines the memory requirements for each recombination parameter set. For instance, running with --rec-set=256k-32k-2k will require at least 1 TiB of cumulative memory across all the reserved nodes. The real memory requirement will be considerably more than 1 TiB; you can find the approximate memory required to run each of these cases in Memory requirements.
Note that --rec-set=256k-32k-2k is not suited for --vis-set=lowbd2. As SKA1 LOW works at longer wavelengths, this recombination set would give an enormous field of view (fov), which means we would be approximating a strongly curved sky as a 2D image. The fov can be computed as

fov = fov_frac * image_size * c / (2 * max_bl * max_freq)

For SKA1 LOW, the maximum baseline is about 60 km and the maximum frequency 300 MHz, so using an image size of 131072 and a usable field-of-view fraction of 0.75 we obtain fov ≈ 0.82, which is about 40% of the sky sphere (which goes from -1 to +1, and therefore has size 2). This can be very inefficient, so this recombination set should only be used with --vis-set=midr5. In the case of SKA1 Mid, this value comes to about 0.2.
By default, the MPI processes are divided equally between producer (facet) workers and subgrid workers. The recommendation is to have one facet worker per facet, plus one. The background is that the prototype cannot properly parallelise work on different facets if a worker has to hold several of them, which means that work on the different facets is partly sequentialised; this can lead to deadlock once the queues fill up. On the other hand, there is little reason not to have one worker per facet even if it means additional ranks: facet workers are heavily I/O bound, and therefore rarely compete with each other or with other processing on the node. They do, however, create many unnecessary threads, which can eventually become a problem; this will be addressed in future work on the prototype.

Furthermore, one additional worker is a good idea because it gives us a dedicated worker for work assignment. For basically any non-trivial distributed run, assigning work quickly is essential, and MPI seems to introduce large delays into such request-reply interactions while there is heavy data movement to the same rank. Usually, the last MPI rank is the assignment worker. Importantly, there is a deadlock situation when no facet workers are used, i.e. --facet-workers=0; always use at least one facet worker.
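The recommended split can be sketched as follows. This helper is purely illustrative; it is not part of the benchmark CLI:

```python
def split_ranks(total_ranks: int, n_facets: int) -> tuple[int, int]:
    """Recommended split: one facet worker per facet, plus one
    dedicated assignment worker; the remaining ranks become
    subgrid workers."""
    facet_workers = n_facets + 1
    subgrid_workers = total_ranks - facet_workers
    if subgrid_workers < 1:
        raise ValueError("not enough ranks left for subgrid workers")
    return facet_workers, subgrid_workers

print(split_ranks(32, 16))  # (17, 15)
```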
Visibility set

This parameter specifies the telescope layout and thereby defines the baselines. It is set with --vis-set. Several configurations are available: "lowbd2", "lowbd2-core", "lowr3", "midr5", "lofar" and "vlaa". "lowbd2" and "midr5" correspond to the SKA LOW and SKA MID layouts. "vlaa" contains very few antennas and is therefore very fast to run; it can be used to check that the code runs at all. Only "lowbd2", "midr5" and "vlaa" are extensively tested.
Time and frequency settings

These parameters define the time snapshot, dump times, frequency range and number of frequency channels. They are set as --time=<start>:<end>/<steps>[/<chunk>] and --freq=<start>:<end>/<steps>[/<chunk>]. <start> and <end> give the time or frequency range, and <steps> the number of time or frequency steps. Finally, <chunk> is both the chunk size used for writing visibilities out in HDF5 and the granularity at which visibilities are handled internally: each chunk is generated as a whole. Note that the larger the chunks, the more likely it is that they involve data from multiple subgrids, which might require re-writing them. For instance, --time=-460:460/1024/32 --freq=260e6:300e6/8192/32 means a 15-minute snapshot with a 0.9 s dump rate (hence 1024 steps) and a chunk size of 32 in time. Similarly, 40 MHz is divided into 8192 frequency channels with a chunk size of 32. Visibilities are simply complex doubles (16 bytes each), so a chunk size of 32 in both time and frequency means 16 * 32 * 32 B = 16 KiB per chunk.
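The arithmetic behind this example can be checked with a few lines (a sketch; the constants simply restate the flags above):

```python
BYTES_PER_VIS = 16  # one visibility is a complex double

# --time=-460:460/1024/32 --freq=260e6:300e6/8192/32
t_start, t_end, t_steps, t_chunk = -460, 460, 1024, 32
f_start, f_end, f_steps, f_chunk = 260e6, 300e6, 8192, 32

dump_rate = (t_end - t_start) / t_steps          # ~0.9 s per dump
channel_width = (f_end - f_start) / f_steps      # ~4.9 kHz per channel
chunk_bytes = BYTES_PER_VIS * t_chunk * f_chunk  # bytes per chunk

print(dump_rate)            # 0.8984375
print(chunk_bytes // 1024)  # 16 (KiB)
```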
Declination

Declination of the phase centre, passed as e.g. --dec=-30. Due to the position of the SKA telescopes in the southern hemisphere, values around -30 result in a phase centre closest to the zenith, and therefore the lowest non-coplanarity complexity. However, at this point the complexity due to non-coplanarity is minimised via reprojection, so up to around 45 degrees there should be little difference in terms of performance.
Source count

By default we generate a random image that is entirely zeroes, with a number of on-grid source pixels of intensity 1 in it. With this approach it is trivial to check accuracy using direct Fourier transforms. Activating this option means that the benchmark will check random samples of both subgrids and visibilities and collect RMSE statistics. The number of sources is set with e.g. --source-count=10, which places 10 random point sources in an otherwise empty image.
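To illustrate why on-grid unit sources make verification trivial: the exact visibility at any (u, v) sample is just a sum of phase terms over the source positions, so measured values can be compared against it directly. A minimal sketch of the idea (not the benchmark's actual code):

```python
import cmath

def predict_vis(sources, u, v, image_size):
    """Exact visibility at grid point (u, v) for unit-intensity,
    on-grid point sources, via a direct Fourier transform."""
    return sum(cmath.exp(-2j * cmath.pi * (u * x + v * y) / image_size)
               for x, y in sources)

def rmse(measured, predicted):
    """Root-mean-square error between sampled and predicted values."""
    n = len(measured)
    return (sum(abs(m - p) ** 2
                for m, p in zip(measured, predicted)) / n) ** 0.5

sources = [(5, 17), (42, 3)]
samples = [(0, 0), (1, 2), (7, 7)]
pred = [predict_vis(sources, u, v, 64) for u, v in samples]
print(rmse(pred, pred))  # 0.0 -- a perfect pipeline matches exactly
```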
Producer thread count

The number of OpenMP threads configured at runtime is used for the subgrid workers, as they do most of the computationally intensive work. Facet workers, on the other hand, only hold the facet data in memory while sending subgrids to the subgrid workers, and hence do not need as many OpenMP threads. The number of threads for facet workers can be configured with the --producer-threads option.
Queue parameters

Multiple queues of limited size are used in the prototype in order to exert back-pressure and prevent running out of memory.

- Send queue (--send-queue): the number of subgrids facet workers transfer out in parallel. This is per OpenMP thread!
- Subgrid queue (--subgrid-queue): the total number of receive slots per subgrid worker. This also decides the number of subgrids that are kept.

These are roughly opposite sides of the same coin. Note that the length of the subgrid queue interacts strongly with dynamic load balancing: if the subgrid queue is long, we assign work relatively early and might be stuck with a bad work assignment; if it is too short, we might lose parallelism and struggle to keep workers busy.
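The back-pressure mechanism is the standard bounded-queue pattern: a producer blocks as soon as the queue is full, so memory use stays capped. A minimal sketch in Python (illustrative only, not the prototype's own code):

```python
import queue
import threading

subgrid_queue = queue.Queue(maxsize=4)  # bounded, cf. --subgrid-queue

def producer(n):
    for i in range(n):
        subgrid_queue.put(i)   # blocks while the queue is full: back-pressure
    subgrid_queue.put(None)    # sentinel: no more subgrids

def consumer(received):
    while (item := subgrid_queue.get()) is not None:
        received.append(item)  # "process" the subgrid, freeing a slot

received = []
t = threading.Thread(target=producer, args=(16,))
t.start()
consumer(received)
t.join()
print(received == list(range(16)))  # True
```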
Tasks (--bls-per-task) are used for degridding visibilities from completed subgrids. One task (= one thread) always handles one baseline at a time, looping through all chunks overlapping the subgrid. By assigning multiple baselines per task we can make things more coarse-grained, which helps keep task-switching costs low and prevents generating too many tasks.
Finally, the task queue limit (--task-queue) is there to prevent us running into OpenMP limits on the number of tasks (something like that seems to happen around ~256). Beyond that, a shorter task queue means we proceed through subgrids in a more sequential fashion, freeing up subgrid slots more quickly. As usual, however, the queue needs to be long enough to provide sufficient parallelism for all threads.
Gridding options

You should rarely have to override these, except when gunning for higher accuracy. They are set by passing arguments to --grid and --grid-x0. The x0 value determines how much of image space we actually use (and therefore have to cover with facets). Reducing it leads to better precision and fewer facets, but also a smaller usable field of view.
Non-coplanarity

We generally have two ways of dealing with this:

- w-stacking: basically an outer loop around the recombination. Precise no matter the distance in w, but all facets need to be re-generated and image data needs to be re-distributed. Still not as inefficient as it sounds, as we skip all unneeded grid space. It is completely fine to have hundreds of w-stacking planes.
- w-towers: subgrid workers use subgrid data to infer a stack of nearby subgrids, without needing to go back to image data. Errors creep in from the sides of the subgrid, so the subgrids need larger margins (i.e. they must be placed closer together).
The w-step (--wstep) is (roughly) how many w-tower planes we have before we use w-stacking to generate the subgrid instead. More w-towers means that we move more FFT work from facet workers to subgrid workers. The margin (--margin) unfortunately has to be chosen by trial and error. Look out for messages such as

Subgrid x/y/z (A baselines, rmse B, @wmax: C)

in the log: comparing B and C gives an indication of how badly the accuracy at the top of the "w-towers" has deteriorated. Where C is larger than the required visibility accuracy, either increase the margin or reduce the w-step. Note that the margin will be rounded (up), as not all subgrid spacings are permissible.
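Checking the log can be automated. The sketch below assumes the message format quoted above; the regular expression and helper are ours, not part of the benchmark:

```python
import re

# Parse "Subgrid x/y/z (A baselines, rmse B, @wmax: C)" lines
LOG_RE = re.compile(
    r"Subgrid (?P<pos>\S+) \((?P<bls>\d+) baselines, "
    r"rmse (?P<rmse>[\d.eE+-]+), @wmax: (?P<wmax>[\d.eE+-]+)\)")

def margin_ok(line: str, required_accuracy: float):
    """True if the accuracy at the top of the w-tower (C) is acceptable."""
    m = LOG_RE.search(line)
    if m is None:
        return None  # not a subgrid accuracy line
    return float(m.group("wmax")) <= required_accuracy

line = "Subgrid 3/4/2 (120 baselines, rmse 1.2e-07, @wmax: 3.4e-06)"
print(margin_ok(line, 1e-5))  # True  -> margin/wstep are fine
print(margin_ok(line, 1e-6))  # False -> increase margin or reduce wstep
```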
The target error (--target-err) specifies the maximum admissible error used to determine the w-tower height. It can be provided in place of --wstep, in which case the number of w-tower planes is estimated from the target error value. If neither is provided, an order of magnitude above the best error we would get from the recombination parameters is used to estimate the number of w-tower planes.
Writer settings

If we want to write the output visibilities to disk or a file system, we need to pass the file path as a positional argument. It is important to include a %d placeholder, as in /tmp/out%d.h5. The placeholder is replaced by a number determined by the MPI rank and the writer thread index. By default, the code uses 2 writer threads per streamer process to output the visibility data; this can be changed with the --writer-count option.
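For example, with 2 streamer processes and the default 2 writer threads each, a template like /tmp/out%d.h5 expands to four files. The sketch below assumes simple sequential numbering, which is illustrative rather than the benchmark's exact scheme:

```python
def writer_paths(template: str, n_streamers: int,
                 writers_per_streamer: int = 2) -> list[str]:
    """Expand the %d placeholder into one output file per writer
    (numbering scheme assumed for illustration)."""
    return [template % i
            for i in range(n_streamers * writers_per_streamer)]

print(writer_paths("/tmp/out%d.h5", 2))
# ['/tmp/out0.h5', '/tmp/out1.h5', '/tmp/out2.h5', '/tmp/out3.h5']
```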
Another important detail is that the Imaging IO test by default uses pthreads spawned from the streamer process as writer threads, with each writer thread writing to its own file. We use HDF5 for writing the visibility data, and there is a known issue with HDF5 when multiple threads try to write data concurrently (https://www.hdfgroup.org/2020/11/webinar-enabling-multithreading-concurrency-in-hdf5-community-discussion/): HDF5 takes a global lock when multiple threads write concurrently, even to different files. As a result, using more than one writer thread will not increase throughput; in fact, in a few runs we observed that using more than one writer thread is counterproductive.
Fortunately, the solution is the --fork-writer option, which forks the process to create a new writer process. The difference is that each writer is now a complete process, whereas without this option only a thread is created per writer. Using full processes rather than threads is the only way to achieve concurrent writes with the HDF5 libraries without hitting the global lock. On macOS, --fork-writer will not work, as the semaphore locks used by the Darwin kernel are based on threads rather than processes.

Often, OpenMPI with openib support for InfiniBand will complain about forking the process. This is due to inconsistencies between the OpenMPI and openib libraries. From OpenMPI 4.0.0, UCX is the preferred InfiniBand support, and forking is not an issue with these libraries.