Configuration Settings

The number of MPI processes should be chosen based on the recombination parameters (--rec-set) and the number of sockets (NUMA nodes) available on each node. Essentially, the number of MPI processes is the sum of the number of facet workers, the number of subgrid workers and the number of dynamic work assignment workers. Currently we use one worker for dynamic work assignment; this might change in the future. The number of facet workers should equal the number of facets in the recombination parameters, which is listed in Recombination parameters. The number of subgrid workers should be chosen based on the number of NUMA nodes on each compute node: if a compute node has two sockets, i.e. two NUMA nodes, we should ideally have two subgrid workers. If there are 16 physical cores on each socket with hyperthreading enabled, we can use all available cores, both physical and logical, for the subgrid workers; in this example the number of OpenMP threads would be 32. We have noticed that using all available cores gives a marginal performance benefit for dry runs.

num_mpi_process = num_facets + num_nodes * num_numa_nodes + num_dynamic_work_assignment

where num_nodes is the number of nodes in the reservation, num_numa_nodes is the number of NUMA nodes in each compute node, num_facets is the number of facets and, finally, num_dynamic_work_assignment is the number of dynamic work assignment workers, which is 1.
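
As a quick sanity check, the formula can be evaluated in the shell. The facet count of 9 below is purely illustrative; take the real value for your --rec-set from the Recombination parameters table.

num_facets=9                     # illustrative only; depends on --rec-set
num_nodes=8                      # nodes in the reservation
num_numa_nodes=2                 # NUMA nodes (sockets) per compute node
num_dynamic_work_assignment=1    # currently always one
echo $(( num_facets + num_nodes * num_numa_nodes + num_dynamic_work_assignment ))   # 26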

Some example configurations for single and multiple nodes are given below.

Single Node

The program can be configured using the command line. Useful configurations:

$ mpirun -n 2 ./iotest --rec-set=T05

Runs a small test-case with two local processes to ensure that everything works correctly. It should show RMSE values of about 1e-8 for all recombined subgrids and 1e-7 for visibilities.

$ ./iotest --rec-set=small
$ ./iotest --rec-set=large

Does a dry run of the “producer” end in stand-alone mode. Primarily useful to check whether enough memory is available. The former will use about 10 GB memory, the latter about 350 GB.

$ ./iotest --rec-set=small --vis-set=vlaa --facet-workers=0

Does a dry run of the “degrid” part in stand-alone mode. Good to check stability and ensure that we can degrid fast enough.

$ ./iotest --rec-set=small --vis-set=vlaa --facet-workers=0 /tmp/out%d.h5

Same as above, but actually writes data out to the given file. By default, each subgrid worker creates two writer threads, hence the %d placeholder in the file name. The data will be all zeroes, but this runs through the entire back-end without involving actual distribution. It is typically quite a bit slower, as writing out data is generally the bottleneck.

$ mpirun -n 2 ./iotest --rec-set=small --vis-set=vlaa /tmp/out%d.h5

Runs the entire thing with one facet worker and one subgrid worker, this time producing actual-ish visibility data (for a random facet without grid correction).

$ mpirun -n 2 ./iotest --rec-set=small --vis-set=vlaa --time=-230:230/512/128 --freq=225e6:300e6/8192/128 /tmp/out.h5

The --vis-set and --rec-set parameters are just default parameter sets that can be overridden. The command line above increases time and frequency sampling to the point where it would roughly correspond to an SKA Low snapshot (7 minutes, 25% frequency range). The time and frequency specification is <start>:<end>/<steps>/<chunk>, so in this case 512 time steps with chunk size 128 and 8192 frequency channels with chunk size 128. This will write roughly 9 TB of data with a chunk granularity of 256 KB.
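
The chunk granularity can be checked with a quick calculation, assuming the benchmark stores visibilities as double-precision complex numbers at 16 bytes each (an assumption on our part, but consistent with the 256 KB figure above):

echo $(( 128 * 128 * 16 ))   # bytes per chunk: 262144 = 256 KiB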

Distributed

As explained above, the benchmark can also be run across a number of nodes. This distributes both the facet working set and the visibility write rate fairly evenly across nodes. As noted, you will want at minimum one producer and one streamer process per node, and should configure OpenMP such that its threads take full advantage of the machine's available cores, at least for the subgrid workers. Something that would be worth testing systematically is whether facet workers might actually be faster with fewer threads, as they likely spend most of their time waiting.

To distribute facet workers among all the nodes, the --map-by node argument should be used with OpenMPI. By default, OpenMPI assigns processes in blocks, so without --map-by node one or more nodes might end up with many facet workers. This is not what we want, as facet workers are memory bound. With OpenMPI's default mapping we would end up with subgrid workers on all low ranks and facet workers on high ranks. As facet workers wait most of the time (so use little CPU), yet use a lot of memory, that would make the entire run very unbalanced.
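
To check the resulting placement before committing to a full run, OpenMPI can print the process map for a trivial command (option spelling and output format vary somewhat between OpenMPI releases):

mpirun --map-by node -np 26 --display-map hostname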

For example, if we use --rec-set=small across 8 nodes (2 NUMA nodes each and 16 cores on each socket), we want to run 10 producer processes (facet workers) and 16 streamer processes (subgrid workers), using 16 threads each:

export OMP_NUM_THREADS=16
mpirun --map-by node -np 26 ./iotest --facet-workers=10 --rec-set=small $options

This would allocate 2 streamer processes per node with 16 threads each, appropriate for a node with 32 (physical) cores available. Facet workers are typically heavily memory bound and do not interfere too much with co-existing processes outside of reserving large amounts of memory.

This configuration (mpirun --map-by node -np 26 ./iotest --facet-workers=10 --rec-set=small) will just do a full re-distribution of facet/subgrid data between all nodes. This serves as a network I/O test. Note that because we are operating the benchmark without a telescope configuration, the entire grid is going to get transferred - not just the pieces of it that have baselines.

Other options for distributed mode:

options="--vis-set=lowbd2"

Will only re-distribute data that overlaps with baselines, then do degridding.

options="--vis-set=lowbd2 /local/out%d.h5"

Also writes out visibilities to the given file. Note that the benchmark does not currently implement parallel HDF5, so different streamer processes have to write separate output files. The file name can be made dependent on the streamer ID by putting a %d placeholder into it, so that it won't cause conflicts on shared file systems.
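
For illustration only, the %d then expands to a distinct ID per streamer, producing file names along the lines of the following (the exact numbering is assigned by the benchmark at run time):

/local/out0.h5
/local/out1.h5
...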

options="--vis-set=lowbd2 --fork-writer --writer-count=4 /local/out%d.h5"

This will create 4 writer processes for each subgrid worker and write the data to the file system. Remember that without the --fork-writer option the benchmark will only create writer threads, and the HDF5 library currently does not support concurrent writing from multiple threads, so that alone will not increase the data throughput.

SKA1 LOW and MID settings

To run the benchmark with settings that correspond to SKA1 LOW and SKA1 MID, the following configurations can be used. These settings assume we are running on 16 compute nodes with 2 NUMA nodes and 32 cores on each compute node, and --rec-set=small.

  • SKA LOW:

    export OMP_NUM_THREADS=16
    mpirun --map-by node -np 42 ./iotest --rec-set=small --vis-set=lowbd2 --facet-workers=10 --time=-460:460/1024/64 --freq=260e6:300e6/8192/64 --dec=-30 --source-count=10 --send-queue=4 --subgrid-queue=16 --bls-per-task=8 --task-queue=32 --target-err=1e-5 --margin=32
    
  • SKA MID:

    export OMP_NUM_THREADS=16
    mpirun --map-by node -np 42 ./iotest --rec-set=small --vis-set=midr5 --facet-workers=10 --time=-290:290/4096/64 --freq=0.35e9:0.4725e9/11264/64 --dec=-30 --source-count=10 --send-queue=4 --subgrid-queue=16 --bls-per-task=8 --task-queue=32 --target-err=1e-5 --margin=32
    

The above configurations run the benchmark in dry mode, without writing the visibility data to the file system. If we want to write the data, we have to add /some_scratch/out%d.h5 to the end of the command, where /some_scratch is a scratch directory on the file system. SKA LOW has 131,328 baselines, and with the above configuration 1,101,659,111,424 (131,328 x 1024 x 8192) visibilities will be produced, which corresponds to roughly 14 TB of data. SKA MID has 19,306 baselines, so 890,727,563,264 visibilities, which is about 17 TB of data. The amount of generated data can be affected by the chunk size used: a bigger chunk size involves more subgrids, which eventually requires some re-writes.
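
The visibility counts quoted above are simply baselines x time steps x frequency channels, which can be verified in the shell:

echo $(( 131328 * 1024 * 8192 ))    # SKA LOW:  1101659111424 visibilities
echo $(( 19306 * 4096 * 11264 ))    # SKA MID:   890727563264 visibilities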

Running with singularity image

The singularity image can be pulled from the GitLab registry as shown in Prerequisites and Installation. Currently, the singularity image supports three different entry points, which are defined using the apps feature from SCIF. The three entry points are as follows:

  • openmpi-3.1-ibverbs: OpenMPI 3.1.3 is built inside the container with IB verbs support for high performance interconnect.

  • openmpi-4.1-ucx: OpenMPI 4.1.0 with UCX is built inside the container.

  • openmpi-3.1-ibverbs-haswell: This is similar to openmpi-3.1-ibverbs, except that the imaging I/O test code is compiled with the Haswell microarchitecture instruction set. This is the default entry point to the container unless otherwise specified.

To list all the apps installed in the container, we can use the singularity inspect command:

singularity inspect --list-apps iotest.sif

Typically, singularity can be run using MPI as follows:

mpirun -np 2 singularity run --env OMP_NUM_THREADS=8 --bind /scratch --app openmpi-3.1-ibverbs iotest.sif ${ARGS}

The above command launches two MPI processes with 8 OpenMP threads each, using the entry point defined by the openmpi-3.1-ibverbs app of the iotest.sif image; ${ARGS} corresponds to the typical Imaging I/O test arguments presented above. If visibility data is written to non-standard directories, it is necessary to bind the directory using the --bind option, as shown in the command.
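
As a minimal sketch, ${ARGS} could for instance hold one of the single-node configurations shown earlier (the output path here is a placeholder for your own scratch directory):

export ARGS="--rec-set=small --vis-set=vlaa /scratch/out%d.h5"
mpirun -np 2 singularity run --env OMP_NUM_THREADS=8 --bind /scratch --app openmpi-3.1-ibverbs iotest.sif ${ARGS}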