Level 0 Benchmark Tests
CPU tests
HPCG benchmark
Context
This is the HPCG microbenchmark test with a default problem size of 104. The benchmark performs several matrix-vector operations on sparse matrices. More details about the benchmark can be found at the HPCG benchmark website.
Note
Currently, the implemented test uses the optimized version of the benchmark shipped with the Intel MKL library. The GNU compiler toolchain is used to run the test on AMD processors, and the IBM XL compiler toolchain is used on IBM POWER processors.
Test size
Currently, two different variables can be controlled for running tests. They are
number of nodes to run the benchmark
problem size
By default, the benchmark runs on a single node. To run on multiple nodes, set the num_nodes variable at runtime. Similarly, the default problem size is 104 and it can be changed at runtime using the problem_size variable.
Note
Even if more than one node is used in the test, the resulting performance metric, Gflop/s, is always reported per node.
Note
The problem size must be a multiple of 8. This is a requirement of the HPCG benchmark itself.
Test types
Currently, three different types of tests are implemented:
HpcgXlTest: HPCG with the IBM XL toolchain
HpcgGnuTest: HPCG with the GNU GCC toolchain
HpcgMklTest: HPCG shipped with the Intel MKL package
If a system has more than one valid test, we can restrict which test runs using the -n flag on the CLI, as shown in Usage.
Usage
The test can be run using the following commands.
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/hpcg/reframe_hpcg.py --run --performance-report
We can set the number of nodes on the CLI using:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/hpcg/reframe_hpcg.py --run --performance-report -S num_nodes=2
Similarly, the problem size of the benchmark can be altered at runtime using:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/hpcg/reframe_hpcg.py --run --performance-report -S problem_size=120
For instance, if a system has both HpcgGnuTest and HpcgMklTest as valid tests and we want to run only HpcgMklTest, we can use the -n flag as follows:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/hpcg/reframe_hpcg.py --run --performance-report -n HpcgMklTest
Test class documentation
- class apps.level0.cpu.hpcg.reframe_hpcg.HpcgMixin(*args, **kwargs)[source]
Common regression test attributes for HpcgGnuTest and HpcgMklTest
- class apps.level0.cpu.hpcg.reframe_hpcg.HpcgXlTest(*args, **kwargs)[source]
Main class of HPCG test based on IBM XL
- class apps.level0.cpu.hpcg.reframe_hpcg.HpcgGnuTest(*args, **kwargs)[source]
Main class of HPCG test based on GNU
- class apps.level0.cpu.hpcg.reframe_hpcg.HpcgMklTest(*args, **kwargs)[source]
Main class of HPCG test based on MKL
HPL benchmark
Context
This is the HPL microbenchmark test, using a single node as the default test configuration. It is used as the reference benchmark to provide data for the Top500 list and thus rank supercomputers worldwide. HPL relies on an efficient implementation of the Basic Linear Algebra Subprograms (BLAS).
Note
Currently, the implemented test uses the optimized version of the benchmark shipped with the Intel MKL library.
Test types
Currently, two different tests are defined:
HplGnuTest: Based on the GNU toolchain for non-Intel processors
HplMklTest: Uses the benchmark shipped with the Intel MKL library for Intel processors
On Intel chips, we can use the precompiled binary that comes out-of-the-box with the Intel MKL library. For non-Intel systems, we need to compile the benchmark using the GNU toolchain with a customised makefile. The makes/ directory provides a makefile for AMD chips using BLIS as the BLAS library. We can choose which test to run at runtime using the CLI, as discussed in Usage.
Test configuration file
The prerequisite to run the HPL benchmark is an HPL.dat file that contains several benchmark parameters. A sample configuration file looks like:
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
4 # of problems sizes (N)
29 30 34 35 Ns
4 # of NBs
1 2 3 4 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
3 # of process grids (P x Q)
2 1 4 Ps
2 4 1 Qs
16.0 threshold
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
2 # of recursive stopping criterium
2 4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
0 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
More details on each parameter can be found in the tuning section of the benchmark documentation. This link can be used to generate an HPL.dat file for a given runtime configuration. Another useful link in this context is here.
Currently, the test supports automatic generation of the HPL.dat file based on the system configuration. The class GenerateHplConfig in modules.utils is used for this purpose. The HPL problem size depends on the available system memory, and it is generally recommended to choose a size that occupies at least 80% of the system memory. For systems with a lot of memory, this can produce a very large problem size that takes a very long time to run. Thus, we cap the system memory used in this calculation at 200 GB to avoid very long run times.
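To illustrate how such a problem size could be derived, a minimal sketch is shown below; the function name and the block size NB of 192 are illustrative assumptions, and the actual GenerateHplConfig class may use different rounding or defaults.

import math

def estimate_hpl_problem_size(mem_bytes, num_nodes=1, nb=192, fraction=0.8,
                              mem_cap=200 * 1024**3):
    # The N x N matrix of doubles (8 bytes each) should fill roughly
    # `fraction` of the usable memory, with the per-node memory capped
    # at 200 GB as described above.
    usable = min(mem_bytes, mem_cap) * num_nodes * fraction
    n = int(math.sqrt(usable / 8))
    # Round down to a multiple of the (assumed) block size NB
    return (n // nb) * nb

# Example: a single node with 192 GB of RAM
print(estimate_hpl_problem_size(192 * 1024**3))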
For Intel processors, we use one MPI process per node, whereas for AMD chips we use the number of L3 caches as the number of MPI processes and the number of cores attached to each L3 cache as the number of OpenMP threads.
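The following sketch illustrates that rule; the function and its arguments are hypothetical, as the actual test derives these values from the processor topology known to ReFrame.

def mpi_layout(vendor, num_cores_per_node, num_l3_caches=None):
    # Returns (MPI processes per node, OpenMP threads per process).
    if vendor == 'intel':
        # One MPI process per node, all cores available as threads
        return 1, num_cores_per_node
    # AMD: one MPI process per L3 cache, threads = cores sharing that cache
    return num_l3_caches, num_cores_per_node // num_l3_caches

# Example: an AMD node with 128 cores and 32 L3 caches -> 32 ranks x 4 threads
print(mpi_layout('amd', 128, 32))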
Usage
The test can be run using the following commands.
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/hpl/reframe_hpl.py --run --performance-report
We can set the number of nodes on the CLI using:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/hpl/reframe_hpl.py --run --performance-report -S num_nodes=2
To choose a particular test at runtime, use the -n option as follows:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/hpl/reframe_hpl.py --run --performance-report -n HplGnuTest
Test class documentation
- class apps.level0.cpu.hpl.reframe_hpl.HplMixin(*args, **kwargs)[source]
Common methods and attributes for HPL main tests
- set_sanity_patterns()[source]
Set sanity patterns. Example stdout:
# Finished 1 tests with the following results:
# 1 tests completed and passed residual checks,
# 0 tests completed and failed residual checks,
# 0 tests skipped because of illegal input values
# --------------------------------------------------------------------------------
# End of Tests.
- class apps.level0.cpu.hpl.reframe_hpl.HplGnuTest(*args, **kwargs)[source]
Main class of HPL test based on GNU toolchain
- class apps.level0.cpu.hpl.reframe_hpl.HplMklTest(*args, **kwargs)[source]
Main class of HPL test based on MKL
Intel MPI Benchmarks
Context
Intel MPI Benchmarks (IMB) are used to measure application-level latency and bandwidth, particularly over a high-speed interconnect, associated with a wide variety of MPI communication patterns with respect to message size.
Note
Currently, only benchmarks from the IMB-MPI1 component are included in the test.
Included benchmarks
Currently, the test includes the following benchmarks:
Pingpong
Uniband
Biband
Sendrecv
Allreduce
Alltoall
Allgather
By default, all of the above benchmarks are run. However, the user can choose a subset of these benchmarks at runtime using the CLI, as discussed in Usage.
Number of MPI processes
By default, benchmarks like Uniband, Biband, etc. are run with the number of MPI processes varying from 2, 4, 8 and so on up to the number of physical cores on the node. To reduce the total number of benchmark runs, only two runs are chosen for each benchmark:
Run with 1 MPI process per node
Run with N MPI processes per node, where N is the number of physical cores.
Effectively, this configuration establishes upper and lower bounds for the benchmark metrics while minimising the time required for the benchmarks to run.
Test configuration
The only file needed for the test is placed in the src/ folder and provides the list of message sizes to be tested in the benchmark. The file must be as follows:
0
4096
16384
131072
1048576
4194304
To test more message sizes, simply add new lines to the file and place it in the src/ folder.
Test variables
Different variables are available for the user to change the runtime configuration of the tests. They are listed as follows:
variants: Benchmark variants to run, as listed in Included benchmarks (default: all benchmarks listed in Included benchmarks).
mem: Memory allocated per MPI process (default: 1).
timeout: Timeout for running the benchmark for each message size (default: 2).
The variables mem and timeout are specific to IMB and more details about these variables can be found in the documentation.
Tip
For benchmarks like Alltoall, Allgather and Allreduce involving many nodes, runs with bigger message sizes might time out. In this case, increase the timeout variable. Similarly, nodes with many cores and a small memory size can pose problems when running the benchmarks. As stated in Number of MPI processes, as many MPI processes as physical cores are used for the benchmark runs. So, if a node has N physical cores and less than N GB of DRAM, the benchmarks will fail due to insufficient memory. In this case, reduce the mem variable, which reduces the memory allocated to each MPI process.
All these variables can be configured at runtime from the CLI using the -S flag of ReFrame, as discussed in Usage.
Test parameterisation
The tests are parameterised based on the variable tot_nodes, whose default value is 2. This variable can only be configured from the CLI using the environment variable IMBTEST_NODES. Based on the tot_nodes value, the closest power of 2 is estimated and parameterised tests are generated for node counts in powers of 2. For instance, if tot_nodes is 64, tests are run on 2, 4, 8, 16, 32 and 64 nodes. For each run, all the requested benchmarks are executed with different numbers of MPI processes as described in Number of MPI processes. To restrict the runs to only a few node counts, use the -t flag on the CLI.
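The sketch below illustrates how the parameterised node counts could be derived from IMBTEST_NODES; it is only an approximation of the logic described above, not the actual test code.

import math
import os

tot_nodes = int(os.environ.get('IMBTEST_NODES', 2))   # default value is 2

# Node counts in powers of 2, up to the closest power of 2 not exceeding tot_nodes
node_counts = [2 ** i for i in range(1, int(math.log2(tot_nodes)) + 1)]

print(node_counts)   # IMBTEST_NODES=64 -> [2, 4, 8, 16, 32, 64]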
Usage
The test can be run using the following commands.
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/imb/reframe_imb.py --exec-policy=serial --run --performance-report
Important
It is absolutely necessary to use the --exec-policy=serial option when running these benchmarks. By default, ReFrame executes tests asynchronously, meaning all tests run at the same time. As we are interested in network latency and bandwidth metrics, these benchmarks should be run serially so that they do not interfere with each other.
We can choose benchmark variants from the CLI. For example, to run only the Uniband and Biband benchmarks:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/imb/reframe_imb.py --exec-policy=serial --run --performance-report -S variants="Uniband","Biband"
Similarly, other variables can also be configured from the CLI. To set mem to 0.5 and timeout to 3.0:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/imb/reframe_imb.py --exec-policy=serial --run --performance-report -S mem=0.5 -S timeout=3.0
To set the total number of nodes from the CLI using the IMBTEST_NODES environment variable, use the following:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
IMBTEST_NODES=16 reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/imb/reframe_imb.py --exec-policy=serial --run --performance-report
Finally, to select only a few parameterised tests, we can use the -t flag. For example, if tot_nodes is set to 16 and we want to run only the tests on 8 and 16 nodes, we can do the following:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
IMBTEST_NODES=16 reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/imb/reframe_imb.py --exec-policy=serial --run --performance-report -t 8$ -t 16$
All the above mentioned CLI flags can be used together without any side effects.
Test class documentation
- class apps.level0.cpu.imb.reframe_imb.ImbMixin(*args, **kwargs)[source]
Common test attributes for IMB test
- class apps.level0.cpu.imb.reframe_imb.ImbPingpongTest(*args, **kwargs)[source]
Main class of IMB Pingpong test
- set_sanity_patterns()[source]
Set sanity patterns. We override the method in Mixin Class. Example stdout:
- set_perf_patterns()[source]
Set performance variables. We override the method in Mixin Class. Sample stdout
# e.g.
# #---------------------------------------------------
# # Benchmarking PingPong
# # #processes = 2
# #---------------------------------------------------
# #bytes #repetitions t[usec] Mbytes/sec
# 0 1000 3.51 0.00
# #---------------------------------------------------
- class apps.level0.cpu.imb.reframe_imb.ImbOneCoreTests(*args, **kwargs)[source]
Main class of all IMB variants tests using one core per node
- class apps.level0.cpu.imb.reframe_imb.ImbAllCoreTests(*args, **kwargs)[source]
Main class of IMB variants tests using all cores per node
IOR benchmark
Context
IOR is designed to measure parallel file system I/O performance through a variety of potential APIs. This parallel program performs writes and reads to/from files and reports the resulting throughput rates. The tests are configured in such a way as to minimise the page-caching effect on I/O bandwidth. See here for more details.
Note
In order to run this test, an environment variable SCRATCH_DIR
must be defined in the system
partition with the path to the scratch directory of the platform. Otherwise the test will fail.
Test variables
Several variables are defined in the tests which can be configured from the command line interface (CLI). They are summarised as follows:
num_nodes: Number of nodes to run the test (default: 4)
num_mpi_tasks_per_node: Number of MPI processes per node (default: 8)
block_size: Block size of the IOR test (default: 1g)
transfer_size: Transfer size of the IOR test (default: 1m)
num_segments: Number of segments of the IOR test (default: 1)
The variables block_size, transfer_size and num_segments are IOR-specific. More details on these variables can be found in the IOR documentation.
Any of these variables can be overridden from the CLI using the -S option of ReFrame. Examples are presented in Usage.
Test parameterisation
The test is parameterised with respect to two parameters, namely the I/O interface and the file type. There are three different I/O interfaces available:
posix: POSIX I/O
mpiio: MPI I/O
hdf5: HDF5
We can write data to a single file or use a file-per-process approach, and the tests are parameterised as follows:
single: Single file for all processes
fpp: File per process
The parameterised tests can be controlled by tags which will be shown in the Usage section.
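As an illustration of this parameterisation, the six generated test names can be reproduced with the short sketch below (the real tests are generated by ReFrame's parameter mechanism):

from itertools import product

io_interfaces = ['posix', 'mpiio', 'hdf5']   # I/O interface parameter
file_types = ['single', 'fpp']               # file type parameter

for interface, file_type in product(io_interfaces, file_types):
    print(f'IorTest_{interface}_{file_type}')   # e.g. IorTest_posix_single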
Usage
The test can be run using the following commands.
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/ior/reframe_ior.py --exec-policy=serial --run --performance-report
Note
It is extremely important to use --exec-policy=serial for this particular test. By default, ReFrame executes the tests in asynchronous mode, which means multiple jobs are executed at the same time if the partition allows it. However, for this type of I/O test, we do not want all jobs using the underlying file system at the same time. So, we switch to serial execution, where only one job at a time is executed on the partition.
To configure the test variables presented in the Test variables section, we can use the -S option as follows:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/ior/reframe_ior.py --exec-policy=serial --run --performance-report -S num_nodes=2
Multiple variables can be configured simply by repeating the -S flag for each variable as follows:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/ior/reframe_ior.py --exec-policy=serial --run --performance-report -S num_nodes=2 -S block_size=10g
By default all parameterised tests will be executed for a given partition. The list of tests can be obtained using:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/ior/reframe_ior.py -l
which will give the following output:
[ReFrame Setup]
version: 3.9.0-dev.3+adca255d
command: 'reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/ior/reframe_ior.py -l'
launched by: mahendra@alaska-login-0.novalocal
working directory: '/home/mahendra/work/ska-sdp-benchmark-tests'
settings file: 'reframe_config.py'
check search path: (R) '/home/mahendra/work/ska-sdp-benchmark-tests/apps/level0/cpu/ior/reframe_ior.py'
stage directory: '/home/mahendra/work/ska-sdp-benchmark-tests/stage'
output directory: '/home/mahendra/work/ska-sdp-benchmark-tests/output'
[List of matched checks]
- IorTest_hdf5_single (found in '/home/mahendra/work/ska-sdp-benchmark-tests/apps/level0/cpu/ior/reframe_ior.py')
- IorTest_posix_single (found in '/home/mahendra/work/ska-sdp-benchmark-tests/apps/level0/cpu/ior/reframe_ior.py')
- IorTest_mpiio_single (found in '/home/mahendra/work/ska-sdp-benchmark-tests/apps/level0/cpu/ior/reframe_ior.py')
- IorTest_posix_fpp (found in '/home/mahendra/work/ska-sdp-benchmark-tests/apps/level0/cpu/ior/reframe_ior.py')
- IorTest_mpiio_fpp (found in '/home/mahendra/work/ska-sdp-benchmark-tests/apps/level0/cpu/ior/reframe_ior.py')
- IorTest_hdf5_fpp (found in '/home/mahendra/work/ska-sdp-benchmark-tests/apps/level0/cpu/ior/reframe_ior.py')
Found 6 check(s)
Log file(s) saved in '/home/mahendra/work/ska-sdp-benchmark-tests/reframe.log', '/home/mahendra/work/ska-sdp-benchmark-tests/reframe.out'
As we can see from the output, ReFrame will execute tests for all I/O interfaces and file types. To choose only a few parameterised tests, we can use the -t flag to restrict the tests to the given parameters. For example, to run only the POSIX I/O interface with the single-file variant,
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/ior/reframe_ior.py --exec-policy=serial --run --performance-report -t posix$ -t single$
can be used. As with the -S option, the -t flag can be repeated as many times as the user wants.
Test class documentation
- class apps.level0.cpu.ior.reframe_ior.IorTest(*args, **kwargs)[source]
Main class of IOR read and write tests
- set_sanity_patterns()[source]
Set sanity patterns. Example stdout
# Max Write: 940.74 MiB/sec (986.44 MB/sec)
# Max Read: 1303.68 MiB/sec (1367.01 MB/sec)
# Finished : Mon Oct 18 10:52:25 2021
- extract_write_bw()[source]
Performance extraction function for extracting the write bandwidth. Sample stdout:
# Max Write: 940.74 MiB/sec (986.44 MB/sec)
STREAM benchmark
Context
STREAM is used to measure the sustainable memory bandwidth of high performance computers. The source code is available here.
Note
Currently, the implemented test uses only the Intel compiler, which is optimized for Intel processors. A generic GNU-compiled STREAM test will be added in the future.
Test configuration
The STREAM benchmark uses three arrays of size N to perform different kernels. The most relevant and interesting kernel is the “Triad” kernel. In the test, the array size is chosen such that the three arrays occupy 60% of the system memory. This ensures that caching effects are avoided while running the benchmark.
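As an illustration, the array size could be derived from the system memory as in the sketch below; the function name is hypothetical and the actual test may round or cap the value differently.

def stream_array_size(mem_bytes, fraction=0.6, elem_bytes=8):
    # Number of double-precision elements per array so that the three
    # STREAM arrays together occupy `fraction` of the system memory.
    return int(mem_bytes * fraction / (3 * elem_bytes))

# Example: a node with 192 GB of RAM
print(stream_array_size(192 * 1024**3))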
The Makefile in the src/ folder contains all the optimized compiler flags used with the Intel compiler to extract maximum performance.
Usage
The test can be run using the following commands.
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/cpu/stream/reframe_stream.py --run --performance-report
Test class documentation
- class apps.level0.cpu.stream.reframe_stream.StreamTest(*args, **kwargs)[source]
Main class of Stream test based on Intel compiler
- set_sanity_patterns()[source]
Set sanity patterns. Example stdout:
# -------------------------------------------------------------
# Solution Validates: avg error less than 1.000000e-13 on all three arrays
# -------------------------------------------------------------
- extract_bw(kind='Copy')[source]
Performance function to extract bandwidth. Sample stdout:
# Function   Best Rate MB/s   Avg time   Min time   Max time
# Copy:          42037.6      0.003859   0.003806   0.004004
# Scale:         41047.7      0.003917   0.003898   0.003942
# Add:           45138.5      0.005347   0.005317   0.005372
# Triad:         46412.1      0.005202   0.005171   0.005238
GPU tests
Babel Stream benchmark
Context
Babel Stream is inspired by the STREAM benchmark and measures memory bandwidth on GPUs. It supports several other programming models for CPUs as well. More details can be found in the documentation.
Note
Although the benchmark supports various programming models, currently the test uses only OMP, TBB and CUDA models.
Test variants
Currently, three different variants of the benchmark are included in the test. They are
omp: Using the OpenMP threading model
tbb: Using Intel’s TBB model
cuda: Using the CUDA model for GPUs
The test is parameterised over these models and a specific test can be chosen at runtime using the -t flag on the CLI. An example is shown in Usage.
Test configuration
Like the STREAM benchmark, Babel Stream uses three arrays of size N for the different kernels. The size of the arrays used in the benchmark kernels can be configured at runtime using the mem_size variable. Currently, the default value of mem_size is 0.4, which means the array size is chosen such that the three arrays together occupy 40% of the total available memory.
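The relation between mem_size and the array size can be sketched as follows; this is an illustration only, and the benchmark itself may compute the value differently.

def babelstream_array_size(mem_bytes, mem_size=0.4, elem_bytes=8):
    # Elements per array so that the three arrays together occupy
    # `mem_size` of the available (GPU or host) memory; double precision assumed.
    return int(mem_bytes * mem_size / (3 * elem_bytes))

# Example: a GPU with 40 GB of memory and the default mem_size of 0.4
print(babelstream_array_size(40 * 1024**3))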
Note
Depending on the GPU, we might sometimes get an error saying there is not enough space available to store the buffers. In that case, decrease mem_size to allocate smaller arrays.
Usage
The test can be run using the following commands.
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/gpu/babel_stream/reframe_babelstream.py --run --performance-report
To run only the omp variant and skip the rest of the models, use the -t flag as follows:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/gpu/babel_stream/reframe_babelstream.py -t omp$ --run --performance-report
To change the default value of mem_size at runtime, use the -S flag. For example, to use 30% of the total memory:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/gpu/babel_stream/reframe_babelstream.py -S mem_size=0.3 --run --performance-report
Test class documentation
- class apps.level0.gpu.babel_stream.reframe_babelstream.BabelStreamTest(*args, **kwargs)[source]
Babel stream test main class
- set_sanity_patterns()[source]
Set sanity patterns. Example stdout:
# BabelStream
# Version: 3.4
# Implementation: OpenMP
# Running kernels 100 times
# Precision: double
# Array size: 268.4 MB (=0.3 GB)
# Total size: 805.3 MB (=0.8 GB)
# Function   MBytes/sec   Min (sec)   Max       Average
# Copy       69666.441    0.00771     0.01172   0.00783
# Mul        67689.368    0.00793     0.01323   0.00811
# Add        75708.142    0.01064     0.01792   0.01090
# Triad      76265.085    0.01056     0.01411   0.01071
# Dot        103668.530   0.00518     0.01109   0.00547
GPUDirect RDMA Benchmark Tests
Context
The GPUDirect RDMA (GDR) technology exposes GPU memory to I/O devices by enabling a direct communication path between GPUs in two remote systems. This feature eliminates the need to use the system CPUs to stage GPU data in and out of intermediate system memory buffers. As a result, the end-to-end latency is reduced and the sustained bandwidth is increased (depending on the PCIe topology).
The GDRCopy (GPUDirect RDMA Copy) library leverages the GPUDirect RDMA APIs to create CPU memory mappings of the GPU memory. The advantage of a CPU driven copy is the very small overhead involved. That is helpful when low latencies are required.
Note
The OSU micro-benchmark suite is used to test the GDR capabilities in the current test setting. Lower-level verbs tests can also be used if the user wishes to remove the overhead imposed by MPI.
Included benchmarks
Currently, the test includes the following categories of benchmarks:
Type of benchmark:
bw: Unidirectional bandwidth test
bibw: Bidirectional bandwidth test
latency: Latency test
Communication type:
D_D: Device to device
D_H: Device to host
H_D: Host to device
By default, all combinations of these tests will be performed. Both the benchmark type and the communication type are parameterised, and the user can select one or more of these tests at runtime using tags, as discussed in Usage.
Each of these tests will be executed in four different modes:
GPUDirect RDMA and GDR Copy Enabled
GPUDirect RDMA Enabled and GDR Copy Disabled
GPUDirect RDMA Disabled and GDR Copy Enabled
GPUDirect RDMA and GDR Copy Disabled
This enables us to investigate the effect of each component on the bandwidth and latency.
Benchmark configuration
There are two important variables for this test that need to be set:
net_adptr: Network adapter to use (default: mlx5_0:1)
ucx_tls: UCX transport modes (default: ['rc', 'cuda_copy'])
The value of net_adptr can be passed in two different ways:
In the system/partition configuration, as a key-value pair in the extras field.
Using the -S flag on the CLI to set the variable.
The value defined using the -S flag takes precedence over the system configuration value. If neither is set, the default value is used in the test. An example of how to define it in extras is as follows:
'extras': {
'interconnect': '100', # in Gb/s
'gpu_mem': '42505076736', # in bytes
'gdr_test_net_adptr': 'mlx5_0:1', # NIC that has end-to-end connectivity for GDR test
}
Optimal settings of these variables are necessary to leverage the available bandwidth of the InfiniBand (IB) stack. We should choose the network adapter that has end-to-end connectivity with the GPUs.
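The precedence between the CLI value, the extras entry and the default can be sketched as follows; the function is purely illustrative and only the key names mirror the ones used above.

def resolve_net_adptr(cli_value=None, partition_extras=None, default='mlx5_0:1'):
    # A -S net_adptr=... value on the CLI wins over the partition configuration
    if cli_value:
        return cli_value
    # Otherwise fall back to the 'gdr_test_net_adptr' entry in extras, if any
    if partition_extras and 'gdr_test_net_adptr' in partition_extras:
        return partition_extras['gdr_test_net_adptr']
    return default

extras = {'gdr_test_net_adptr': 'mlx5_3:1'}
print(resolve_net_adptr(partition_extras=extras))   # -> 'mlx5_3:1'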
Tip
We can get this information from the nvidia-smi topo -m command output. A typical output from this command is as follows:
GPU0 mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 CPU Affinity NUMA Affinity
GPU0 X NODE NODE PIX PIX PIX PIX 0-19 0
mlx5_0 NODE X PIX NODE NODE NODE NODE
mlx5_1 NODE PIX X NODE NODE NODE NODE
mlx5_2 PIX NODE NODE X PIX PIX PIX
mlx5_3 PIX NODE NODE PIX X PIX PIX
mlx5_4 PIX NODE NODE PIX PIX X PIX
mlx5_5 PIX NODE NODE PIX PIX PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
We have to make sure to use an adapter with the PIX attribute (at most a single PCIe bridge). In this case, mlx5_2, mlx5_3, mlx5_4 and mlx5_5 have the PIX attribute with respect to GPU0 and we can choose any of them.
Similarly, for the UCX transport methods, we can choose the ones available on the system. This information can be gathered using ucx_info -d, which lists all the available transports. These default variables can be overridden from the CLI, as shown in Usage.
Usage
The test can be run using the following commands.
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/gpu/gdr_test/reframe_gdr.py --run --performance-report
If we want to set ucx_tls to ['dc', 'cuda_copy'] and net_adptr to mlx5_3:1, we can use the -S flag as follows:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/gpu/gdr_test/reframe_gdr.py -S net_adptr=mlx5_3:1 -S ucx_tls=dc,cuda_copy --run --performance-report
Similarly, if we want to restrict the tests to only D_D (device to device) and bw (unidirectional bandwidth), we can use tags as follows:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/gpu/gdr_test/reframe_gdr.py -t D_D -t bw$ --run --performance-report
Test class documentation
- class apps.level0.gpu.gdr_test.reframe_gdr.GpuDirectRdmaTest(*args, **kwargs)[source]
GPU Direct RDMA test to benchmark bandwidth and latency between inter node GPUs
- override_net_adptr_from_sys_config()[source]
Override network adapter variable if found in sys config
- set_prerun_cmds()[source]
Set prerun commands. Set env variables for case of RDMA and GDR copy enabled
- set_sanity_patterns()[source]
Set sanity patterns. Example stdout:
# Test with RDMA_GDR_Copy_Enabled started
# OSU MPI-CUDA Bandwidth Test v5.7.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
# 1         1.89
# 2         3.80
# 4         7.65
# 8         15.25
# Test with RDMA and GDR Copy Enabled finished
- extract_bw(msg_size=1, case='RDMA_GDR_Copy_Enabled')[source]
Performance function to extract uni bandwidth
- extract_bibw(msg_size=1, case='RDMA_GDR_Copy_Enabled')[source]
Performance function to extract bi bandwidth
NCCL performance benchmarks
Context
NCCL is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP sockets.
In this test, we are only interested in the intra-node communication latencies and bandwidths, and so we run this test on a single node with multiple GPUs. The benchmarks report the so-called bus bandwidth, which can be compared with the underlying hardware peak bandwidth for collective communications. More details on how the bus bandwidth is estimated can be found at the nccl-tests repository.
Note
Each benchmark runs in two different modes namely, in-place and out-of-place. An in-place operation uses the same buffer for its output as was used to provide its input. An out-of-place operation has distinct input and output buffers.
Test variants
The test is parameterised to run the following communication benchmarks:
sendrecv
gather
scatter
reduce
all_gather
all_reduce
A specific test can be chosen at runtime using the -t flag on the CLI. An example is shown in Usage.
Test configuration
The tests can be configured to change the minimum and maximum message sizes used in the benchmarks. These can be set at runtime using the min_size and max_size variables. The default values are 8 bytes and 128 MiB, respectively.
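Based on these defaults and the doubling step visible in the sample output in the test class documentation, the swept message sizes can be sketched as follows (an illustration, not the test's own code):

def message_sizes(min_size=8, max_size=128 * 1024 ** 2):
    # Message sizes from min_size to max_size, doubling at every step
    size = min_size
    while size <= max_size:
        yield size
        size *= 2

print(list(message_sizes()))   # 8, 16, 32, ..., 134217728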
Usage
The test can be run using the following commands.
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/gpu/nccl_test/reframe_nccltest.py --run --performance-report
To run only the scatter and gather variants and skip the rest of the benchmarks, use the -t flag as follows:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/gpu/nccl_test/reframe_nccltest.py -t scatter$ -t gather$ --run --performance-report
To change the default value of min_size at runtime, use the -S flag. For example, to use 1 MiB for min_size:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/gpu/nccl_test/reframe_nccltest.py -S min_size=1M --run --performance-report
Test class documentation
- class apps.level0.gpu.nccl_test.reframe_nccltest.NcclTestDownload(*args, **kwargs)[source]
Fixture to fetch NCCL test source code
- class apps.level0.gpu.nccl_test.reframe_nccltest.NcclTestBuild(*args, **kwargs)[source]
NCCL tests compile test
- class apps.level0.gpu.nccl_test.reframe_nccltest.NcclPerfTest(*args, **kwargs)[source]
NCCL performance tests main class
- set_sanity_patterns()[source]
Set sanity patterns. Example stdout:
# # Out of bounds values : 0 OK
# # Avg bus bandwidth    : 0.791943
- extract_algbw(msg_size=None, place='in')[source]
Performance function to extract algorithmic bandwidth
- set_perf_patterns()[source]
Set performance variables. Sample stdout:
# # nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 # # # # Using devices # # Rank 0 Pid 16042 on grouille-1 device 0 [0x21] A100-PCIE-40GB # # Rank 1 Pid 16042 on grouille-1 device 1 [0x81] A100-PCIE-40GB # # # # out-of-place in-place # # size count type time algbw busbw error time algbw busbw error # # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) # 8 2 float 23.17 0.00 0.00 0e+00 22.75 0.00 0.00 0e+00 # 16 4 float 22.65 0.00 0.00 0e+00 22.68 0.00 0.00 0e+00 # 32 8 float 22.44 0.00 0.00 0e+00 22.54 0.00 0.00 0e+00 # 64 16 float 22.83 0.00 0.00 0e+00 22.37 0.00 0.00 0e+00 # 128 32 float 22.72 0.01 0.01 0e+00 22.64 0.01 0.01 0e+00 # 256 64 float 22.67 0.01 0.01 0e+00 22.47 0.01 0.01 0e+00 # 512 128 float 22.42 0.02 0.02 0e+00 22.26 0.02 0.02 0e+00 # 1024 256 float 22.63 0.05 0.05 0e+00 22.50 0.05 0.05 0e+00 # 2048 512 float 22.47 0.09 0.09 0e+00 22.52 0.09 0.09 0e+00 # 4096 1024 float 23.33 0.18 0.18 0e+00 23.28 0.18 0.18 0e+00 # 8192 2048 float 25.22 0.32 0.32 0e+00 24.90 0.33 0.33 0e+00 # 16384 4096 float 33.57 0.49 0.49 0e+00 33.95 0.48 0.48 0e+00 # 32768 8192 float 48.07 0.68 0.68 0e+00 49.19 0.67 0.67 0e+00 # 65536 16384 float 68.66 0.95 0.95 0e+00 72.52 0.90 0.90 0e+00 # 131072 32768 float 115.3 1.14 1.14 0e+00 114.1 1.15 1.15 0e+00 # 262144 65536 float 176.9 1.48 1.48 0e+00 174.8 1.50 1.50 0e+00 # 524288 131072 float 334.0 1.57 1.57 0e+00 342.7 1.53 1.53 0e+00 # 1048576 262144 float 643.3 1.63 1.63 0e+00 599.7 1.75 1.75 0e+00 # 2097152 524288 float 1125.6 1.86 1.86 0e+00 1077.9 1.95 1.95 0e+00 # 4194304 1048576 float 2813.4 1.49 1.49 0e+00 2670.2 1.57 1.57 0e+00 # 8388608 2097152 float 5561.3 1.51 1.51 0e+00 5497.1 1.53 1.53 0e+00 # 16777216 4194304 float 11070 1.52 1.52 0e+00 10950 1.53 1.53 1e+00 # 33554432 8388608 float 22215 1.51 1.51 0e+00 22687 1.48 1.48 1e+00 # 67108864 16777216 float 45987 1.46 1.46 0e+00 46600 1.44 1.44 1e+00 # 134217728 33554432 float 95433 1.41 1.41 0e+00 96707 1.39 1.39 1e+00 # # Out of bounds values : 0 OK # # Avg bus bandwidth : 0.778536 # #
Funclib Test
Context
This test runs functions from ska-sdp-func (https://gitlab.com/ska-telescope/sdp/ska-sdp-func). Tests for DFT and Phase Rotation are currently implemented.
Test variables
The DFT test supports two different polarisations as parameters.
The Phase Rotation test supports two parameters, which are configured as tuples in one ReFrame parameter. Those two parameters are “baselines”, which is the number of baselines to be tested, and “times”.
Environment variables
By default, the test will create a conda
environment and run inside it for the
sake of isolation. This can be controlled using env variable CREATE_CONDA_ENV
.
By setting it to NO
, the test WILL NOT create a conda
environment.
Similarly, the performance metrics are monitored using the perfmon toolkit. If
the user does not want to monitor metrics, it can be achieved by setting
MONITOR_METRICS=NO
.
Usage
The tests can be run using the following commands:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/gpu/hippo_func_lib/reframe_funclib_test.py --run --performance-report
If we want to change the variables to non-default values, we should use the -S flag. For example:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level0/gpu/hippo_func_lib/reframe_funclib_test.py -S start=1000000 --run --performance-report
Test class documentation
- class apps.level0.gpu.hippo_func_lib.reframe_funclib_test.FunclibTestDownload(*args, **kwargs)[source]
Fixture to fetch ska-sdp-func source code
- class apps.level0.gpu.hippo_func_lib.reframe_funclib_test.FunclibTestBuild(*args, **kwargs)[source]
Funclib test compile test
- class apps.level0.gpu.hippo_func_lib.reframe_funclib_test.FunclibDftTest(*args, **kwargs)[source]
Level 1 Benchmark Tests
CUDA NIFTY gridder performance benchmark
Context
CUDA NIFTY Gridder (CNG) is a CUDA implementation of the NIFTY gridder to (de)grid interferometric data using the improved w-stacking algorithm.
In this test, we are interested in the performance of CNG on different GPU devices. In order to stress the gridder, we use a synthetic SKA1 MID dataset with a configurable image size. More details on the design of the benchmark can be found in the src/ folder.
Note
The benchmark test uses visibility data that is randomly generated for a given uvw coverage. It is very expensive to do a DFT on this data to estimate the accuracy of the CNG. Hence, no accuracy tests are performed within this benchmark.
Test configuration
The tests can be configured to change the minimum and maximum number of frequency channels used in the benchmark. Similarly, the image size can also be configured at runtime. The variables that can be configured at runtime are:
min_chans: Minimum number of frequency channels as a power of 2 (default: 0)
max_chans: Maximum number of frequency channels as a power of 2 (default: 11)
img_size: Image size as a multiple of 1024 (default: 8)
With the default variables, the benchmark tests an image size of 8192 x 8192 pixels using 1 to 1024 frequency channels. These variables can be configured from the CLI using the -S flag, as shown in Usage.
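For illustration, the defaults translate into the sweep sketched below; treating max_chans as an exclusive upper bound is an inference from the defaults producing 1 to 1024 channels, not a statement about the actual implementation.

min_chans, max_chans, img_size = 0, 11, 8    # default values described above

channels = [2 ** i for i in range(min_chans, max_chans)]   # 1, 2, 4, ..., 1024
pixels = img_size * 1024                                   # 8192 x 8192 image

print(channels, pixels)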
Usage
The test can be run using the following commands.
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level1/cng_test/reframe_cngtest.py --run --performance-report
To run with a 16k image and up to 4096 frequency channels, use the -S option as follows:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level1/cng_test/reframe_cngtest.py -S img_size=16 -S max_chans=13 --run --performance-report
Test class documentation
- class apps.level1.cng_test.reframe_cngtest.CngTest(*args, **kwargs)[source]
CUDA NIFTY Gridder (CNG) performance tests main class
- set_sanity_patterns()[source]
Set sanity patterns. Example stdout:
All tests have successfully finished
- extract_time(num_chan=None, perf='invert')[source]
Performance function to extract (de)gridding times
- extract_vis(num_chan=None, perf='vis')[source]
Performance function to extract number of visibilities
- set_perf_patterns()[source]
Set performance variables. Sample stdout:
--------------------------------------------------------------------------------------- # Image size: 4096 x 4096 # Pixel size (in degrees): 4.099e-05 # Field of view (in degrees): 1.679e-01 # Minimum frequency: 1.300e+09 # Maximum frequency: 1.360e+09 # Number of baselines: 312048 # Integration interval (in sec): 1800 # Precision: sp # Accuracy: 1e-05 # Number of iterations: 10 ======================================================================================= CUDA NIFTY Benchmark results using synthetic SKA1 MID dataset ======================================================================================= # # Channels # Visibilities Invert time [s] Predict time [s] # 1 312048 0.30428 0.09089 # 2 624096 0.33928 0.09925 # 4 1248192 0.34887 0.10189 # 8 2496384 0.38257 0.11387 # 16 4992768 0.43269 0.14284 # 32 9985536 0.53047 0.17708 # 64 19971072 0.73875 0.26098 # 128 39942144 1.26618 0.44106 # 256 79884288 2.17768 0.74469 # 512 159768576 4.34335 1.34687 # 1024 319537152 8.66043 2.64405 --------------------------------------------------------------------------------------- # End of table # All tests have successfully finished
IDG Test
Context
The image-domain gridder (IDG) is a new, fast gridder that makes w-term correction and a-term correction computationally very cheap. It performs extremely well on GPUs. The source code is hosted on the ASTRON GitLab repository and documentation can be found here.
Test variables
The test supports several runtime configurable variables:
layout: Antenna layout. Available options are SKA1_low and SKA1_mid (default: SKA1_low)
num_cycles: Number of major cycles (default: 10)
num_stations: Number of antenna stations (default: 100)
gridsize: Grid size used for IDG (default: 8192)
num_chans: Number of frequency channels (default: 128)
This benchmark uses either the SKA1_low or SKA1_mid antenna layout and generates random visibility data for gridding and degridding. We use only one node and one GPU to run the benchmark and report various performance metrics. All these variables can be configured at runtime, as discussed in Usage.
Environment variables
By default, the test will create a conda
environment and run inside it for the
sake of isolation. This can be controlled using env variable CREATE_CONDA_ENV
.
By setting it to NO
, the test WILL NOT create conda
environment.
Similarly, the performance metrics are monitored using the perfmon toolkit. If
the user does not want to monitor metrics, it can be achieved by setting
MONITOR_METRICS=NO
.
Usage
The tests can be run using the following commands:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level1/idg_test/reframe_idgtest.py --run --performance-report
If we want to change the variables to non-default values, we should use the -S flag. For example, to run only 5 major cycles and 64 frequency channels, use:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level1/idg_test/reframe_idgtest.py -S num_cycles=5 -S num_chans=64 --run --performance-report
Test class documentation
- class apps.level1.idg_test.reframe_idgtest.IdgTestDownload(*args, **kwargs)[source]
Fixture to fetch IDG source code
- class apps.level1.idg_test.reframe_idgtest.IdgTestBuild(*args, **kwargs)[source]
IDG test compile test
- class apps.level1.idg_test.reframe_idgtest.IdgTest(*args, **kwargs)[source]
Main class of IDG benchmark tests
- get_num_nodes()[source]
Get number of nodes from total cores requested and number of cores per node
- set_num_tasks_job()[source]
This method sets tasks for the job. We use this to override the num_tasks set for reservation. Using this approach we can set num_tasks to job in a more generic way
- pre_launch()[source]
Set prerun commands. It includes setting scratch directory and pre run commands from base class
- set_sanity_patterns()[source]
Set sanity patterns. Example stdout:
>>> Total runtime
gridding: 6.5067e+02 s
degridding: 1.0607e+03 s
fft: 3.5437e-01 s
get_image: 6.5767e+00 s
imaging: 2.0073e+03 s
>>> Total throughput
gridding: 3.12 Mvisibilities/s
degridding: 1.91 Mvisibilities/s
imaging: 1.01 Mvisibilities/s
- extract_time(kind='gridding')[source]
Performance extraction function for time. Sample stdout:
>>> Total runtime
gridding: 7.5473e+02 s
degridding: 1.1090e+03 s
fft: 3.5368e-01 s
get_image: 7.2816e+00 s
imaging: 1.8899e+03 s
Imaging IO Test
Context
This is a prototype exploring the capability of hardware and software to deal with the types of I/O loads that the SDP will have to support for full-scale operation on SKA1 (and beyond). The benchmark is written in plain C and uses MPI for communication. The source code is hosted on SKA GitLab repository and documentation can be found here.
Test parameterisation
Currently, the benchmark supports three different parameterisations, namely:
- Variant of benchmark
- Number of cores
- Size of benchmark
Within the variant parameter, three different tests are defined, namely dry, write and read tests. As the name suggests, the dry test runs all the computations without writing data to disk; it can be used to assess the computational performance of the prototype. The write test does the computations and writes the data to disk, and hence benchmarks the I/O performance of the underlying file system. Finally, the read test reads the data that has been written to disk and gives the read performance.
The size of the benchmark defines how big a benchmark we want to run. The size low-small indicates a small image for the SKA1 LOW configuration, whereas low-large is a large image (96k) for the SKA1 LOW configuration. The same holds for mid-small and mid-large, although for SKA1 MID the large image size is 192k.
They are defined in the ReFrame test as follows:
variant = parameter(['dry-test', 'write-test', 'read-test'])
num_cores = parameter(1 << i for i in range(min, max))
size = parameter(['tiny', 'low-small', 'low-large', 'mid-small', 'mid-large'])
Environment variables
All these parameterisations are provided as tags to the ReFrame tests, and hence we can simply choose which benchmark to run by specifying the appropriate tags on the command line. By default, the min and max values used to parameterise num_cores are 9 and 14, respectively. However, these values can be overridden using the custom environment variables IMAGINGIOTEST_MIN and IMAGINGIOTEST_MAX, respectively.
By default, the test will create a conda
environment and run inside it for the
sake of isolation. This can be controlled using env variable CREATE_CONDA_ENV
.
By setting it to NO
, the test WILL NOT create conda
environment.
Similarly, the performance metrics are monitored using the perfmon toolkit. If
the user does not want to monitor metrics, it can be achieved by setting
MONITOR_METRICS=NO
.
Test filtering
The tests can be run using the following commands:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level1/imaging_iotest/reframe_iotest.py --run --performance-report
But first, let’s see the tests generated by ReFrame using the --list option as follows:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level1/imaging_iotest/reframe_iotest.py --list
The output is shown below:
[ReFrame Setup]
version: 3.10.0-dev.3+1407ae75
command: '/home/mpaipuri/benchmark-tests/main/reframe/bin/reframe -c apps/level1/imaging_iotest/reframe_iotest.py -l'
launched by: mpaipuri@fnancy.nancy.grid5000.fr
working directory: '/home/mpaipuri/benchmark-tests/main'
settings file: '/home/mpaipuri/benchmark-tests/main/reframe_config.py'
check search path: (R) '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py'
stage directory: '/home/mpaipuri/benchmark-tests/main/stage'
output directory: '/home/mpaipuri/benchmark-tests/main/output'
[List of matched checks]
- ImagingIOTest_read_test_low_small_32 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_mid_large_16 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_mid_small_64 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_mid_large_16 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_low_large_64 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_tiny_128 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_mid_small_8 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_low_large_64 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_low_large_32 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_mid_large_128 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_low_small_8 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_low_large_16 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_mid_small_16 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_low_small_32 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_low_small_16 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_mid_small_128 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_tiny_8 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_low_large_8 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_tiny_128 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_low_large_128 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_mid_small_8 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_tiny_32 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_low_small_128 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_mid_small_32 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_tiny_64 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_tiny_8 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_tiny_16 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_mid_large_8 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_mid_small_8 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_low_large_128 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_tiny_32 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_low_small_64 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_tiny_16 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_mid_large_16 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_mid_large_64 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_tiny_128 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_mid_large_64 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_mid_large_8 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_low_large_32 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_low_large_8 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_low_small_128 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_mid_large_32 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_low_small_32 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_low_large_16 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_low_large_64 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_low_large_8 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_mid_small_64 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_low_large_128 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_tiny_64 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_low_small_128 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_low_large_16 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_tiny_64 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_low_small_16 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_mid_small_32 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_mid_large_128 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_mid_large_128 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_mid_small_128 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_low_small_8 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_mid_large_32 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_low_small_64 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_mid_small_128 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_tiny_32 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_mid_large_64 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_low_large_32 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_low_small_8 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_mid_small_64 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_mid_small_16 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_mid_large_8 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_mid_small_32 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_mid_small_16 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_read_test_tiny_8 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_low_small_64 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_write_test_mid_large_32 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_low_small_16 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTest_dry_test_tiny_16 (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
- ImagingIOTestBuild (found in '/home/mpaipuri/benchmark-tests/main/apps/level1/imaging_iotest/reframe_iotest.py')
Found 76 check(s)
Log file(s) saved in '/home/mpaipuri/benchmark-tests/main/reframe.log', '/home/mpaipuri/benchmark-tests/main/reframe.out'
As we can see, ReFrame generates 76 tests, and they are for a single system partition. If multiple partitions and environments are defined, the number of tests is multiplied by the number of partitions and environments. This is rarely practical (unless we have unlimited resources to run these tests on), so we usually use test filtering to run only specific tests.
For example, if we want to run the dry-test with 8 nodes and the low-small test case, the following commands must be used:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
IMAGINGIOTEST_MIN=3 IMAGINGIOTEST_MAX=4 reframe/bin/reframe -C reframe_config.py -c apps/level1/imaging_iotest/reframe_iotest.py --tag dry-test$ --tag low-small$ --run --performance-report
The environment variables IMAGINGIOTEST_MIN=3 and IMAGINGIOTEST_MAX=4 will generate one test with 8 nodes for the reservation. Similarly, the tags dry-test$ and low-small$ will select only the tests with those tags. We can also restrict the tests to a given partition using the --system flag and to a programming environment with the -p flag.
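For instance, the run can be limited to one partition and one programming environment. The partition and environment names below are placeholders and must be replaced with the ones defined in reframe_config.py:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level1/imaging_iotest/reframe_iotest.py --system my-cluster:compute -p gnu --run --performance-report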
Test class documentation
- class apps.level1.imaging_iotest.reframe_iotest.ImagingIOTestDownload(*args, **kwargs)[source]
Fixture to fetch Imaging IO test source code
- class apps.level1.imaging_iotest.reframe_iotest.ImagingIOTestBuild(*args, **kwargs)[source]
Imaging IO test compile test
- class apps.level1.imaging_iotest.reframe_iotest.ImagingIOTest(*args, **kwargs)[source]
Main class of Imaging IO runtime tests
- get_num_nodes()[source]
Get number of nodes from total cores requested and number of cores per node
- set_num_tasks_job()[source]
This method sets the number of tasks for the job. We use it to override the num_tasks set for the reservation, which lets us assign num_tasks to the job in a more generic way
- pre_launch()[source]
Set prerun commands. It includes setting scratch directory and pre run commands from base class
- post_launch()[source]
Set post run commands. It includes removing visibility data files and running read-test
- set_sanity_patterns()[source]
Set sanity patterns. Example stdout:
# Fri Jul 23 15:15:41 2021[1,0]<stdout>:Operations:
We check the number of times the above line is printed and compare it with the number of sub grid workers (an illustrative sketch is shown after this list)
- extract_stream_time()[source]
Performance extraction function for stream time. Sample stdout:
Fri Jul 23 15:15:41 2021[1,2]<stdout>:Streamed for 3.53s
- extract_degrid_flop()[source]
Performance extraction function for degrid flop. Sample output:
# Fri Jul 23 15:15:41 2021[1,0]<stdout>: degrid 108.943 Gflops (10.9 GFlop/s, 92012544/92012544 visibilities, 1.47 GB, rate 0.15 GB/s, 25432 chunks)
- extract_degrid_flops()[source]
Performance extraction function for degrid flops. Sample output:
# Fri Jul 23 15:15:41 2021[1,0]<stdout>: degrid 108.943 Gflops (10.9 GFlop/s, 92012544/92012544 visibilities, 1.47 GB, rate 0.15 GB/s, 25432 chunks)
- extract_degrid_rate()[source]
Performance extraction function for degrid rate. Sample output:
# Fri Jul 23 15:15:41 2021[1,0]<stdout>: degrid 108.943 Gflops (10.9 GFlop/s, 92012544/92012544 visibilities, 1.47 GB, rate 0.15 GB/s, 25432 chunks)
- extract_fft_flop()[source]
Performance extraction function for fft flop. Sample output:
# Fri Jul 23 15:15:41 2021[1,0]<stdout>: FFTs 3.296 Gflop (0.3 Gflop/s)
- extract_fft_flops()[source]
Performance extraction function for fft flops. Sample output:
# Fri Jul 23 15:15:41 2021[1,0]<stdout>: FFTs 3.296 Gflop (0.3 Gflop/s)
- extract_write_bw()[source]
Performance extraction function for write bandwidth. Sample output:
# Fri Jul 23 15:15:41 2021[1,2]<stdout>:Writer 2: Wait: 2.02671s, Read: 0.158911s, Write: 1.28306s, Idle: 0.0658979s
- extract_read_bw()[source]
Performance extraction function for read bandwidth. Sample output:
# Fri Jul 23 15:15:41 2021[1,2]<stdout>:Writer 2: Wait: 2.02671s, Read: 0.158911s, Write: 1.28306s, Idle: 0.0658979s
For dry tests, the read time will be zero. To avoid a ZeroDivisionError, we add a small threshold value
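To make the sanity and performance patterns above concrete, here is a minimal, hedged sketch of how they could be expressed with ReFrame's sanity utilities. It is not the repository's actual implementation: the class, the num_workers attribute and the stdout path are placeholders for illustration only.
import reframe.utility.sanity as sn

class ImagingIOSketch:
    # Hypothetical attributes; in the real test these come from the test configuration
    num_workers = 4
    stdout = 'rfm_job.out'

    def set_sanity_patterns(self):
        # Count how many times the "Operations:" line appears and compare it
        # with the number of sub grid workers
        self.sanity_patterns = sn.assert_eq(
            sn.count(sn.findall(r'Operations:', self.stdout)), self.num_workers
        )

    def extract_degrid_rate(self):
        # Pull the GFlop/s figure from a line such as:
        #   degrid 108.943 Gflops (10.9 GFlop/s, ...)
        return sn.extractsingle(
            r'degrid\s+\S+\s+Gflops\s+\((?P<rate>\S+)\s+GFlop/s', self.stdout,
            'rate', float
        )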
Level 2 Benchmark Tests
RASCIL
Context
The Radio Astronomy Simulation, Calibration and Imaging Library (RASCIL) expresses radio interferometry calibration and imaging algorithms in Python and NumPy. The interfaces all operate with familiar data structures such as image, visibility table, gain table, etc. The source code is hosted on the SKA GitLab repository and the documentation can be found here.
Test parameterisation
Currently, the benchmark supports two different parameterisations, namely:
Number of nodes
Size of benchmark
The size of the benchmark defines how large a benchmark we want to run; in this case, size refers to the number of frequency channels used to make the continuum image. The scalability of the test is specified using the number of nodes. They are defined in the ReFrame test as follows:
size = parameter(['small', 'large', 'very-large', 'huge'])
num_nodes = parameter(1 << i for i in range(int(min), int(max)))
Environment variables
By default, the variables min and max are defined as 3 and 7, respectively, so this parameterisation creates tests with the number of nodes ranging from 8 to 64 in powers of 2 (8, 16, 32 and 64). The user can override the min and max variables using the custom test environment variables RASCILTEST_MIN and RASCILTEST_MAX, respectively. If either of these environment variables is set, it takes precedence over the default value. All these parameterisations are provided as tags to the ReFrame tests, so we can choose which benchmark to run simply by specifying the appropriate tags on the command line.
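For example, restricting the node-count parameterisation to a single value of 8 nodes can be done by overriding min and max on the command line (shown here with the --list flag so that only the generated tests are reported):
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
RASCILTEST_MIN=3 RASCILTEST_MAX=4 reframe/bin/reframe -C reframe_config.py -c apps/level2/rascil/reframe_rascil.py --list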
By default, the test will create a conda environment and run inside it for the sake of isolation. This can be controlled using the environment variable CREATE_CONDA_ENV. By setting it to NO, the test will NOT create a conda environment. Similarly, the performance metrics are monitored using the perfmon toolkit. If the user does not want to monitor metrics, this can be achieved by setting MONITOR_METRICS=NO.
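For instance, to skip both the conda environment creation and the metric monitoring, the two variables can be set when launching the test (the rest of the command mirrors the usage examples below):
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
CREATE_CONDA_ENV=NO MONITOR_METRICS=NO reframe/bin/reframe -C reframe_config.py -c apps/level2/rascil/reframe_rascil.py --run --performance-report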
Usage
The tests can be run using the following commands:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level2/rascil/reframe_rascil.py --run --performance-report
But first, let's see the tests generated by ReFrame using the --list flag as follows:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level2/rascil/reframe_rascil.py --list
The output is shown below:
[ReFrame Setup]
version: 3.8.0-dev.2+8a9ceeda
command: 'reframe/bin/reframe -C reframe_config.py -c apps/level2/rascil/reframe_rascil.py -l'
launched by: mpaipuri@fnancy
working directory: '/home/mpaipuri/ska-sdp-benchmark-tests'
settings file: 'reframe_config.py'
check search path: (R) '/home/mpaipuri/ska-sdp-benchmark-tests/apps/level2/rascil/reframe_rascil.py'
stage directory: '/home/mpaipuri/ska-sdp-benchmark-tests/stage'
output directory: '/home/mpaipuri/ska-sdp-benchmark-tests/output'
[List of matched checks]
- RascilTest_small_100 (found in '/home/mpaipuri/ska-sdp-benchmark-tests/apps/level2/rascil/reframe_rascil.py')
- RascilTest_large_16 (found in '/home/mpaipuri/ska-sdp-benchmark-tests/apps/level2/rascil/reframe_rascil.py')
- RascilTest_large_50 (found in '/home/mpaipuri/ska-sdp-benchmark-tests/apps/level2/rascil/reframe_rascil.py')
- RascilTest_large_100 (found in '/home/mpaipuri/ska-sdp-benchmark-tests/apps/level2/rascil/reframe_rascil.py')
- RascilAndDatasetDownloadTest (found in '/home/mpaipuri/ska-sdp-benchmark-tests/apps/level2/rascil/reframe_rascil.py')
- RascilTest_small_25 (found in '/home/mpaipuri/ska-sdp-benchmark-tests/apps/level2/rascil/reframe_rascil.py')
- RascilTest_large_25 (found in '/home/mpaipuri/ska-sdp-benchmark-tests/apps/level2/rascil/reframe_rascil.py')
- RascilTest_large_8 (found in '/home/mpaipuri/ska-sdp-benchmark-tests/apps/level2/rascil/reframe_rascil.py')
- RascilTest_small_16 (found in '/home/mpaipuri/ska-sdp-benchmark-tests/apps/level2/rascil/reframe_rascil.py')
- RascilTest_small_50 (found in '/home/mpaipuri/ska-sdp-benchmark-tests/apps/level2/rascil/reframe_rascil.py')
- RascilTest_small_8 (found in '/home/mpaipuri/ska-sdp-benchmark-tests/apps/level2/rascil/reframe_rascil.py')
- RascilBuildTest (found in '/home/mpaipuri/ska-sdp-benchmark-tests/apps/level2/rascil/reframe_rascil.py')
Found 12 check(s)
Log file(s) saved in '/home/mpaipuri/ska-sdp-benchmark-tests/reframe.log', '/home/mpaipuri/ska-sdp-benchmark-tests/reframe.out'
As we can see, ReFrame generates 12 tests, and they are for a single system partition. If multiple partitions and environments are defined, the number of tests is multiplied by the number of partitions and environments.
For example, if we want to run the small test with 8 nodes, the following commands must be used:
cd ska-sdp-benchmark-tests
conda activate ska-sdp-benchmark-tests
reframe/bin/reframe -C reframe_config.py -c apps/level2/rascil/reframe_rascil.py --tag 8$ --tag small --run --performance-report
Test class documentation
- class apps.level2.rascil.reframe_rascil.RascilAndDatasetDownloadTest(*args, **kwargs)[source]
Fetch RASCIL sources and datasets
- class apps.level2.rascil.reframe_rascil.RascilTest(*args, **kwargs)[source]
Main class of RASCIL runtime tests
- set_sanity_patterns()[source]
Set sanity patterns. When RASCIL finishes the job successfully, it creates image files in FITS format. We check whether these files are created as a sanity check
- extract_times(var='create_blockvisibility_from_ms')[source]
Generic performance extraction function to extract time
- extract_wall_time()[source]
Performance function to extract wall time. Sample output:
# 26/07/2021 05:41:22 PM.110 rascil-logger INFO Started : 2021-07-26 13:52:07.487374
# 26/07/2021 05:41:22 PM.110 rascil-logger INFO Finished : 2021-07-26 17:41:22.110077
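As a rough, hedged illustration (not the test's actual implementation), the wall time can be recovered by parsing the Started and Finished timestamps from the log and taking their difference; the log excerpt and format below are assumed to match the sample above:
import re
from datetime import datetime

log = (
    '26/07/2021 05:41:22 PM.110 rascil-logger INFO Started : 2021-07-26 13:52:07.487374\n'
    '26/07/2021 05:41:22 PM.110 rascil-logger INFO Finished : 2021-07-26 17:41:22.110077\n'
)

def parse_ts(label):
    # Pull the timestamp that follows "Started :" or "Finished :"
    match = re.search(rf'{label}\s*:\s*(\S+ \S+)', log)
    return datetime.strptime(match.group(1), '%Y-%m-%d %H:%M:%S.%f')

wall_time = (parse_ts('Finished') - parse_ts('Started')).total_seconds()
print(f'Wall time: {wall_time:.1f} s')  # about 13754.6 s for the sample above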