Framework philosophy

Adding new tests

To add a new test to the benchmark suite, follow these steps: 1. Decide whether the test belongs in level 0, level 1 or level 2. Then create a folder in the corresponding location and add the following files:

  • reframe_<test_name>.py: This is the main test file, where we define the test class derived from one of the benchmark base classes BenchmarkBase, RunOnlyBenchmarkBase or CompileOnlyBenchmarkBase. Make sure to use the one corresponding to your specific needs; a minimal skeleton is sketched at the end of this list. These classes follow the ReFrame concepts documented in https://reframe-hpc.readthedocs.io/en/stable/tutorial_advanced.html#writing-a-run-only-regression-test.

  • TEST_NAME.ipynb: Jupyter notebook to plot the performance metrics derived from the test

  • README.md: A simple readme file that gives high level instructions on where to find the documentation of the test.
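
  For orientation, the sketch below shows the rough shape of such a test file. The class name, environment name and executable are placeholders, and the exact import path of the base class depends on how it is exposed in this repository.

    class MyNewBenchmark(RunOnlyBenchmarkBase):
        """Skeleton of a run-only benchmark test (illustrative only)"""

        descr = 'Short description of the test'

        def __init__(self):
            super().__init__()
            self.valid_prog_environs = ['my-test-env']  # placeholder environment
            self.valid_systems = filter_systems_by_env(self.valid_prog_environs)
            self.executable = 'my_benchmark'            # placeholder executable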

  1. Define the test procedure:

    • Does the test need some sources or packages from the internet, be it its own sources, python packages or any other dependencies? If yes, create a test dependency that fetches everything. Every test or test dependency that accesses the internet needs to inherit from the UsesInternet mixin.

    
    
    class IdgTestDownload(FetchSourcesBase):
        """Fixture to fetch IDG source code"""
        
        descr = 'Fetch source code of IDG'
    
    • Does the test need to compile fetched dependencies? If yes, create a test dependency that builds the sources. If the sources are fetched in a previous test, be sure to include this as a dependent fixture: app_src = fixture(DownloadTest, scope='session').

    
    
    class IdgTestBuild(CompileOnlyBenchmarkBase):
        """IDG test compile test"""
        
        descr = 'Compile IDG test from sources'
        
        # Share resource from fixture
        idg_test_src = fixture(IdgTestDownload, scope='session')
    
        def __init__(self):
            super().__init__()
            self.valid_prog_environs = [
                'idg-test',
            ]
            self.valid_systems = filter_systems_by_env(self.valid_prog_environs)
            self.maintainers = [
                'Mahendra Paipuri (mahendra.paipuri@inria.fr)'
            ]
            # Cross compilation is not possible on certain g5k clusters. We force
            # the job to be non-local so building will be on remote node
            if 'g5k' in self.current_system.name:
                self.build_locally = False
        
        @run_before('compile')
        def set_sourcedir(self):
            """Set source path based on dependencies"""
            self.sourcesdir = self.idg_test_src.stagedir
            
        @run_before('compile')
        def set_prebuild_cmds(self):
            """Make local lib dirs"""
            self.lib_dir = os.path.join(self.stagedir, 'local')
            self.prebuild_cmds = [
                f'mkdir -p {self.lib_dir}',
            ]
    
        @run_before('compile')
        def set_build_system_attrs(self):
            """Set build directory and config options"""
            self.build_system = 'CMake'
            self.build_system.builddir = os.path.join(self.stagedir, 'build')
            self.build_system.config_opts = [
                f'-DCMAKE_INSTALL_PREFIX={self.lib_dir}',
                '-DBUILD_LIB_CUDA=ON',
                '-DPERFORMANCE_REPORT=ON',
            ]
            self.build_system.max_concurrency = 8
            
        @run_before('compile')
        def set_postbuild_cmds(self):
            """Install libs"""
            self.postbuild_cmds = [
                'make install',
            ]
    
        @run_before('sanity')
        def set_sanity_patterns(self):
            """Set sanity patterns (illustrative completion: check that the
            install step produced the local lib directory)"""
            self.sanity_patterns = sn.assert_true(os.path.isdir(self.lib_dir))
    
  2. Write the test itself.

    • Define all dependencies as fixture, all parameters as parameter and all variables as variable. Tests are run for all permutations of parameters, whereas variables define specific behaviour for a single run (like the number of nodes), as illustrated below.
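
    A short sketch of these builtins; the class name and the concrete values are placeholders, while DownloadTest refers to a fetch fixture as above:

    class MyBenchmark(RunOnlyBenchmarkBase):
        """Illustrative use of fixture, parameter and variable"""

        # Dependency shared across the whole session
        app_src = fixture(DownloadTest, scope='session')

        # One test instance is generated for every parameter value
        variant = parameter(['write-test', 'read-test'])

        # A variable holds a single, overridable value for a run
        num_nodes = variable(int, value=1)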

    • Set the valid_prog_environs and the valid_systems in the __init__ method.

    • Define the executable and executable options.
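
    For example (the binary name and the options below are placeholders):

        @run_before('run')
        def set_executable(self):
            """Define the executable and its options"""
            self.executable = 'my_benchmark'
            self.executable_opts = ['--niter', '10']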

    • Define the Sanity Patterns. You can define which patterns must and must not appear in the stdout and stderr.

        @run_before('run')
        def pre_launch(self):
            """Set pre-run commands"""
            self.prerun_cmds = [
                f'cd {self.executable_path}',
            ]
    
        @run_before('run')
        def post_launch(self):
            """Set post run commands. It includes removing visibility data files and running
            read-test"""
            if self.variant == 'write-test':
                self.postrun_cmds.insert(0, 'rm -rf $SCRATCH_DIR_TEST/out%d.h5')
            elif self.variant == 'read-test':
                # run_command returns with default num tasks which is 1
                cmd = self.job.launcher.run_command(self.job).replace(str(1), str(self.num_tasks_job))
                # The first job writes the visibility data and this job reads them back.
                # This gives us the read bandwidth benchmark
                exec_opts = " ".join([*self.executable_opts[:-1], '--check-existing',
                                      self.executable_opts[-1]])
                self.postrun_cmds.insert(0, f'{cmd} {self.executable} {exec_opts}')
    
    • Define the Performance Functions. The data is extracted from the output stream using regular expressions.

            """Set sanity patterns. Example stdout:
    
            .. code-block:: text
    
                >>> Total runtime
                gridding:   6.5067e+02 s
                degridding: 1.0607e+03 s
                fft:        3.5437e-01 s
                get_image:  6.5767e+00 s
                imaging:    2.0073e+03 s
    
                >>> Total throughput
                gridding:   3.12 Mvisibilities/s
                degridding: 1.91 Mvisibilities/s
                imaging:    1.01 Mvisibilities/s
    
            """
            self.sanity_patterns = sn.all([
                sn.assert_found('Total runtime', self.stderr),
                sn.assert_found('Total throughput', self.stderr),
            ])
    
        @performance_function('s')
        def extract_time(self, kind='gridding'):
            """Performance extraction function for time. Sample stdout:
    
    
            .. code-block:: text
    
                >>> Total runtime
                gridding:   7.5473e+02 s
                degridding: 1.1090e+03 s
                fft:        3.5368e-01 s
                get_image:  7.2816e+00 s
                imaging:    1.8899e+03 s
                
            """
            return sn.extractsingle(rf'^{kind}:\s+(?P<value>\S+) s', self.stderr, 'value', float)
        
        @performance_function('Mvisibilities/s')
        def extract_vis_thpt(self, kind='gridding'):
            """Performance extraction function for visibility throughput. Sample stdout:
    
    
            .. code-block:: text
    
                >>> Total throughput
                gridding:   2.69 Mvisibilities/s
                degridding: 1.83 Mvisibilities/s
                imaging:    1.07 Mvisibilities/s
    
            """
            return sn.extractsingle(rf'^{kind}:\s+(?P<value>\S+) Mvisibilities/s', self.stderr, 'value', float)
    
        @run_before('performance')
        def set_perf_patterns(self):
            """Set performance variables"""
            self.perf_variables = {
                'gridding s': self.extract_time(),
                'degridding s': self.extract_time(kind='degridding'),
                'fft s': self.extract_time(kind='fft'),
                'get_image s': self.extract_time(kind='get_image'),
                'imaging s': self.extract_time(kind='imaging'),
                'gridding Mvis/s': self.extract_vis_thpt(),
                'degridding Mvis/s': self.extract_vis_thpt(kind='degridding'),
                'imaging Mvis/s': self.extract_vis_thpt(kind='imaging'),
            }
    
        @run_before('performance')
        def set_reference_values(self):
            """Set reference perf values"""
            self.reference = {
                '*': {
                    # One reference entry per performance variable; duplicate
                    # '*' keys would silently overwrite each other
                    'gridding s': (None, None, None, 's'),
                    'degridding s': (None, None, None, 's'),
                    'fft s': (None, None, None, 's'),
                    'get_image s': (None, None, None, 's'),
                    'imaging s': (None, None, None, 's'),
                    'gridding Mvis/s': (None, None, None, 'Mvis/s'),
                    'degridding Mvis/s': (None, None, None, 'Mvis/s'),
                    'imaging Mvis/s': (None, None, None, 'Mvis/s'),
                }
            }
    

The sanity and performance functions are both based on the concept of “deferrable functions”. Be sure to check the official ReFrame documentation on how to use them properly; a small illustration is given below.
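
For illustration, the following minimal sketch (the regular expression and the file name are placeholders) shows that a deferrable expression built with reframe.utility.sanity is only evaluated when explicitly requested, e.g. by ReFrame's sanity and performance stages or via sn.evaluate():

    import reframe.utility.sanity as sn

    # extractsingle() does not read the file here; it returns a deferred expression
    bandwidth = sn.extractsingle(r'bandwidth:\s+(?P<bw>\S+) MB/s',
                                 'rfm_job.out', 'bw', float)

    # The file is read and the regex applied only when the value is needed
    print(sn.evaluate(bandwidth))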

These steps allow you to write a basic ReFrame test. For a more detailed view, take a look at the ReFrame documentation. There is no strict convention on how to name a test; the tests already provided can be used as templates for writing new ones. The idea is to provide an environment for a given test and to define all test-related settings, such as modules to load and environment variables, within this environment. We also need to add to the target_systems of this environment the systems on which we would like to run the test. The details of adding a new environment and a new system are presented below.

  1. Write a unit test procedure. As an example, you can take the test class for CondaEnvManager located in unittests/test_conda_env_manager.py.

    • Create a test file with a test class for your benchmark:

      import unittest
      import modules.conda_env_manager as cem
      import os
      import logging
      
      # INFO ERROR DEBUG WARNING
      # logging.basicConfig(level=logging.DEBUG)
      LOGGER = logging.getLogger(__name__)
      
      
      class CondaEnvManagerTest(unittest.TestCase):
      
    • Add a test method for each functionality:

          def test_create(self):
              logging.info("CondaEnvManagerTest")
              logging.info("--------------------\n")
              logging.info("test_create")
              test = cem.CondaEnvManager("test_create", False)
              test.remove()
              self.assertTrue(test.create())
              self.assertFalse(test.create())
              test.remove()
      
    • Check results via the self.assertTrue and self.assertFalse methods (unittest doc <https://docs.python.org/3/library/unittest.html>):

              self.assertTrue(test.create())
              self.assertFalse(test.create())
      
    • Add the name of your unit test file (without the extension) to the list of unit tests checked by the sdp-unittest.sh script.


We recommend systematically using the logging module (logging doc <https://docs.python.org/3/library/logging.html>), which provides six levels of screen output (CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET).

To run the unit test procedure, we provide an experimental bash script, sdp-unittest.sh, that launches the different unit tests automatically and offers several options. We use the pytest package (pytest doc <https://docs.pytest.org/en/7.1.x/contents.html>) to provide a parallelized and efficient unit-testing procedure.

Adding a new system

Every time we want to add a new system, we typically need to follow these steps:

  • Create a new Python file <system_name>.py in the config/systems folder.

  • Add system configuration and define partitions for the system. More details on how to define a partition and naming conventions are presented later.

  • Import this file into reframe_config.py and add this new system to the site_configuration.

  • The final step is to get the processor info of the system nodes using ReFrame's --detect-host-topology option, place the generated file in the topologies folder, and include it under the processor key of each partition.

The user is advised to consult the ReFrame documentation before doing so. The provided systems can be used as a template to add new systems.

We try to follow a certain convention when defining system partitions. Firstly, we define partitions, either physical or abstract, based on the compiler toolchain and MPI implementation, so that when we use a partition the modules related to the compiler and MPI are loaded. The rest of the modules, which are test-related, are added to the environs discussed later. Consequently, we should also name these partitions according to a standard scheme. The benefit of such a scheme is two-fold: we can get a high-level overview of a partition quickly, and by choosing appropriate names we can easily filter systems for the tests. An example use case is running a certain test on all partitions that support GPUs: by using a partition name with gpu as a suffix, we can simply filter all partitions by matching the string gpu, as sketched below.
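
As a minimal sketch of such filtering, assuming the standard ReFrame site_configuration dictionary layout (the helper name is hypothetical):

    def gpu_partitions(site_configuration):
        """Return all 'system:partition' names whose partition name ends with 'gpu'"""
        return [
            f"{system['name']}:{partition['name']}"
            for system in site_configuration['systems']
            for partition in system['partitions']
            if partition['name'].endswith('gpu')
        ]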

We use the convention {prefix}-{compiler-name-major-ver}-{mpi-name-major-ver}-{interconnect-type}-{software-type}-{suffix}.

  • Prefix can be the name of the partition or cluster.

  • compiler-name-major-ver can be as follows:
    • gcc9: GNU compiler toolchain with major version 9

    • icc20: Intel compiler toolchain with major version 2020

    • xl16: IBM XL toolchain with major version 16

    • aocc3: AMD AOCC toolchain with major version 3

  • mpi-name-major-ver is the name of the MPI implementation. Some of them are:
    • ompi4: OpenMPI with major version 4

    • impi19: Intel MPI with major version 2019

    • pmpi5: IBM Platform MPI with major version 5

    • smpi10: IBM Spectrum MPI with major version 10

  • interconnect-type is the type of interconnect on the partition.
    • ib: Infiniband

    • roce: RoCE (RDMA over Converged Ethernet)

    • opa: Intel Omnipath

    • eth: Ethernet TCP

  • software-type is the type of software stack used.
    • smod: System provided software stack

    • umod: User built software stack using Spack

  • suffix can indicate any special properties of the partition, such as gpu, high-memory nodes, high-priority job queues, etc. There can be multiple suffixes, each separated by a hyphen.

Important

If the package uses calendar versioning, we use only the last two digits of the year in the name to be concise. For example, Intel MPI 2019.* would be impi19.

For instance, in the configuration shown in the ReFrame configuration, compute-gcc9-ompi4-roce-umod tells us that the partition uses the GCC compiler with OpenMPI, RoCE as the interconnect, and a software stack built in user space using Spack.

Important

It is recommended to stick to this convention. There can be more possibilities for each category, which should be added as new systems are added.

Adding a new environment

Adding a new system is not enough to run the tests on it. We need to tell our ReFrame tests that a new system is available in the config. In order to minimise redundancy in the configuration details and avoid modifying the source code of the tests, we choose to provide an environ for each test. For example, there is an HPL test in the apps/level0/hpl folder, and for this test we define an environ in config/environs/hpl.py.

Note

System partitions and environments should have a one-to-one mapping: whatever environment we define within the environs section of a system partition, we should also list that partition within the target_systems of the corresponding environ.

All the modules needed to run the test, apart from the compiler and MPI, are added to the modules section of each environ. For example, let's take a look at the hpl.py file:

""""This file contains the environment config for HPL benchmark"""


hpl_environ = [
    {
        'name': 'intel-hpl',
        'cc': 'mpicc',
        'cxx': 'mpicxx',
        'ftn': 'mpif90',
        'modules': [
            'intel-oneapi-mkl/2021.3.0',
        ],
        'variables':[
            ['XHPL_BIN', '$MKLROOT/benchmarks/mp_linpack/xhpl_intel64_dynamic'],
        ],
        'target_systems': [
            'alaska:compute-icc21-impi21-roce-umod',
            # <end - alaska partitions>
            'grenoble-g5k:dahu-icc21-impi21-opa-umod',
            # <end - grenoble-g5k partitions>
            'juwels-cluster:batch-icc21-impi21-ib-umod',
            # <end juwels partitions>
            'nancy-g5k:gros-icc21-impi21-eth-umod',
            # <end - nancy-g5k partitions>
            'cscs-daint:daint-icc21-impi21-ib-umod-gpu',
            # <end - cscs partitions>
        ],
    },
    {
        'name': 'gnu-hpl',
        'cc': '',
        'cxx': '',
        'ftn': '',
        'modules': [
            'amdblis/3.0',
        ],
        # 'variables': [
        #     ['UCX_TLS', 'ud,rc,dc,self']
        # ],
        'target_systems': [
            'juwels-booster:booster-gcc9-ompi4-ib-umod',
            # <end juwels partitions>
        ],
    },
    {
        'name': 'intel-hpl',
        'cc': 'mpicc',
        'cxx': 'mpicxx',
        'ftn': 'mpif90',
        'modules': [
            'intel-oneapi-mkl/2023.0.0',
        ],
        'variables':[
            ['XHPL_BIN', '$MKLROOT/benchmarks/mp_linpack/xhpl_intel64_dynamic'],
        ],
        'target_systems': [
            'cscs-daint:daint-icc21-impi21-ib-umod-gpu',
            # <end - cscs partitions>
        ],
    },
]

There are two different environments, namely intel-hpl and gnu-hpl. As the names suggest, intel-hpl uses the HPL benchmark shipped with MKL and optimized for Intel chips, whereas gnu-hpl is used for other chips, such as AMD, with the GNU toolchain. Notice that the target_systems for intel-hpl contain only partitions with an Intel MPI implementation (impi in the name), whereas the target_systems for gnu-hpl use an OpenMPI implementation. Within the test, we define only the valid programming environments and find the valid systems by filtering all systems that have the given environment defined for them.

For instance, suppose we define a new system partition with an Intel chip, named mycluster-gcc-impi-ib-umod. If we want the HPL test to run on this partition, we add intel-hpl to the environs section of the system partition and, similarly, add the name of this partition to target_systems in the intel-hpl environment, as sketched below. Once we do that, the test will run on this partition without having to modify anything in the source code of the test.
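
A minimal sketch of the two changes, assuming the usual ReFrame site configuration layout (the system name mysystem and the scheduler/launcher values are placeholders):

# config/systems/<system_name>.py: enable the environment on the new partition
mycluster_system = {
    'name': 'mysystem',
    'partitions': [
        {
            'name': 'mycluster-gcc-impi-ib-umod',
            'scheduler': 'slurm',   # placeholder
            'launcher': 'srun',     # placeholder
            'environs': [
                'intel-hpl',        # HPL will now pick up this partition
            ],
        },
    ],
}

# config/environs/hpl.py: add the partition to target_systems of intel-hpl
'target_systems': [
    'mysystem:mycluster-gcc-impi-ib-umod',
],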

If we want to add a new test, we will need to add a new environment, and the following steps should be followed:

  • Create a new file <env_name>.py in the config/environs folder and add the environment configuration for the test. It is important to add this new environment to the existing and/or new system partitions that are listed in target_systems of the environment configuration.

  • Finally, import <env_name>.py in the main ReFrame configuration file reframe_config.py and add it to the configuration dictionary.
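
A minimal sketch of that last step (the module and variable names are hypothetical):

# reframe_config.py
from config.environs.hpl import hpl_environ
from config.environs.mytest import mytest_environ   # hypothetical new environment

site_configuration = {
    'systems': [
        # ... systems imported from config/systems ...
    ],
    'environments': [
        *hpl_environ,
        *mytest_environ,
    ],
}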

Adding Spack configuration test

After defining a new system, we need a software stack on it to be able to run tests. If the user chooses to use the platform-provided software stack, this step can be skipped. Otherwise, we need to define Spack config files in order to deploy the software stack. The existing config files provided for other systems can be used as a base. Typically, only the compilers.yml, modules.yml and spack.yml files need to be changed for a new system. We need to update the system compiler versions and their paths in compilers.yml, and also in the core_compilers section of modules.yml. Similarly, the desired software stack to be installed on the system is defined in the spack.yml file.

Once these configuration files are ready, we create a new folder in the spack/spack_tests folder with the name of the system, place all configuration files in configs/, and define a ReFrame test to deploy this software stack. The existing test files can be used as templates. The ReFrame test file itself is very minimal; the user only needs to set the name of the cluster and the path where Spack must be installed in the test body.

Adding Conda environment via CondaEnvManager

CondaEnvManager is a class that provides a dedicated conda environment for each test (benchmark) we want to create. The advantage is that we avoid the dependency problems we could have with the shared ska-sdp-benchmark environment (e.g. a different version of the numpy package). For the moment, we provide this approach only for the level0/cpu/fft test and the associated unit test. The unit test example is used as a basis below.

Note: self.assertTrue() and self.assertFalse() are just unittest assertion methods.

Enable the module

import modules.conda_env_manager as cem

Create a conda environment

Instantiating the class does not directly create a conda environment with the name you provide; you must call create() explicitly:
        test = cem.CondaEnvManager("test_remove", False)
        test.create()

Remove a conda environment

Remove the created environment and all installed packages.
        test.remove()

Install a package via pip

The install_pip method can download and install a package directly, or install it from a local folder if the package has already been downloaded. You can also provide the path to a requirements.txt file to install multiple packages at once (see the section "Download and install a list of packages locally via pip" below).
        test = cem.CondaEnvManager("test_install_pip_package", False)
        test.create()
        self.assertTrue(test.install_pip("zipp"))

Download a package locally via pip

Packages are downloaded locally into a per-environment cache folder. This is useful if there is no internet connection on the compute cluster:
        test = cem.CondaEnvManager("test_download_install_local_pip_package", False)
        test.create()
        self.assertTrue(test.download_pip("zipp"))

Download and install a list of packages locally via pip

A list of packages for an environment can be downloaded via a requirements.txt file. Provide the path to your requirements file for both the download and installation steps:
        test = cem.CondaEnvManager("test_install_download_local_pip_file", False)
        test.create()
        root_dir = os.path.dirname(__file__)
        requirements_true = os.path.join(root_dir, 'resources', 'requirements.txt')
        requirement_false = os.path.join(root_dir, 'resources', 'requirements_false.txt')
        self.assertTrue(test.download_pip(requirements_true))
        self.assertFalse(test.download_pip(requirement_false))
        self.assertTrue(test.install_pip(requirements_true, True))
        self.assertFalse(test.install_pip(requirement_false, True))

Purge the pip cache

Remove the downloaded packages from the cache folder of a created environment.
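
A usage sketch in the same style as the sections above; the method name purge_pip_cache is a hypothetical placeholder, so check the CondaEnvManager source for the actual method name:

        test = cem.CondaEnvManager("test_purge_pip_cache", False)
        test.create()
        test.download_pip("zipp")
        # purge_pip_cache is a hypothetical placeholder for the method that
        # empties the per-environment pip cache
        test.purge_pip_cache()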

Separation of Parts that Access the Internet

Any test or test dependency that accesses the internet needs to implement the mixin UsesInternet. This mixin ensures that parts that need to download assets from the internet are run on a partition where internet can be accessed. Any test that accesses the internet and does not implement the UsesInternet mixin is not guaranteed to have internet available and might fail.

Best practice is to separate everything that needs to access the web into dedicated dependencies, be it dataset downloads, source downloads or package installations that fetch from repositories. These dependencies must implement the UsesInternet mixin and should be defined as fixtures of the test, as sketched below.
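
A minimal sketch of this pattern; the class names and the command-line option are placeholders, and it is assumed here that UsesInternet can be mixed into a benchmark base class as shown:

    class MyDatasetDownload(RunOnlyBenchmarkBase, UsesInternet):
        """Fixture that fetches an input dataset from the internet"""

        descr = 'Fetch input dataset'


    class MyBenchmark(RunOnlyBenchmarkBase):
        """Benchmark that only reads the already-staged dataset"""

        dataset = fixture(MyDatasetDownload, scope='session')

        @run_before('run')
        def set_data_path(self):
            """Point the executable at the staged download (option name is a placeholder)"""
            self.executable_opts = [f'--data-dir={self.dataset.stagedir}']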