Documentation of Platform Scripts¶
SDP Benchmark Suite¶
The following sections provide the context, prerequisites and usage of the benchmark suite.
About¶
The aim of this package is to create an SDP benchmark suite from the available prototype pipelines that can be tested on different production HPC machines and hardware. The development of a benchmark suite of this kind was proposed in the SKA Computing Hardware Risk Mitigation Plan (SKA-TEL-SKO-0001083). This package automates the deployment of benchmarks, i.e., from compiling the code to parsing the output for relevant metrics. The package is developed using a modular approach, so as more prototype codes become available in the future they can be readily integrated into the benchmark suite.
Available Benchmarks¶
Currently, the benchmark suite contains only the imaging IO test code developed by Peter Wortmann. More benchmark pipelines will be added to the suite in the future.
Imaging IO test¶
The aim of this section is to give a high-level overview of what the imaging IO code does. The input to the code is a sky image and the output is the visibility data. The code currently implements only the “predict” part of the imaging algorithm. More details can be found here. There are several input parameters to the benchmark, which can be found in the documentation of the code. More details about the code and algorithms can be found in this memo.
Prerequisites¶
The following prerequisites must be installed to use the benchmark suite:
python >= 3.6
singularity
git
OpenMPI (with threading support)
For the Imaging IO test, we should have the following dependencies installed:
cmake
HDF5 (doesn’t need to have parallel support)
FFTW3
The benchmark suite will not install any missing dependencies, and therefore it is necessary to install or load all the dependencies before running the benchmark suite.
The benchmark suite is tested on both Linux and macOS. It has not been tested on Windows and there is no guarantee that it will work on Windows systems.
On macOS, Homebrew can be used to install all the listed dependencies.
Installation¶
Currently, the suite is not packaged as a python library, so the only way to use it is to clone the git repository and run the main script.
To set up the repository and get configuration files:
git clone https://gitlab.com/ska-telescope/platform-scripts
cd platform-scripts/ska-sdp-benchmark-suite
To install all the required python modules:
pip3 install --user -r requirements.txt
If you would like to run the unit tests as well, install the python modules in requirements-tests.txt too.
Configuration Files¶
The benchmark suite uses configuration files to define various options. The configuration file is split into two pieces to separate the system-specific configuration from the benchmark-related configuration. Samples of these two configuration files can be found in the repository.
The default benchmark configuration file defines the global settings and the benchmark-related settings. The global settings can be overridden from the CLI. More details on the global settings are presented in the Usage section of the documentation.
The system-dependent configuration file defines all the relevant settings for the system the benchmark suite is running on. It includes the names of the modules to load, MPI-specific settings such as MCA parameters and the interconnect, and batch scheduler settings. Typically these settings are defined once per system and do not have to be changed afterwards.
It is recommended to copy this file to the user's home directory and modify it according to the system settings. By default, the benchmark suite reads this config file from the project directory (platform-scripts/ska-sdp-benchmark-suite/sdpbenchmarks/config/.ska_sdp_bms_system_config.yml) and the user's home directory ($HOME/.ska_sdp_bms_system_config.yml). The config file in the home directory takes precedence over the one in the project directory. The different options in the configuration file are explained in the comments of each default file present in the repository.
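As an illustration of this precedence rule, a minimal Python sketch is shown below. It assumes PyYAML is available and that the default file locations described above are used; it is not the suite's actual loading code.

import os
import yaml  # assumes PyYAML is available

def load_system_config(project_dir):
    """Merge the project copy of the system config with the copy in $HOME.

    The home directory copy is applied last, so it overrides the project one.
    """
    candidates = [
        os.path.join(project_dir, "sdpbenchmarks", "config", ".ska_sdp_bms_system_config.yml"),
        os.path.expanduser("~/.ska_sdp_bms_system_config.yml"),
    ]
    config = {}
    for path in candidates:
        if os.path.isfile(path):
            with open(path) as fh:
                config.update(yaml.safe_load(fh) or {})
    return config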
Usage¶
The main script to run the benchmark suite is sdpbenchrun.py, which can be found in ska-sdp-benchmark-suite/scripts. The suite can be launched with a default configuration using the following command:
python3 sdpbenchrun.py
This runs the suite with the default config file.
To see all the available options, run python3 sdpbenchrun.py --help, which gives the following output:
usage: sdpbenchrun.py [-h] [-r [RUN_NAME]] [-b BENCHMARKS [BENCHMARKS ...]]
[-c [CONFIG]] [-m RUN_MODE [RUN_MODE ...]]
[-n NUM_NODES [NUM_NODES ...]] [-d [WORK_DIR]]
[-t [SCRATCH_DIR]] [-s] [-p [REPETITIONS]] [-o] [-l]
[-e] [-a] [-v] [--version]
SKA SDP Pipeline Benchmark Tests
optional arguments:
-h, --help show this help message and exit
-r [RUN_NAME], --run_name [RUN_NAME]
Name of the run
-b BENCHMARKS [BENCHMARKS ...], --benchmarks BENCHMARKS [BENCHMARKS ...]
List of benchmarks to run.
-c [CONFIG], --config [CONFIG]
Configuration file to use. If not provided, load
default config file.
-m RUN_MODE [RUN_MODE ...], --run_mode RUN_MODE [RUN_MODE ...]
Run benchmarks in container and/or bare metal.
-n NUM_NODES [NUM_NODES ...], --num_nodes NUM_NODES [NUM_NODES ...]
Number of nodes to run the benchmarks.
-d [WORK_DIR], --work_dir [WORK_DIR]
Directory where benchmarks will be run.
-t [SCRATCH_DIR], --scratch_dir [SCRATCH_DIR]
Directory where benchmark output files will be stored.
-s, --submit_job Run the benchmark suite in job submission mode.
-p [REPETITIONS], --repetitions [REPETITIONS]
Number of repetitions of benchmark runs.
-o, --randomise Randomise the order of runs.
-l, --ignore_rem In case of unfinished jobs, skip remaining jobs on
relaunch.
-e, --export Export all stdout, json and log files from run_dir and
compresses them.
-a, --show Show running config and exit.
-v, --verbose Enable verbose mode. Display debug messages.
--version show program's version number and exit
All the CLI arguments can also be provided in the configuration file. However, the options passed on the CLI override those given in the configuration file.
Arguments¶
If no --run_name argument is provided, the benchmark suite creates a name from the hostname and the current timestamp by default. It is advised to give a run name to better organise different benchmark runs. User-defined configuration files can be provided using the --config option. Note that the system-specific user configuration should be placed in the user's home directory, as suggested in Configuration Files. The benchmark suite supports running in batch submission mode, which can be invoked with the --submit_job option. All the benchmark-related files are placed in the directory specified by the --work_dir option. This includes source code, compilation files, stdout files, etc. Currently only the Imaging IO test is included in the benchmark suite and, depending on the configuration used, it can generate a huge amount of data; --scratch_dir can be used to write this kind of data. Users must have permission to read and write both --work_dir and --scratch_dir. It is also important that the fastest file system available to the machine is used as --scratch_dir, as we are interested in high I/O bandwidths. --repetitions specifies the number of times each experiment is repeated. Similarly, the --randomise option randomises the order of experiments, which helps to minimise cache effects.
Example use cases¶
To run with a user-defined config path/to/my_config.yml and a run name trial_run:
python3 sdpbenchrun.py -r trial_run -c path/to/my_config.yml
To override the run mode in the default config file and run in both bare-metal and container modes:
python3 sdpbenchrun.py -m bare-metal singularity
Multiple inputs to an argument must be delimited by spaces.
To run the benchmark with 2, 4 and 8 nodes:
python3 sdpbenchrun.py -n 2 4 8
Note that in such a case it is better to use batch submission mode with the -s option. Otherwise, we would have to make a reservation for 8 nodes, and for the runs that use only 2 or 4 nodes the rest of the nodes would sit idle. When the -s option is added, the benchmark suite submits jobs to the scheduler based on the configuration provided in the system config file. Currently, only the SLURM and OAR batch schedulers are supported; a simplified sketch of the submission step is shown below.
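The suite's own job-file writers and submission helper are documented in the API section below (write_slurm_job_file, execute_job_submission). As a rough, hedged sketch of the submission pattern on a SLURM system (not the suite's actual implementation):

import subprocess

def submit_slurm_job(job_script, run_prefix):
    """Submit a generated job script with sbatch and return the job ID (sketch)."""
    result = subprocess.run(["sbatch", job_script], stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    if result.returncode != 0:
        raise RuntimeError("{}: sbatch failed: {}".format(run_prefix, result.stderr))
    # sbatch prints "Submitted batch job <id>" on success
    return result.stdout.strip().split()[-1]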
To export the results as compressed files, use:
python3 sdpbenchrun.py -e
In a similar way, multiple values can be given for the benchmark configuration parameters. More details on how to provide them are documented in the default configuration file.
Based on the provided configuration files and CLI arguments, a parameter space is created and a benchmark run is carried out for each combination in that space, as illustrated in the sketch below.
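Conceptually, the parameter space is the Cartesian product of all the value lists supplied; the utils.sweep helper documented later serves this purpose. A hedged illustration (parameter names are hypothetical):

import itertools
import random

# Hypothetical parameter lists; the real names come from the config file and the CLI
parameters = {
    "num_nodes": [2, 4, 8],
    "run_mode": ["bare-metal", "singularity"],
    "repetition": [1, 2],
}

# Every combination of the values above becomes one benchmark run
combinations = [dict(zip(parameters, values))
                for values in itertools.product(*parameters.values())]

random.shuffle(combinations)  # roughly what --randomise does
print(len(combinations), "runs in the parameter space")  # 12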
Job submission mode¶
By default the benchmark suite runs in interactive mode, which means the benchmarks are run on the current node(s). It can also be run inside a SLURM job as follows:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH -J sdp-benchmarks
#SBATCH --no-requeue
#SBATCH --exclusive
module load gnu7 openmpi3 hdf5
workdir=$PWD
# Make sure we have the right working directory
cd $workdir
echo -e "JobID: $SLURM_JOB_ID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"
echo "Output directory: $outdir"
echo -e "\nnumtasks=$SLURM_NTASKS, numnodes=$SLURM_JOB_NUM_NODES, OMP_NUM_THREADS=$OMP_NUM_THREADS"
CMD="python3 sdpbenchrun.py -n $SLURM_JOB_NUM_NODES"
echo -e "\nExecuting command:\n==================\n$CMD\n"
eval $CMD
echo -e "\n=================="
echo "Finish time: `date`"
This script will run the benchmark suite with the default configuration and the number of nodes in the SLURM reservation. Sometimes it is not convenient to run the benchmarks in this way. For example, if we want to do scalability tests, we need to run the benchmark with different numbers of nodes. In this case, one option is to submit several SLURM jobs for the different node counts. However, this can be done in a more streamlined fashion. Let's say we want to run on 2, 4, 8, 16 and 32 nodes. By invoking job submission mode, we can simply use
python3 sdpbenchrun.py -r scalability_test -n 2 4 8 16 32 -s
on the login node. This will submit the SLURM job scripts; here we name our test run scalability_test. Once all the jobs are finished, if we rerun the same command, the benchmark suite will parse the output from the benchmarks and save all the data. If not all the jobs have finished, the benchmark suite will collect the results from the finished jobs and show the state of the unfinished ones.
If for some reason one of the jobs fails to finish successfully, the benchmark suite marks that test as failed. If we rerun the benchmark suite with the same configuration, the suite already knows which tests have finished successfully and which have failed, and it will rerun only the failed tests. This works in both interactive and job submission mode; a simplified sketch of this bookkeeping is shown below.
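The exact bookkeeping is handled internally by the ParamSweeper class described in the API section; the sketch below only illustrates the idea, with a hypothetical state file and run identifiers.

import json
import os

STATE_FILE = "run_state.json"  # hypothetical name for the persisted run state

def remaining_runs(all_run_prefixes):
    """Return the run prefixes that have not finished successfully yet (sketch)."""
    state = {}
    if os.path.isfile(STATE_FILE):
        with open(STATE_FILE) as fh:
            state = json.load(fh)
    return [run for run in all_run_prefixes if state.get(run) != "done"]

def mark_run(run_prefix, status):
    """Persist the outcome ("done" or "failed") of a single run (sketch)."""
    state = {}
    if os.path.isfile(STATE_FILE):
        with open(STATE_FILE) as fh:
            state = json.load(fh)
    state[run_prefix] = status
    with open(STATE_FILE, "w") as fh:
        json.dump(state, fh)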
Collection of results¶
The test results are saved to the --work_dir in JSON format. Typically, for iotest, the results can be found at $WORK_DIR/iotest/out. There are two sub-directories under this directory, named std-out and json-out. The std-out directory contains the standard output from the benchmark runs, whereas json-out holds all the information about the benchmark run and the relevant metrics parsed from the stdout files.
The typical schema of the JSON output file is:
{
"benchmark_name": <name of the benchmark test>,
"benchmark_args": {<arguments of the benchmark test>},
"batch_script_info": {<info about batch scheduler>},
"benchmark_metrics_info": {<metrics parsed from benchmark test>}
}
All the metadata about the benchmark experiment can be found in the JSON file, and the experiment can be reproduced from the data it contains.
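A result file can be inspected with a few lines of Python; the path and file name below are illustrative only.

import json

# Illustrative path; the actual file name depends on the run prefix
with open("work_dir/iotest/out/json-out/trial_run.json") as fh:
    result = json.load(fh)

print(result["benchmark_name"])
print(result["benchmark_args"])
for metric, value in result["benchmark_metrics_info"].items():
    print(metric, "=", value)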
API Documentation¶
The following sections provide the API documentation of different files of the benchmark suite.
SDP Benchmark Engine¶
This module runs the SDP benchmark codes
Imaging I/O test¶
This module contains the functions to run the Imaging IO benchmark.
- sdpbenchmarks.imagingiobench.check_iotest_arguments(conf)¶
  Checks the arguments passed in the config file
- sdpbenchmarks.imagingiobench.compile_imaging_iotest(conf)¶
  Compiles the Imaging IO test by cloning the code from Git
  Parameters: conf (dict) – A dict containing configuration.
  Returns: 0 OK, 1 Not OK
  Raises: ImagingIOTestError – An error occurred during compilation of the code
- sdpbenchmarks.imagingiobench.create_bench_conf(tag, run_mode, num_nodes, rep, rec_set, vis_set, chunk_sizes)¶
  Creates a dict containing the parameters for a given run
  Returns: A dict containing all the parameters
- sdpbenchmarks.imagingiobench.extract_metrics(filename, mpi_startup)¶
  Extracts data transfer metrics from the benchmark output
- sdpbenchmarks.imagingiobench.get_command_to_execute_bench(conf, param)¶
  Forms the command string to be executed
- sdpbenchmarks.imagingiobench.get_mpi_args(conf, num_nodes, num_omp_threads, num_processes)¶
  Extracts all MPI-specific arguments and forms a string
- sdpbenchmarks.imagingiobench.get_num_processes(conf, rec_set, num_nodes)¶
  Estimates the numbers of producers, streamers and OpenMP threads.
  Returns: total CPU cores (physical only) on all nodes combined, threads per core, number of OpenMP threads, number of producers, number of MPI processes
- sdpbenchmarks.imagingiobench.get_telescope_config_settings(param)¶
  Returns the telescope-related configurations
- sdpbenchmarks.imagingiobench.prepare_iotest(conf)¶
  Prepares the Imaging IO benchmark installation.
- sdpbenchmarks.imagingiobench.print_key_stats(run_prefix, metrics)¶
  Prints the key metrics to stdout
Utility Functions¶
This module contains the utility functions.
- class sdpbenchmarks.utils.ParamSweeper(persistence_dir, params=None, name=None, randomise=True)¶
  This class is inspired by the execo library (http://execo.gforge.inria.fr/doc/latest-stable/), except that it is a much simplified version of the original. The original is designed for large-scale experiments and thread safety; here we are only interested in tracking and remembering the state of each run when launching the experiments.
- sdpbenchmarks.utils.create_scheduler_conf(conf, param, bench_name)¶
  Prepares a dict with the parameters needed to create a job submission script
  Returns: A dict with the parameters needed to submit a job file
  Raises: KeyNotFoundError – An error occurred while looking for a key in conf or param
- sdpbenchmarks.utils.exec_cmd(cmd_str)¶
  Executes the given command
  Parameters: cmd_str (str) – Command to execute
  Returns: A subprocess.run output object with stdout, stderr and the return code
  Raises: ExecuteCommandError – An error occurred during execution of the command
- sdpbenchmarks.utils.execute_command_on_host(cmd_str, out_file)¶
  Executes the job on the host
  Raises: ExecuteCommandError – An error occurred during execution of the command
- sdpbenchmarks.utils.execute_job_submission(cmd_str, run_prefix)¶
  Submits to the SLURM job scheduler and returns the job ID
  Returns: ID of the submitted job; raises an exception in case of failure
  Raises: JobSubmissionError – An error occurred during the job submission
- sdpbenchmarks.utils.get_project_root()¶
  Gets the root directory of the project
  Returns: Full path of the root directory
- sdpbenchmarks.utils.get_sockets_cores(conf)¶
  Returns the number of sockets and cores on the compute nodes. For interactive runs, lscpu can be used to grep the info. When using the script to submit jobs from login nodes, lscpu cannot be used and sinfo for a given partition is used instead.
  Parameters: conf (dict) – A dict containing configuration settings
  Returns: Number of sockets on each node, number of physical cores on each node (num_sockets * num_cores per socket), number of threads inside each core
  Raises: KeyNotFoundError – An error occurred if a key is not found in the g5k dict that contains lscpu info for different clusters
- sdpbenchmarks.utils.load_modules(module_list)¶
  Purges the existing modules and loads the given modules
  Parameters: module_list (str) – List of modules to load
- sdpbenchmarks.utils.log_failed_cmd_stderr_file(output)¶
  Dumps the output to a file when command execution fails
- sdpbenchmarks.utils.pull_image(uri, container_mode, path)¶
  Pulls the image from the registry. Returns an error if the image cannot be pulled
- sdpbenchmarks.utils.reformat_long_string(ln_str, width=70)¶
  Reformats a command string by breaking it into multiple lines.
- sdpbenchmarks.utils.standardise_output_data(bench_name, conf, param, metrics)¶
  Saves all the data of the benchmark run in JSON format. The aim is to include all the info needed to reproduce the run.
  Raises: KeyNotFoundError – An error occurred while looking for a key in conf or param
- sdpbenchmarks.utils.sweep(parameters)¶
  Accepts a dict with the possible values for each parameter and creates a parameter space to sweep
- sdpbenchmarks.utils.which(cmd, modules)¶
  Loads the given modules and returns the path of the requested binary if found, or None
- sdpbenchmarks.utils.write_oar_job_file(conf_oar)¶
  Writes an OAR job file to submit with oarsub
  Parameters: conf_oar (dict) – A dict containing all OAR job parameters
  Returns: Name of the file
  Raises: JobScriptCreationError – An error occurred in the creation of the job script
- sdpbenchmarks.utils.write_slurm_job_file(conf_slurm)¶
  Writes a SLURM job file to submit with sbatch
  Parameters: conf_slurm (dict) – A dict containing all SLURM job parameters
  Returns: Name of the file
  Raises: JobScriptCreationError – An error occurred in the creation of the job script
- sdpbenchmarks.utils.write_tgcc_job_file(conf_slurm)¶
  Writes a SLURM job file for the TGCC Irene machine to submit with ccc_msub
  Parameters: conf_slurm (dict) – A dict containing all SLURM job parameters
  Returns: Name of the file
  Raises: JobScriptCreationError – An error occurred in the creation of the job script
Exceptions¶
This module contains the custom exceptions defined for the benchmark suite.
- exception sdpbenchmarks.exceptions.ImagingIOTestError¶
  Error in an Imaging IO test benchmark run
- exception sdpbenchmarks.exceptions.JobScriptCreationError¶
  Error in generating the job script to submit
SDP Performance Metric Monitoring Tool¶
The following sections provide the context, prerequisites and usage of the cpu metric monitoring toolkit.
About¶
The aim of this toolkit is to monitor CPU-related performance metrics for SDP pipelines/workflows in a standardised way. Different HPC clusters often have different ways to monitor and report performance-related metrics, so we would have to adapt our scripts to each machine to extract this data. This toolkit addresses this gap by providing an automatic and standardised way to collect and report performance metrics. As of now, the toolkit can collect both system-wide and job-related metrics during the job execution on all the nodes of a multi-node job, save them to disk (in JSON and Excel formats) and generate a job report with plots of the different metrics.
Idea¶
As submitting and controlling jobs on HPC machines is usually handled by batch schedulers, this toolkit is built around workload managers. Along with SLURM, one of the most commonly used batch schedulers in the HPC community, the toolkit can handle the PBS and OAR schedulers. SLURM's scontrol listpids command gives the process IDs (PIDs) of the different job steps, and OAR and PBS provide similar tools to capture the PIDs of jobs. Once the PID of the main job step is known, we can monitor different performance metrics using a combination of Python's psutil package, proc files and perf stat commands. The toolkit is developed in Python.
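As a rough illustration of the underlying mechanism (not the toolkit's actual code), the sketch below uses psutil to sample a process and its children once the PID of the main job step is known.

import time
import psutil

def sample_process_tree(pid, interval=30):
    """Periodically sample CPU and memory usage of a process and its children (sketch)."""
    parent = psutil.Process(pid)
    while parent.is_running():
        procs = [parent] + parent.children(recursive=True)
        cpu = sum(p.cpu_percent(interval=None) for p in procs)  # percent since last call
        rss = sum(p.memory_info().rss for p in procs)           # resident memory in bytes
        print("cpu={:.1f}% rss={:.2f} GB".format(cpu, rss / 1e9))
        time.sleep(interval)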
Available metrics¶
Currently, the toolkit reports the following data and metrics:
Hardware and software metadata of all the compute nodes in the reservation.
CPU related metrics like CPU usage, memory consumption, system-wide network I/O traffic, Infiniband traffic (if supported), meta data of the processes, etc.
perf events like hardware and software events, hardware cache events and different types of FLOP counts.
All these metrics are gathered and saved in JSON and/or excel formats for easy readability.
Prerequisites¶
The following prerequisites must be installed to use the monitoring toolkit:
python >= 3.7
git
Installation¶
Currently, the way to install this toolkit is to clone the git repository and then install it.
To set up the repository and get configuration files:
git clone https://gitlab.com/ska-telescope/platform-scripts
cd platform-scripts/ska-sdp-monitor-cpu-metrics
To install all the required python modules:
pip3 install --user -r requirements.txt
And finally, install the package using:
python3 setup.py install
Another way is to use the --editable option of pip as follows:
pip install "--editable=git+https://gitlab.com/ska-telescope/platform-scripts.git@master#egg=ska-sdp-monitor-metrics&subdirectory=ska-sdp-monitor-cpu-metrics"
This command clones the git repository and runs python3 setup.py install. This line can also be added directly to conda environment files.
Usage¶
As stated in the introduction, the toolkit currently works with SLURM and OAR job reservations (and, with a small hack described below, PBS). The main script to run the monitoring toolkit is sdpmonitormetrics. The launch script has the following options:
Monitor CPU and Performance metrics for SDP Workflows
optional arguments:
-h, --help show this help message and exit
-d [], --save_dir [] Directory where metrics will be saved. This directory should be available
from all compute nodes. Default is $PWD/job-metrics-$jobID.
-f [PREFIX], --prefix [PREFIX] Prefix to add to the exported files
-i [SAMPLING_FREQ], --sampling_freq [SAMPLING_FREQ] Sampling interval to collect metrics. Default value is 30 seconds.
-c [CHECK_POINT], --check_point [CHECK_POINT] Checking point time interval. Default value is 900 seconds.
-p, --perf_metrics Collect perf metrics. Default metrics that will be collected are SW/HW meta
data and CPU metrics.
-r, --gen_report Generate plots and job report
-e, --export_xlsx Export results to excel file with job ID as sheet name
-b, --export_db Export results to a SQL database
-v, --verbose Enable verbose mode. Display debug messages.
--version show program's version number and exit
------------------------------------------------------------
Arguments¶
The option --save_dir specifies the folder where the results are saved. It is important that this folder is accessible from all nodes in the SLURM reservation; typically, an NFS-mounted home directory can be used. The --sampling_freq option tells the toolkit how frequently it should poll to collect metrics. The default value is 30 seconds; the more often we collect the metrics, the more overhead the toolkit imposes on the system. By default, the toolkit only collects software/hardware metadata and CPU-related metrics. If we want perf stat metrics, we should pass the -p option on the CLI. The toolkit is capable of checkpointing the data, and the time period between checkpoints can be configured using the --check_point flag. If the user wants to generate a job report with plots of the different metrics, the -r option must be passed on the CLI.
We can also ask for the metric data in an Excel sheet by passing the --export_xlsx flag. This file is updated between different runs, with the job ID as the sheet name, so the Excel file collects all the results in one place and eases the process of plotting. Similarly, the --export_db flag tells the toolkit to export the metric data into a SQL database. The tables in the SQL database are named using the convention cpu_metrics_<job_id> for CPU metrics and perf_metrics_<job_id> for perf metrics, where <job_id> is the ID of the batch job.
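The Excel export can be reproduced in principle with pandas, assuming the metric lists have been flattened into a table; this is only a sketch of the sheet-per-job convention, not the toolkit's own export code.

import pandas as pd

def export_metrics_sheet(xlsx_path, job_id, metrics):
    """Write metric lists to a sheet named after the job ID (sketch).

    metrics is assumed to be a dict of equal-length lists,
    e.g. {"time_stamps": [...], "memory_percent": [...]}.
    """
    df = pd.DataFrame(metrics)
    # Note: this writes a fresh workbook; the toolkit updates an existing one between runs.
    with pd.ExcelWriter(xlsx_path, engine="openpyxl") as writer:
        df.to_excel(writer, sheet_name=str(job_id), index=False)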
The toolkit runs in silent mode, where all the stdout is logged to a log file so as not to interfere with the stdout of the main job step. Typically, the log file can be found at $SAVE_DIR/ska_sdp_monitoring_metrics.log.
Monitored metrics¶
Software and hardware metadata¶
Currently, the toolkit reports the software versions of Docker, Singularity, Python, OpenMPI and the OS. For the hardware metadata, we parse the output of the Linux command lscpu to report several pieces of information. In addition, information about the system memory is also reported.
Perf stat metrics¶
Perf stat metrics are monitored by executing
perf stat -e <event_list> -p <process_pids> sleep <collection_time>
Currently only Broadwell and Skylake chips are supported. More Intel micro architectures, and also AMD ones, will be added to the toolkit. Note that the supported perf events differ between micro architectures, so not all the listed events might be available in all cases.
Hardware events:
cycles
instructions
cache-misses
cache-references
branches
branch-misses
Software events:
context-switches
cpu-migrations
Caches:
L2 cache bandwidth
L3 cache bandwidth
FLOPS:
Single precision FLOPS
Double precision FLOPS
Hardware and software events are generic named perf events in perf stat and are available on both Intel and AMD chips. The cache bandwidths and FLOPS have processor-specific event codes. These events are taken from the likwid project, and most of them are claimed by the project maintainers to have been tested on different processors.
Note
Along with the raw counter numbers, derived counters are also provided in the metric data. FLOPS are provided in MFLOPS/second, whereas bandwidths are provided in MB/s.
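As an illustration of how a derived figure follows from a raw counter, assume a counter that reports 256-bit packed single-precision instructions (8 floats per instruction) sampled over a known interval; the counter width and the numbers are assumptions for the example, not the toolkit's exact events.

def mflops_per_second(packed_256b_sp_count, elapsed_seconds):
    """Derive MFLOPS/s from a 256-bit packed single-precision counter (sketch).

    Each 256-bit packed single-precision instruction operates on 8 floats,
    so the FLOP count is 8 times the raw counter value.
    """
    flops = 8 * packed_256b_sp_count
    return flops / elapsed_seconds / 1e6

# Example: 2e9 packed instructions over a 30 second sampling window -> ~533 MFLOPS/s
print(mflops_per_second(2e9, 30.0))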
Example use cases¶
A typical use case is shown below:
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH -J sdp-metrics-test
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --no-requeue
#SBATCH --exclusive
#SBATCH --output="slurm-%J.out"
WORK_DIR=/path/to/matmul/executable
MON_DIR=/path/to/ska-sdp-monitor-cpu-metrics
# Make sure we have the right working directory
cd $WORK_DIR
echo -e "JobID: $SLURM_JOB_ID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"
srun -n $SLURM_JOB_NUM_NODES --ntasks-per-node 1 sdpmonitormetrics &
# mpirun --map-by node -np $SLURM_JOB_NUM_NODES sdpmonitormetrics &
mpirun -np ${SLURM_JOB_NUM_NODES} ./matmul 2000
wait
This simple SLURM script reserves two nodes and runs a matrix multiplication using mpirun. Looking at the line immediately preceding mpirun, we notice that the sdpmonitormetrics script is launched with srun as a background process. srun starts the sdpmonitormetrics script on all nodes in the reservation, where it runs in the background. The first thing the script does is get the process PID of the main step job (in this case mpirun -np ${SLURM_JOB_NUM_NODES} ./matmul 2000), and it then collects the metrics for this process and its children. Once the process is terminated, the script does some post-processing to merge all the results, make plots and generate a report. It is important to have a wait command after the main job, otherwise the toolkit script will not be able to do the post-processing and save the results. The main step job can be launched with either mpirun or srun, and similarly the toolkit can be launched with either of them.
Sometimes processes will not tear down cleanly even after the main job has finished. For example, this can happen when dask is used as the parallelisation framework and the scheduler is not stopped after the main job. The toolkit monitors the process ID of the main job and keeps monitoring until that process is killed, so in this situation it would keep monitoring until the end of the reservation time. To avoid this issue, we can use a simple file as an inter-process communication (IPC) mechanism. After the main job, we add a line echo "FINISHED" > .ipc-$SLURM_JOB_ID; the toolkit keeps reading this .ipc-$SLURM_JOB_ID file and, once it reads FINISHED, it stops monitoring. This is a simple and portable solution to this kind of problem. However, because we also add a wait command, the SLURM job would otherwise wait until the end of the reservation period; to avoid this, we wait exclusively on the monitor job by capturing its PID.
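On the toolkit side, the IPC check amounts to a simple polling loop; a minimal sketch of the idea is shown below (the polling interval is an assumption).

import os
import time

def ipc_finished(ipc_file):
    """Return True once the IPC file contains the FINISHED marker (sketch)."""
    if not os.path.isfile(ipc_file):
        return False
    with open(ipc_file) as fh:
        return "FINISHED" in fh.read()

ipc_file = ".ipc-" + os.environ.get("SLURM_JOB_ID", "")
while not ipc_finished(ipc_file):
    time.sleep(5)  # keep collecting metrics between checks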
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH -J sdp-metrics-test
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --no-requeue
#SBATCH --exclusive
#SBATCH --output="slurm-%J.out"
WORK_DIR=/path/to/matmul/executable
MON_DIR=/path/to/ska-sdp-monitor-cpu-metrics
# Make sure we have the right working directory
cd $WORK_DIR
echo -e "JobID: $SLURM_JOB_ID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"
srun -n $SLURM_JOB_NUM_NODES --ntasks-per-node 1 sdpmonitormetrics &
export MON_PID=$!
# mpirun --map-by node -np $SLURM_JOB_NUM_NODES sdpmonitormetrics &
mpirun -np ${SLURM_JOB_NUM_NODES} ./matmul 2000
echo "FINISHED" > .ipc-$SLURM_JOB_ID
wait $MON_PID
The following sample script shows how to use the toolkit for dask jobs.
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH -J sdp-metrics-test
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --mail-type=FAIL
#SBATCH --no-requeue
#SBATCH --exclusive
#SBATCH --output="slurm-%J.out"
MON_DIR=/path/to/ska-sdp-monitor-cpu-metrics
SCHEFILE=$PWD/${SLURM_JOB_ID}.dasksche.json
WORKSPACE=$PWD/dask-worker-space
rm -rf $SCHEFILE
rm -rf $WORKSPACE
export DASK_SCHEDULER_FILE="$SCHEFILE"
#! Set up python
echo -e "Running python: `which python`"
echo -e "Running dask-scheduler: `which dask-scheduler`"
cd $SLURM_SUBMIT_DIR
echo -e "Changed directory to `pwd`.\n"
JOBID=${SLURM_JOB_ID}
echo ${SLURM_JOB_NODELIST}
scheduler=$(scontrol show hostnames $SLURM_JOB_NODELIST | uniq | head -n1)
echo "run dask-scheduler"
ssh ${scheduler} python3 `which dask-scheduler` --port=8786 --scheduler-file=$SCHEFILE &
sleep 5
echo "Monitoring script"
srun -n $SLURM_JOB_NUM_NODES --ntasks-per-node 1 python3 sdpmonitormetrics &
export MON_PID=$!
echo "run dask-worker"
srun -n ${SLURM_JOB_NUM_NODES} python3 `which dask-worker` --nanny --nprocs 4 --interface ib0 --nthreads 1\
--memory-limit 200GB --scheduler-file=$SCHEFILE ${scheduler}:8786 &
echo "Scheduler and workers now running"
#! We need to tell dask Client (inside python) where the scheduler is running
echo "Scheduler is running at ${scheduler}"
CMD="python3 src/cluster_dask_test.py ${scheduler}:8786 | tee cluster_dask_test.log"
eval $CMD
echo "FINISHED" > .ipc-$SLURM_JOB_ID
wait $MON_PID
The above script monitors the dask workers. Note that the dask workers and the scheduler should be torn down cleanly for this approach to work; if not, use the approach shown in the previous example and wait for the monitor job by capturing its PID.
These scripts, along with the source file for the matrix multiplication, are available in the repository for testing purposes in the ska-sdp-monitor-cpu-metrics/tests folder.
In the case of PBS jobs, a little hack is needed for the toolkit to work. We have not tested the toolkit on a production-ready PBS cluster. From local tests, it was found that the environment variable PBS_NODEFILE is only available on the first node in the reservation, but we need this file to be accessible from all nodes for the toolkit to work properly. So, the hack is to copy this nodefile to the local directory (which is often an NFS-mounted home directory accessible from all nodes), set a new environment variable called PBS_NODEFILE_LOCAL and export it to all nodes. The toolkit then looks for this variable and reads the node list from the file it points to. This can be done in the following way:
#!/bin/bash
#PBS -N metrics-test
#PBS -V
#PBS -j oe
#PBS -k eod
#PBS -q workq
#PBS -l walltime=01:00:00
#PBS -l select=2:ncpus=6:mpiprocs=12
cd /home/pbsuser
# We need to copy the nodefile to CWD as it is not available from all compute nodes in the reservation
cp $PBS_NODEFILE nodefile
# Later we export a 'new' env variable PBS_NODEFILE_LOCAL using mpirun to the location of copied local nodefile
mpirun --map-by node -np 2 -x PBS_NODEFILE_LOCAL=$PWD/nodefile sdpmonitormetrics -i 5 -v -r -e &
sleep 2
mpirun --map-by node -np 2 ./matmul 1500
wait
Output files¶
Upon successful completion of the job and the monitoring task, we will find the following files inside the metrics directory created by the toolkit.
job-metrics-{job-id}
├── data
│ ├── cpu_metrics.json
│ ├── meta_data.json
│ └── perf_metrics.json
├── job-report-{job-id}.pdf
└── plots
├── bytes_recv_per_node.png
├── bytes_recv_total.png
├── bytes_sent_per_node.png
├── bytes_sent_total.png
├── core_power_per_node.png
├── core_power_total.png
├── cpu_percent_sys_average.png
├── cpu_percent_sys_per_node.png
├── dram_power_per_node.png
├── dram_power_total.png
├── memory_bw_average.png
├── memory_bw_per_node.png
├── package_power_per_node.png
├── package_power_total.png
├── packets_recv_per_node.png
├── packets_recv_total.png
├── packets_sent_per_node.png
├── packets_sent_total.png
├── port_rcv_data_per_node.png
├── port_rcv_data_total.png
├── port_rcv_packets_per_node.png
├── port_rcv_packets_total.png
├── port_xmit_data_per_node.png
├── port_xmit_data_total.png
├── port_xmit_packets_per_node.png
├── port_xmit_packets_total.png
├── uncore_power_per_node.png
├── uncore_power_total.png
├── uss_average.png
└── uss_per_node.png
Typically, job-id is the job ID of the SLURM job, node-0-hostname is the hostname of the first node in the reservation, and so on. The JSON files meta_data.json and cpu_metrics.json contain the metric data from all the hosts. The folder raw_node contains the same metrics but for each node separately. All the generated plots of the metrics are placed in the plots folder. Finally, a job report job-report-{job-id}.pdf is generated with all the plots included. If the export-to-Excel option is requested, Excel files are also generated and placed in the save directory.
The schema for the cpu_metrics.json file is shown as follows:
{
"type": "object",
"required": [],
"properties": {
"host_names": {
"type": "array",
"items": {
"type": "string"
}
},
"node-0-hostname": {
"type": "object",
"required": [],
"properties": {
"child_proc_md": {
"type": "array",
"items": {
"type": "string"
}
},
"cpu_percent": {
"type": "array",
"items": {
"type": "number"
}
},
"cpu_percent_sys": {
"type": "array",
"items": {
"type": "number"
}
},
"cpu_time": {
"type": "array",
"items": {
"type": "number"
}
},
"ib_io_counters": {
"type": "object",
"required": [],
"properties": {
"port_rcv_data": {
"type": "array",
"items": {
"type": "number"
}
},
"port_rcv_packets": {
"type": "array",
"items": {
"type": "number"
}
},
"port_xmit_data": {
"type": "array",
"items": {
"type": "number"
}
},
"port_xmit_packets": {
"type": "array",
"items": {
"type": "number"
}
}
}
},
"io_counters": {
"type": "object",
"required": [],
"properties": {
"read_bytes": {
"type": "array",
"items": {
"type": "number"
}
},
"read_count": {
"type": "array",
"items": {
"type": "number"
}
},
"write_bytes": {
"type": "array",
"items": {
"type": "number"
}
},
"write_count": {
"type": "array",
"items": {
"type": "number"
}
}
}
},
"memory_full_info": {
"type": "object",
"required": [],
"properties": {
"swap": {
"type": "array",
"items": {
"type": "string"
}
},
"uss": {
"type": "array",
"items": {
"type": "number"
}
}
}
},
"memory_info": {
"type": "object",
"required": [],
"properties": {
"rss": {
"type": "array",
"items": {
"type": "number"
}
},
"shared": {
"type": "array",
"items": {
"type": "number"
}
},
"vms": {
"type": "array",
"items": {
"type": "number"
}
}
}
},
"memory_percent": {
"type": "array",
"items": {
"type": "number"
}
},
"net_io_counters": {
"type": "object",
"required": [],
"properties": {
"bytes_recv": {
"type": "array",
"items": {
"type": "number"
}
},
"bytes_sent": {
"type": "array",
"items": {
"type": "number"
}
},
"packets_recv": {
"type": "array",
"items": {
"type": "number"
}
},
"packets_sent": {
"type": "array",
"items": {
"type": "number"
}
}
}
},
"num_fds": {
"type": "array",
"items": {
"type": "number"
}
},
"num_threads": {
"type": "array",
"items": {
"type": "number"
}
},
"parent_proc_md": {
"type": "object",
"required": [],
"properties": {}
},
"rapl_powercap": {
"type": "object",
"required": [],
"properties": {
"core-0": {
"type": "array",
"items": {
"type": "number"
}
},
"uncore-0": {
"type": "array",
"items": {
"type": "number"
}
},
"dram-0": {
"type": "array",
"items": {
"type": "number"
}
},
"package-0": {
"type": "array",
"items": {
"type": "number"
}
}
}
},
"time_stamps": {
"type": "array",
"items": {
"type": "number"
}
}
}
},
"sampling_frequency": {
"type": "number"
}
}
where the field host_names contains the names of all the nodes in the SLURM reservation. The CPU metric data is organised for each host separately, where the data for the field node-0-hostname corresponds to the data for node-0 in the reservation, and so on. The perf metrics data is organised in a similar way.
For example, if we want to inspect the memory consumption (in percent) on, say, the example-host-0 node, we can query it simply as cpu_metrics['example-host-0']['memory_percent'] in Python. This gives us a list of values, one for each timestamp in cpu_metrics['example-host-0']['time_stamps']. Note that the timestamps for different hosts are saved separately, as there can be synchronisation issues between different nodes in the cluster. It is also worth noting that integer timestamps are used, so monitoring with a sampling period of less than a second is not possible.
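For example, the memory consumption series mentioned above can be loaded and plotted with a few lines of Python; the file path is illustrative.

import json
import matplotlib.pyplot as plt

with open("job-metrics-123456/data/cpu_metrics.json") as fh:  # illustrative path
    cpu_metrics = json.load(fh)

host = cpu_metrics["host_names"][0]
plt.plot(cpu_metrics[host]["time_stamps"], cpu_metrics[host]["memory_percent"])
plt.xlabel("timestamp (s)")
plt.ylabel("memory consumption (%)")
plt.savefig("memory_percent.png")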
API Documentation¶
The following sections provide the API documentation of the different files of the monitoring toolkit.
Set up monitoring¶
This file performs the pre-processing steps of the metric data collection
SW and HW metadata¶
This file contains the class to extract software and hardware metadata
Monitor metrics¶
This file initiates the script to extract real time perf metrics
CPU metrics¶
This file initiates the script to extract real time cpu metrics
- class monitormetrics.cpumetrics.cpumetrics.MonitorCPUMetrics(config)¶
  Engine to monitor CPU-related metrics
  - check_availability_ib_rapl_membw()¶
    Checks if Infiniband and RAPL metrics are available
  - check_metric_data()¶
    Checks if all the metric data is consistent with the number of timestamps
  - get_cpu_time_from_parent_and_childs()¶
    Gets the cumulative CPU time from the parent and its children
  - get_cumulative_metric_value(metric_type)¶
    Gets the cumulative metric, accounting for all children, for the given metric type
Perf metrics¶
This file initiates the script to extract real time perf metrics
- class monitormetrics.cpumetrics.perfmetrics.MonitorPerfEventsMetrics(config)¶
  Engine to extract performance metrics
  - compute_derived_metrics()¶
    Computes all the derived metrics from the parsed perf counters
  - get_list_of_pids()¶
    Gets the list of PIDs to monitor by adding children PIDs to the parents
  - initialise_perf_metrics_data_dict()¶
    Initialises the perf metric related parameters
  - static match_perf_line(pattern, cmd_out)¶
    Builds the perf output pattern and gets the matching groups
  - parse_perf_cmd_out(cmd_out)¶
    Parses the perf command output and populates the perf data dict with counter values
Post monitoring¶
Utility Functions¶
This module contains utility functions for gathering CPU metrics
- class monitormetrics.utils.utils.FileLock(protected_file_path, timeout=None, delay=1, lock_file_contents=None)¶
  A file locking mechanism with context-manager support so it can be used in a with statement. It should be relatively cross compatible as it does not rely on msvcrt or fcntl for the locking.
  - acquire(blocking=True)¶
    Acquires the lock, if possible. If the lock is in use and blocking is False, returns False. Otherwise, checks again every self.delay seconds until it either gets the lock or exceeds timeout seconds, in which case it raises an exception.
- class monitormetrics.utils.utils.PDF(config)¶
  Custom PDF class that inherits from FPDF; defines the footer of the PDF
- monitormetrics.utils.utils.check_perf_events(perf_events)¶
  Checks if all perf groups are actually working. Only the working counters are probed during monitoring
- monitormetrics.utils.utils.dump_json(content, filename)¶
  Appends data to existing JSON content. Creates a new file if no existing file is found.
- monitormetrics.utils.utils.execute_cmd(cmd_str, handle_exception=True)¶
  Accepts a command string and returns its output.
  Returns: Output of the command. If command execution fails, returns 'not_available'
  Raises: subprocess.CalledProcessError – An error occurred in the execution of the command; raised only if handle_exception is set to False
- monitormetrics.utils.utils.execute_cmd_pipe(cmd_str)¶
  Accepts a command string, executes it using a pipe and returns the process object.
- monitormetrics.utils.utils.find_procs_by_name(name)¶
  Returns a list of processes matching 'name'
- monitormetrics.utils.utils.get_cpu_model_family(cpu)¶
  Gives the CPU model and family IDs from the CPUID instruction
- monitormetrics.utils.utils.get_cpu_model_names_for_non_x86()¶
  Tries to extract the vendor, model and CPU architecture for non-x86 machines like IBM POWER and ARM
  Returns: Name of the vendor, model name/number of the processor, micro architecture of the processor
- monitormetrics.utils.utils.get_cpu_vendor(cpu)¶
  Gets the vendor name from the CPUID instruction
- monitormetrics.utils.utils.get_cpu_vendor_model_family()¶
  Gets the name of the CPU vendor, family and model parsed from the CPUID instruction
  Returns: Name of the vendor, CPU family and model ID
- monitormetrics.utils.utils.get_mem_bw_event()¶
  Returns the perf event to get memory bandwidth
  Returns: A string to get memory bandwidth for the perf stat command
- monitormetrics.utils.utils.get_perf_events()¶
  Checks the micro architecture type and returns the available perf events. Raises an exception if the micro architecture is not implemented
  Returns: Perf events with event names, and the derived perf metrics from the event counters
  Raises: PerfEventsNotFoundError – An error occurred while looking for perf events
- monitormetrics.utils.utils.get_rapl_devices()¶
  Gets all the package, core, uncore and dram devices available within the RAPL powercap interface
  Returns: A dict with package names and paths
- monitormetrics.utils.utils.get_value(input_dict, target)¶
  Finds the value for a given target key in a dict
- monitormetrics.utils.utils.ibstat_ports()¶
  Returns the Infiniband ports if present
  Returns: A dict with IB port names and numbers
- monitormetrics.utils.utils.load_json(filename)¶
  Loads a JSON file and returns a dict
- monitormetrics.utils.utils.merge_dicts(exst_dict, new_dict)¶
  Merges two dicts; exst_dict is updated with data from new_dict
- monitormetrics.utils.utils.proc_if_running(procs)¶
  Checks if all processes are running and returns False if all of them are terminated
Processor specific data¶
This file contains processor specific information like model names, families and perf events
- monitormetrics.utils.processorspecific.cpu_micro_architecture_name(vendor_name, model_id, family_id)¶
  Gives the name of the micro architecture based on the CPU model and family IDs
  Parameters: vendor_name (str) – Name of the vendor; model_id – Model ID of the CPU; family_id – Family ID of the CPU
  Returns: Name of the micro architecture
  Raises: ProcessorVendorNotFoundError – An error occurred while looking for the processor vendor. KeyNotFoundError – An error occurred while looking for the micro architecture
- monitormetrics.utils.processorspecific.llc_cache_miss_perf_event(processor_vendor, micro_architecture)¶
  Gives the event code and umask for the LLC cache miss event for different architectures
  Returns: String containing the event code and umask
  Raises: ProcessorVendorNotFoundError – An error occurred while looking for the processor vendor.
Exceptions¶
This file contains the custom exceptions defined for monitoring tools.
- exception monitormetrics.utils.exceptions.BatchSchedulerNotFound¶
  Batch scheduler not implemented or not recognised
- exception monitormetrics.utils.exceptions.CommandExecutionFailed¶
  Command execution exception