Overview

The aim of this toolkit is to monitor CPU-related performance metrics for SDP pipelines/workflows in a standardised way. Different HPC clusters often have different ways of monitoring and reporting performance metrics, so scripts must be adapted to each machine to extract this data. This toolkit addresses that gap by providing an automatic, standardised way to collect and report performance metrics. Currently, the toolkit can collect both system-wide and job-specific metrics on all the nodes of a multi-node job during its execution, export the data in various formats, and generate a job report with plots of the different metrics.

Design

As submitting and controlling jobs on HPC machines is typically handled by batch schedulers, this toolkit is built around workload managers. Along with SLURM, one of the most commonly used batch schedulers in the HPC community, the toolkit supports the PBS and OAR schedulers. SLURM’s scontrol listpids command gives the process IDs (PIDs) of the different job steps; similarly, OAR and PBS provide tools to capture the PIDs of jobs. Given the PID of the main step of the job, the toolkit monitors different performance metrics using a combination of Python’s psutil package, proc files and perf stat commands. The toolkit is developed in Python.
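As a minimal sketch of this approach, the helper below (a hypothetical function, not the toolkit's actual API) uses psutil to sample the CPU and memory usage of a job's process tree, assuming the PID of the main job step has already been obtained from the scheduler:

```python
import os

import psutil


def sample_process_metrics(pid):
    # The PID would normally come from the scheduler, e.g. SLURM's
    # "scontrol listpids"; here it is simply passed in.
    main = psutil.Process(pid)
    procs = [main] + main.children(recursive=True)
    cpu_percent = 0.0
    rss_bytes = 0
    for p in procs:
        try:
            # cpu_percent() measures utilisation over a short interval;
            # memory_info().rss is the resident set size in bytes.
            cpu_percent += p.cpu_percent(interval=0.1)
            rss_bytes += p.memory_info().rss
        except psutil.NoSuchProcess:
            continue  # process exited between enumeration and sampling
    return {"cpu_percent": cpu_percent, "memory_rss_bytes": rss_bytes}


# Sample the current process as a stand-in for a job step.
print(sample_process_metrics(os.getpid()))
```

A real collector would call such a function periodically on every node of the job and timestamp each sample.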

Besides several CPU-related metrics, the toolkit reports performance metrics for NVIDIA GPUs. The Python bindings of the NVIDIA Management Library (NVML) are used to monitor these metrics.
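A sketch of per-GPU sampling via the NVML Python bindings (the pynvml package) is shown below. The function name is illustrative, not the toolkit's; the NVML calls themselves come from the official bindings. It returns None when the bindings or driver are unavailable, so CPU-only nodes are handled gracefully:

```python
def sample_gpu_metrics():
    """Return a list of per-GPU metric dicts, or None if NVML is unavailable."""
    try:
        import pynvml
    except ImportError:
        return None  # bindings not installed, e.g. on a CPU-only node
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # no NVIDIA driver available on this node
    metrics = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # milliwatts
        metrics.append({
            "gpu": i,
            "sm_util_percent": util.gpu,
            "mem_used_bytes": mem.used,
            "power_watts": power_mw / 1000.0,
        })
    pynvml.nvmlShutdown()
    return metrics


print(sample_gpu_metrics())
```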

Available metrics

Currently, the toolkit reports the following metrics:

  • Hardware metadata of all the compute nodes in the reservation.

  • CPU-related metrics such as CPU usage, memory consumption, system-wide network I/O traffic, InfiniBand traffic (where supported), etc.

  • perf events such as hardware and software events, hardware cache events, and different types of FLOP counts.

  • NVIDIA GPU performance metrics.
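To illustrate the proc-file side of the metrics above, the sketch below (a hypothetical helper, not the toolkit's actual parser) reads system-wide network I/O counters from /proc/net/dev, which is one of the places such data can be collected on Linux:

```python
def read_network_io(path="/proc/net/dev"):
    """Parse per-interface receive/transmit byte counters from /proc/net/dev."""
    counters = {}
    with open(path) as f:
        lines = f.readlines()[2:]  # skip the two header lines
    for line in lines:
        iface, data = line.split(":", 1)
        fields = data.split()
        # Fields 0-7 are receive counters (bytes first); transmit
        # counters start at field 8 (bytes first again).
        counters[iface.strip()] = {
            "rx_bytes": int(fields[0]),
            "tx_bytes": int(fields[8]),
        }
    return counters
```

Sampling such counters at regular intervals and differencing consecutive samples yields traffic rates; an equivalent approach works for InfiniBand port counters under /sys.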

All these metrics are gathered and exported in different formats, including JSON, CSV and HDF5 tables.
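As a sketch of the export step, the snippet below writes a list of metric samples to JSON and CSV using the standard library (the function name and field names are illustrative, not the toolkit's actual schema; an HDF5 export would additionally use a package such as PyTables or h5py):

```python
import csv
import json


def export_metrics(samples, basename):
    """Write a list of metric samples (dicts with identical keys) to
    <basename>.json and <basename>.csv."""
    with open(basename + ".json", "w") as f:
        json.dump(samples, f, indent=2)
    with open(basename + ".csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(samples[0]))
        writer.writeheader()
        writer.writerows(samples)


# Illustrative samples, not real measurements.
samples = [
    {"timestamp": 0.0, "cpu_percent": 12.5, "memory_rss_bytes": 104857600},
    {"timestamp": 1.0, "cpu_percent": 48.0, "memory_rss_bytes": 209715200},
]
export_metrics(samples, "job_metrics")
```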