Usage

As stated in the introduction, the toolkit currently works with SLURM, PBS and OAR job reservations. The main script to run the toolkit is perfmon. The launch script has the following options:

usage: perfmon [-h] [-d [SAVE_DIR]] [-p [PREFIX]] [-l [LAUNCHER]] [-i SAMPLING_FREQ] [-c [CHECK_POINT]] [-r] [-e {csv,hdf5,parquet,pickle,feather,orc} [{csv,hdf5,parquet,pickle,feather,orc} ...]] [--system [SYSTEM]] [--partition [PARTITION]]
             [--env [ENV]] [--name [NAME]] [-v]

optional arguments:
-h, --help            show this help message and exit
-d [SAVE_DIR], --save_dir [SAVE_DIR]
                      Base directory where metrics will be saved. This directory should
                      be available from all compute nodes. Default is $PWD
-p [PREFIX], --prefix [PREFIX]
                      Name of the directory to be created to save metric data. If provided,
                      metrics will be located at $SAVE_DIR/$PREFIX.
-l [LAUNCHER], --launcher [LAUNCHER]
                      Launcher used to launch mpi tasks
-i SAMPLING_FREQ, --sampling_freq SAMPLING_FREQ
                      Sampling interval to collect metrics. Default value is 30 seconds
-c [CHECK_POINT], --check_point [CHECK_POINT]
                      Checking point time interval. Default value is 900 seconds
-r, --gen_report      Generate plots and job report
-e {csv,hdf5,parquet,pickle,feather,orc} [{csv,hdf5,parquet,pickle,feather,orc} ...], --export {csv,hdf5,parquet,pickle,feather,orc} [{csv,hdf5,parquet,pickle,feather,orc} ...]
                      Export results to different file formats
--system [SYSTEM]     Name of the system (only when used with SDP Benchmark tests)
--partition [PARTITION]
                      Name of the partition (only when used with SDP Benchmark tests)
--env [ENV]           Name of the environment (only when used with SDP Benchmark tests)
--name [NAME]         Name of the test (only when used with SDP Benchmark tests)
-v, --verbose         Enable verbose mode. Display debug messages

Arguments

  • The option --save_dir specifies the folder where results are saved. It is important that this folder is accessible from all nodes in the reservation. Typically, an NFS-mounted home directory can be used for this purpose. The --prefix option can be used to create a sub-directory within $SAVE_DIR where the actual metrics will be saved. For instance, the job ID can be used as the prefix so that metrics from multiple jobs are located under the same $SAVE_DIR. If --prefix is used, the metrics can be found at $SAVE_DIR/$PREFIX. If no prefix is passed, the toolkit places all metrics directly under $SAVE_DIR.

  • The --launcher option can be used to specify which MPI wrapper is used to launch parallel jobs. If not specified, the toolkit will try to identify the launcher automatically.

  • The --sampling_freq option tells the toolkit how frequently it should poll to collect metrics. The default value is 30 seconds. The more often metrics are collected, the more overhead the toolkit imposes on the system. By default, the toolkit only collects hardware metadata, CPU-related metrics and perf stat metrics.

  • The toolkit is capable of checkpointing the data, and the time period between checkpoints can be configured using the --check_point flag.

  • If the user wants to generate a job report with plots of the different metrics, the -r or --gen_report option must be passed on the CLI.

  • Metric data can also be exported in different formats using the --export flag. Currently, the toolkit can export data in CSV, pickle, HDF5, Parquet, Feather and ORC formats.

The toolkit runs in silent mode: all of its stdout is logged to a log file so as not to interfere with the stdout of the main job step. Typically, the log file can be found at $SAVE_DIR/ska_sdp_monitoring_metrics.log.

GPU metrics

The toolkit automatically detects the presence of NVIDIA GPUs and, if any are found, collects GPU metrics irrespective of whether the batch job actually uses the GPUs. To detect the GPUs, the nvidia-smi command is executed and its return status is checked.

Note

At the moment, only NVIDIA GPUs are supported. The toolkit will not be able to monitor metrics for other types of GPUs.
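
For illustration, the detection described above amounts to something like the following minimal Python sketch (an assumption about the approach, not the toolkit's actual code): run nvidia-smi and treat a zero return status as an indication that NVIDIA GPUs are present.

import shutil
import subprocess

def nvidia_gpus_present() -> bool:
    """Return True if nvidia-smi is on PATH and exits with status 0."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi"], capture_output=True)
    return result.returncode == 0

print(nvidia_gpus_present())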

Monitored metrics

Hardware metadata

Currently, for the hardware metadata, we parse the output of the Linux command lscpu to report several pieces of information. In addition, information about system memory is also reported. In the case of NVIDIA GPUs, several GPU-related details available from the NVML library are reported as well.

Perf stat metrics

Perf stat metrics are monitored by executing

perf stat -e <event_list> -p <process_pids> sleep <collection_time>

Currently, only Broadwell, Haswell, Skylake, Sandy Bridge, Zen and Zen3 chips are supported. More Intel microarchitectures, as well as AMD ones, will be added to the toolkit. Note that the supported perf events differ between microarchitectures, so not all of the listed events may be available in every case.

Hardware events:

  • cycles

  • instructions

  • cache-misses

  • cache-references

  • branches

  • branch-misses

Software events:

  • context-switches

  • cpu-migrations

Caches:

  • L2 cache bandwidth

  • L3 cache bandwidth

FLOPS:

  • Single precision FLOPS

  • Double precision FLOPS

Hardware and software events are named perf events in perf stat and are available on both Intel and AMD chips. The cache bandwidths and FLOPS use processor-specific event codes. These events are taken from the likwid project, and most of them are claimed by the project maintainers to have been tested on different processors.

Note

Along with the raw counter numbers, derived counters are also provided in the metric data. FLOPS are provided in MFLOPS, whereas bandwidths are provided in MB/s.
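
As an illustration of the general idea of deriving a rate from raw counts (a hedged sketch only; the toolkit's actual formulas use processor-specific event codes and may differ), a rate is obtained from the counter delta over one sampling interval:

def derived_rate(count_start, count_end, interval_s, scale=1e6):
    """Convert a raw event count accumulated over one sampling interval into a rate.

    With scale=1e6, a floating-point operation count yields MFLOPS and a
    byte count of cache traffic yields MB/s.
    """
    return (count_end - count_start) / interval_s / scale

# Example: 4.5e9 floating-point operations counted over a 30 s interval -> 150.0 MFLOPS
print(derived_rate(0, 4.5e9, 30.0))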

NVIDIA GPU metrics

As stated before, the NVML library is used to query several device metrics on NVIDIA GPU cards. The reported metrics are as follows:

  • Clock frequency info for Graphics, SM and memory

  • Error Correcting Code (ECC) counts for single- and double-bit errors

  • Usage of GPU and BAR1 memory

  • Device temperature, fan speed (if present), number of processes

  • PCI Express send and receive throughput

  • Power usage and GPU throttling time due to power and thermal constraints

  • GPU and memory utilization rates

All these metrics are reported for each GPU separately. A prefix of the form gpu-{num}, where {num} is the GPU device number, is added to the host name of each node to differentiate between GPUs.
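
As a purely hypothetical illustration of this naming scheme (the exact key layout in the metric files may differ), the per-GPU entries for a two-GPU node could be addressed like this:

# Hypothetical sketch: keys assumed to be of the form 'gpu-{num}-<hostname>'
hostname = "example-host-0"      # illustrative node name
gpu_keys = [f"gpu-{num}-{hostname}" for num in range(2)]
print(gpu_keys)                  # ['gpu-0-example-host-0', 'gpu-1-example-host-0']

# Each key would then hold the metric series (clock frequencies, memory usage,
# utilisation rates, ...) for that particular GPU.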

Example use cases

A typical use case is shown below:

#!/bin/bash

#SBATCH --time=00:30:00
#SBATCH -J sdp-metrics-test
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --no-requeue
#SBATCH --exclusive
#SBATCH --output="slurm-%J.out"

WORK_DIR=/path/to/matmul/executable

# Make sure we have the right working directory
cd $WORK_DIR

echo -e "JobID: $SLURM_JOB_ID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"

srun -n $SLURM_JOB_NUM_NODES --ntasks-per-node 1 perfmon &
# mpirun --map-by node -np $SLURM_JOB_NUM_NODES perfmon &
mpirun -np ${SLURM_JOB_NUM_NODES} ./matmul 2000

wait

This simple SLURM script reserves two nodes and runs a matrix multiplication using mpirun. Looking at the line immediately preceding mpirun, we can see that the perfmon script is launched with srun as a background process. srun starts perfmon on every node in the reservation, where it runs in the background. The first thing the script does is obtain the process PID of the main job step (in this case mpirun -np ${SLURM_JOB_NUM_NODES} ./matmul 2000), and it then collects metrics for this process and its children. Once the process terminates, the script does some post-processing to merge all the results, make plots and generate the report. It is important to have a wait command after the main job, otherwise the toolkit script won't be able to do the post-processing and save the results. The main job step can be launched with either mpirun or srun, and similarly the toolkit can be launched with either of them.

Sometimes processes do not tear down cleanly even after the main job has finished. For example, this can happen when dask is used as the parallelisation framework and the scheduler is not stopped after the main job. The toolkit monitors the process ID of the main job and keeps monitoring until that process is killed, so in this situation it would keep monitoring until the end of the reservation. To avoid this issue, we can use a simple file-based Inter Process Communication (IPC) mechanism: after the main job, we add a line echo "FINISHED" > .ipc-$SLURM_JOB_ID, and the toolkit keeps reading this .ipc-$SLURM_JOB_ID file and stops monitoring once it reads FINISHED. This is a very simple and portable solution to this kind of problem. Also, because we add a wait command, the SLURM job would otherwise wait until the end of the reservation period; to avoid that, we can wait exclusively on the monitoring job by capturing its PID.

#!/bin/bash

#SBATCH --time=00:30:00
#SBATCH -J sdp-metrics-test
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --no-requeue
#SBATCH --exclusive
#SBATCH --output="slurm-%J.out"

WORK_DIR=/path/to/matmul/executable
MON_DIR=/path/to/ska-sdp-monitor-cpu-metrics

# Make sure we have the right working directory
cd $WORK_DIR

echo -e "JobID: $SLURM_JOB_ID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"

srun -n $SLURM_JOB_NUM_NODES --ntasks-per-node 1 perfmon &
export MON_PID=$!
# mpirun --map-by node -np $SLURM_JOB_NUM_NODES perfmon &
mpirun -np ${SLURM_JOB_NUM_NODES} ./matmul 2000
echo "FINISHED" > .ipc-$SLURM_JOB_ID

wait $MON_PID

The following sample script shows how to use the toolkit with dask jobs.

#!/bin/bash

#SBATCH --time=00:30:00
#SBATCH -J sdp-metrics-test
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --mail-type=FAIL
#SBATCH --no-requeue
#SBATCH --exclusive
#SBATCH --output="slurm-%J.out"

MON_DIR=/path/to/ska-sdp-monitor-cpu-metrics

SCHEFILE=$PWD/${SLURM_JOB_ID}.dasksche.json
WORKSPACE=$PWD/dask-worker-space

rm -rf $SCHEFILE
rm -rf $WORKSPACE

export DASK_SCHEDULER_FILE="$SCHEFILE"

#! Set up python
echo -e "Running python: `which python`"
echo -e "Running dask-scheduler: `which dask-scheduler`"

cd $SLURM_SUBMIT_DIR
echo -e "Changed directory to `pwd`.\n"

JOBID=${SLURM_JOB_ID}
echo ${SLURM_JOB_NODELIST}

scheduler=$(scontrol show hostnames $SLURM_JOB_NODELIST | uniq | head -n1)

echo "run dask-scheduler"
ssh ${scheduler} python3 `which dask-scheduler` --port=8786 --scheduler-file=$SCHEFILE  &

sleep 5

echo "Monitoring script"
srun -n $SLURM_JOB_NUM_NODES --ntasks-per-node 1 perfmon &
export MON_PID=$!

echo "run dask-worker"
srun -n ${SLURM_JOB_NUM_NODES} python3 `which dask-worker` --nanny --nprocs 4 --interface ib0 --nthreads 1 \
     --memory-limit 200GB --scheduler-file=$SCHEFILE ${scheduler}:8786 &

echo "Scheduler and workers now running"

#! We need to tell dask Client (inside python) where the scheduler is running
echo "Scheduler is running at ${scheduler}"

CMD="python3 src/cluster_dask_test.py ${scheduler}:8786 | tee cluster_dask_test.log"

eval $CMD
echo "FINISHED" > .ipc-$SLURM_JOB_ID

wait $MON_PID

The above script monitors the dask workers. Note that the dask workers and the scheduler should be torn down cleanly for this approach to work. If they are not, use the approach shown in the previous example and wait for the monitoring job by capturing its PID.

These scripts, along with the source file for the matrix multiplication, are available for testing purposes in the ska-sdp-perfmon/tests folder of the repository.

In the case of PBS jobs, a little hack is needed for the toolkit to work. We have not tested the toolkit on a production-ready PBS cluster. From local tests, it was found that the environment variable PBS_NODEFILE is only available on the first node of the reservation. We need this file to be accessible from all nodes for the toolkit to work properly. So, the hack is to copy the nodefile to the working directory (often an NFS-mounted home directory that all nodes can access), set a new environment variable called PBS_NODEFILE_LOCAL and export it to all nodes. The toolkit then looks for this variable and reads the node list from the file it points to. This can be done in the following way:

#!/bin/bash

#PBS -N metrics-test
#PBS -V
#PBS -j oe
#PBS -k eod
#PBS -q workq
#PBS -l walltime=01:00:00
#PBS -l select=2:ncpus=6:mpiprocs=12

cd /home/pbsuser

# We need to copy the nodefile to CWD as it is not available from all compute nodes in the reservation
cp $PBS_NODEFILE nodefile

# Later we export a 'new' env variable PBS_NODEFILE_LOCAL using mpirun to the location of copied local nodefile
mpirun --map-by node -np 2 -x PBS_NODEFILE_LOCAL=$PWD/nodefile perfmon -i 5 -v -r -e &
sleep 2
mpirun --map-by node -np 2 ./matmul 1500

wait

Output files

Upon successful completion of the job and the monitoring task, and if the --gen_report flag is enabled, the following files can be found inside the metrics directory created by the toolkit. If --prefix is used to name the directory, all the files are placed under that directory; otherwise the toolkit places all the files in $SAVE_DIR. Typically, this folder contains the following sub-directories and files:

  • configs/: Contains the configuration files used by the toolkit on each node. These are dumped only when the -v flag is enabled.

  • metrics/: This folder contains all the metrics in JSON files.

  • plots/: All the generated plots in png format are placed in this folder

  • job-report-{job-id}.pdf: Job report with all the plots included

  • *.csv, *.h5, *.orc, *.parquet, *.feather, *.pkl: Exported files in the different formats, named after the metric.

The schema for the cpu_metrics.json file is shown as follows:

 {
"type": "object",
"required": [],
"properties": {
  "host_names": {
    "type": "array",
    "items": {
      "type": "string"
    }
  },
  "node-0-hostname": {
    "type": "object",
    "required": [],
    "properties": {
      "child_proc_md": {
        "type": "array",
        "items": {
          "type": "string"
        }
      },
      "cpu_percent": {
        "type": "array",
        "items": {
          "type": "number"
        }
      },
      "cpu_percent_sys": {
        "type": "array",
        "items": {
          "type": "number"
        }
      },
      "cpu_time": {
        "type": "array",
        "items": {
          "type": "number"
        }
      },
      "ib_io_counters": {
        "type": "object",
        "required": [],
        "properties": {
          "port_rcv_data": {
            "type": "array",
            "items": {
              "type": "number"
            }
          },
          "port_rcv_packets": {
            "type": "array",
            "items": {
              "type": "number"
            }
          },
          "port_xmit_data": {
            "type": "array",
            "items": {
              "type": "number"
            }
          },
          "port_xmit_packets": {
            "type": "array",
            "items": {
              "type": "number"
            }
          }
        }
      },
      "io_counters": {
        "type": "object",
        "required": [],
        "properties": {
          "read_bytes": {
            "type": "array",
            "items": {
              "type": "number"
            }
          },
          "read_count": {
            "type": "array",
            "items": {
              "type": "number"
            }
          },
          "write_bytes": {
            "type": "array",
            "items": {
              "type": "number"
            }
          },
          "write_count": {
            "type": "array",
            "items": {
              "type": "number"
            }
          }
        }
      },
      "memory_full_info": {
        "type": "object",
        "required": [],
        "properties": {
          "swap": {
            "type": "array",
            "items": {
              "type": "string"
            }
          },
          "uss": {
            "type": "array",
            "items": {
              "type": "number"
            }
          }
        }
      },
      "memory_info": {
        "type": "object",
        "required": [],
        "properties": {
          "rss": {
            "type": "array",
            "items": {
              "type": "number"
            }
          },
          "shared": {
            "type": "array",
            "items": {
              "type": "number"
            }
          },
          "vms": {
            "type": "array",
            "items": {
              "type": "number"
            }
          }
        }
      },
      "memory_percent": {
        "type": "array",
        "items": {
          "type": "number"
        }
      },
      "net_io_counters": {
        "type": "object",
        "required": [],
        "properties": {
          "bytes_recv": {
            "type": "array",
            "items": {
              "type": "number"
            }
          },
          "bytes_sent": {
            "type": "array",
            "items": {
              "type": "number"
            }
          },
          "packets_recv": {
            "type": "array",
            "items": {
              "type": "number"
            }
          },
          "packets_sent": {
            "type": "array",
            "items": {
              "type": "number"
            }
          }
        }
      },
      "num_fds": {
        "type": "array",
        "items": {
          "type": "number"
        }
      },
      "num_threads": {
        "type": "array",
        "items": {
          "type": "number"
        }
      },
      "parent_proc_md": {
        "type": "object",
        "required": [],
        "properties": {}
      },
      "rapl_powercap": {
        "type": "object",
        "required": [],
        "properties": {
          "core-0": {
            "type": "array",
            "items": {
              "type": "number"
            }
          },
          "uncore-0": {
            "type": "array",
            "items": {
              "type": "number"
            }
          },
          "dram-0": {
            "type": "array",
            "items": {
              "type": "number"
            }
          },
          "package-0": {
            "type": "array",
            "items": {
              "type": "number"
            }
          }
        }
      },
      "time_stamps": {
        "type": "array",
        "items": {
          "type": "number"
        }
      }
    }
  },
  "sampling_frequency": {
    "type": "number"
  }
}
}

where the field host_names contains the names of all the nodes in the SLURM reservation. The CPU metric data is organised per host: the field node-0-hostname corresponds to the data for node 0 in the reservation, and so on. The perf and GPU metric data are organised in a similar way.

For example, if we want to inspect the memory consumption (as a percentage) on, say, the example-host-0 node, we can query it simply as cpu_metrics['example-host-0']['memory_percent'] in Python. This gives us a list of values, one for each timestamp in cpu_metrics['example-host-0']['time_stamps']. Note that the timestamps for different hosts are saved separately, as there can be synchronisation issues between different nodes in the cluster. It is also worth noting that integer timestamps are used, so monitoring with a sampling interval of less than a second is not possible.
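
For instance, a small Python snippet along these lines can be used to read the values back (the metrics/cpu_metrics.json path assumes the layout described in the Output files section, and the host name is illustrative):

import json

# Load the collected CPU metrics written by the toolkit
with open("metrics/cpu_metrics.json") as f:
    cpu_metrics = json.load(f)

host = cpu_metrics["host_names"][0]               # e.g. "example-host-0"
memory_percent = cpu_metrics[host]["memory_percent"]
time_stamps = cpu_metrics[host]["time_stamps"]

# One memory reading per sampling timestamp on this host
for ts, mem in zip(time_stamps, memory_percent):
    print(ts, mem)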

Plotting data

The recommended way to plot metric data is to export the data to CSV or any other available format and load it into a Pandas dataframe. Examples of how to plot data using a Pandas dataframe can be found in the perfmon/common/plots folder.
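
A minimal sketch of this workflow is shown below; the file name and column names are assumptions based on the exported metric layout described above, so adjust them to the files actually produced by --export.

import pandas as pd
import matplotlib.pyplot as plt

# Load an exported metric file (exported files are named after the metric,
# e.g. cpu_metrics.csv when '-e csv' is used; the name here is illustrative)
df = pd.read_csv("cpu_metrics.csv")

# Plot memory consumption over time, assuming the exported table carries
# 'time_stamps' and 'memory_percent' columns for the host of interest
df.plot(x="time_stamps", y="memory_percent")
plt.xlabel("Timestamp")
plt.ylabel("Memory usage (%)")
plt.savefig("memory_percent.png")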